Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect and remove all duplicate pixels #38946

Merged
merged 3 commits into from Aug 4, 2022

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Aug 2, 2022

PR description:

Detect and remove all duplicate pixels, after unpacking each pixel module but before running the clustering.
Use shared memory for inter-thread communication and to speed up marking and detecting the duplicates.

PR validation:

Running the online pixel reconstruction and the full HLT menu on GPU over non-problematic events shows only a moderate slow down:

reconstruction pixel tracking full HLT
no duplicate removal 1566 ± 16 ev/s (--) 873 ± 4 ev/s (--)
duplicate removal with atomicOR (ce8a57b) 1530 ± 17 ev/s (-2.3%) 872 ± 4 ev/s (-0.2%)
duplicate removal with atomicCAS (2522012) 1519 ± 14 ev/s (-3.0%) 869 ± 2 ev/s (-0.4%)

If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 12.4.x for data taking (#38947).

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 2, 2022

type bugfix

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 2, 2022

enable gpu

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 2, 2022

please test

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 2, 2022

urgent

@cmsbuild cmsbuild added the urgent label Aug 2, 2022
@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 2, 2022

@VinInn @AdrianoDee FYI

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 2, 2022

+heterogeneous

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 2, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38946/31388

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 2, 2022

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

  • DataFormats/SiPixelDigi (simulation)
  • HeterogeneousCore/CUDAUtilities (heterogeneous)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)

@jpata, @civanch, @clacaputo, @mdhildreth can you please review it and eventually sign? Thanks.
@mtosi, @VourMa, @makortel, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@fwyzard fwyzard mentioned this pull request Aug 2, 2022
@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 2, 2022

Note: #37559 should be the equivalent implementation for the CPU-only reconstruction.

While the HLT can move forward without it (since we are running exclusively on GPUs), #37559 should be validated and backported to 12.4.x to keep the two implementation coherent.

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 3, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-d29d53/26624/summary.html
COMMIT: 2522012
CMSSW: CMSSW_12_5_X_2022-08-03-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/38946/26624/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • Reco comparison had 3 failed jobs
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19876
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19868
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: found differences in 2 / 3 workflows

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3691510
  • DQMHistoTests: Total failures: 13
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3691474
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 50 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

@civanch
Copy link
Contributor

civanch commented Aug 3, 2022

+1

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 4, 2022

+heterogeneous

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 4, 2022

@clacaputo @jpata as this is something we would like to deploy online sooner rather than later, could you let me know if you have any concerns about, if you think we should involve directly the DPG, etc. ?

@clacaputo
Copy link
Contributor

@clacaputo @jpata as this is something we would like to deploy online sooner rather than later, could you let me know if you have any concerns about, if you think we should involve directly the DPG, etc. ?

No concerns from my side, just busy with other stuff. I'm going to sign it

@clacaputo
Copy link
Contributor

+reconstruction

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 4, 2022

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 4, 2022

No concerns from my side, just busy with other stuff.

Sure, no problem!

I'm going to sign it

Thanks.

@qliphy
Copy link
Contributor

qliphy commented Aug 4, 2022

+1

@cmsbuild cmsbuild merged commit efa7f41 into cms-sw:master Aug 4, 2022
@fwyzard fwyzard deleted the gpu_duplicate_pixel_removal branch August 5, 2022 15:13
@fwyzard fwyzard restored the gpu_duplicate_pixel_removal branch August 5, 2022 15:14
@nothingface0
Copy link
Contributor

nothingface0 commented Aug 11, 2022

@fwyzard Could you explain why this way is more efficient than the one proposed by @VinInn here?

I can see that Vincenzo's way accesses the global memory (the x and y arrays) multiple times, and in the worst case those accesses are numElements words apart. Does this mean that there will probably be many cache misses?

Is your way more efficient due to the global memory access being sequential (even though it's done twice?)

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 11, 2022

hi @nothingface0,
the main difference is that this approach is using shared memory:

__shared__ uint32_t status[pixelStatusSize]; // packed words array used to store the PixelStatus of each pixel

Shared memory is local to the processor, like the L1 cache, so it is much faster than global memory.
The limitation is that there is only order of 48 - 64 kB available, so for example we cannot use the same approach for Phase-2, where the pixel modules are much larger.

@nothingface0
Copy link
Contributor

nothingface0 commented Aug 11, 2022

I totally understood the use of shared memory in your approach, but I'm still a bit confused, since, for example, Vincenzo's approach does not need an extra array for storing occurrences. So it's not like the shared memory is used for existing data but for extra data.

To make my question clearer:

Vincezo's approach: Each thread does a pairwise comparison of the coordinates of the pixel it is assigned, with every the coordinates of every other pixel, i.e. the first thread compares pixel 0 with 1, then 0 with 2.. up to 0 with msize-1, which are far apart in memory, meaning that this might lead to cache misses. Not to mention that this happens in parallel with every other thread trying to do its own comparisons. In general, this needs msize over 2 memory/cache accesses in total, whose pattern is not predictable.

Your approach: Use some extra (shared) memory to store occurrences of each pixel. To do that, you

  • Sweep once over the x and y arrays sequentially (i.e. predictable for the GPU) to count occurrences of each pixel (pixelStatus::promote(status, x[i], y[i]);)
  • Sweep a second time sequentially over the same arrays to check if they've been marked duplicates (if (pixelStatus::isDuplicate(status, x[i], y[i])))

This means that you access the global memory at most 2 x msize times predictably, meaning better use of the available cache.


TL;DR: Is this statement correct?:
Your approach is more efficient, not because of shared memory usage (since Vincenzo's method simply does not need an extra array), but because its global memory access patterns are more predictable and make better use of the GPU cache.

Sorry for insisting, I'm just trying to understand this statement:

to be clear. this solution is NOT computational-sustainable.

@fwyzard
Copy link
Contributor Author

fwyzard commented Aug 11, 2022

IIUC, the approach used by Vincenzo compares every pixel with every other pixel, so the complexity of the algorithm grows with the square of the number of pixels in a module, i.e. is O(N²).

The approach I used reads every pixel a fixed number of times, so the complexity grows linearly with the number of pixels in a module, i.e. is O(N).

The exact number of operations and memory accesses i not exactly N² or N, but those are the order of the leading terms in the two cases..

My approach does make use of an extra memory buffer, which normally would add a large cost; being able to keep that buffer in shared memory makes the cost acceptable. If you feel like it, it could be interesting to measure the impact of using a buffer in global memory instead (with a byte per pixel, without all the bitwise operations).

@nothingface0
Copy link
Contributor

nothingface0 commented Aug 11, 2022

Right, there's the complexity of the approach, too. Thanks!

measure the impact of using a buffer in global memory instead

I might ask you again on how to run profiling using patatrack 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants