Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace CUDA API wrapper memory operations with native CUDA calls #395

Merged
merged 15 commits into from
Oct 29, 2019
Merged

Replace CUDA API wrapper memory operations with native CUDA calls #395

merged 15 commits into from
Oct 29, 2019

Conversation

waredjeb
Copy link

@waredjeb waredjeb commented Oct 25, 2019

PR description

This PR is part of #386:

  • replace cuda::memory::copy() with cudaMemcpy(), cuda::memory::async::copy() with cudaMemcpyAsync()
  • replace cuda::memory::zero() and cuda::memory::set() with cudaMemset()
  • replace cuda::memory::async::zero() and cuda::memory::async::set() with cudaMemsetAsync()

PR validation

unit tests run

@@ -4,5 +4,5 @@

BeamSpotCUDA::BeamSpotCUDA(Data const* data_h, cuda::stream_t<>& stream) {
data_d_ = cudautils::make_device_unique<Data>(stream);
cuda::memory::async::copy(data_d_.get(), data_h, sizeof(Data), stream.id());
cudaMemcpyAsync(data_d_.get(), data_h, sizeof(Data), cudaMemcpyHostToDevice, stream.id());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've typically used cudaMemcpyDefault elsewhere (but I'm not against of denoting the direction explicitly).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So could I leave the cudaMemcpy with the direction defined?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So could I leave the cudaMemcpy with the direction defined?

I'm fine with that.

@makortel
Copy link

There are conflicts so this PR needs to be rebased (OTOH this PR conflicts also with #389, so it could be less work to wait until that one gets merged, I let @fwyzard to comment whether #389 could get in soon).

@fwyzard

This comment has been minimized.

@fwyzard fwyzard changed the title Replace cuda::memory[::async]::copy() with cudaMemcpy[Async](), cuda:… Replace CUDA API wrapper memory operations with native CUDA calls Oct 26, 2019
Copy link

@fwyzard fwyzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you

  • fix the spurious lines
  • fix the types used in the copies
  • not add/remove whitespaces and empty lines (unless it is done on purpose)
    ?

Then, one thing I forgot to ask you earlier: could you wrap every call to cudaMemcpy(...), cudaMemcpyAsync(...), cudaMemset(...), cudaMemsetAsync(...) in a call to cudaCheck() ?
For example

  cudaMemcpyAsync(data_d_.get(), data_h, sizeof(Data), cudaMemcpyHostToDevice, stream);

should become

  cudaCheck(cudaMemcpyAsync(data_d_.get(), data_h, sizeof(Data), cudaMemcpyHostToDevice, stream));

To make it available, you may need to add

#include "HeterogeneousCore/CUDAUtilities/interface/cudaCheck.h"

if it was not already there.

DataFormats/Math/test/CholeskyInvert_t.cu Outdated Show resolved Hide resolved
HeterogeneousCore/CUDAUtilities/test/copyAsync_t.cpp Outdated Show resolved Hide resolved
HeterogeneousCore/CUDAUtilities/test/copyAsync_t.cpp Outdated Show resolved Hide resolved
HeterogeneousCore/CUDAUtilities/test/copyAsync_t.cpp Outdated Show resolved Hide resolved
RecoLocalTracker/SiPixelClusterizer/test/gpuClustering_t.h Outdated Show resolved Hide resolved
cuda::memory::copy(nn, LOC_ONGPU(ndof), nv * sizeof(int32_t));
cuda::memory::copy(chi2, LOC_ONGPU(chi2), nv * sizeof(float));
cudaMemcpy(&nv, LOC_ONGPU(nvFinal), sizeof(uint32_t), cudaMemcpyDeviceToHost);
cudaMemcpy(nn, LOC_ONGPU(ndof), nv * sizeof(uint32_t), cudaMemcpyDeviceToHost);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here uint32_t was originally int32_t

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified, only cudaMemcpy(nn, LOC_ONGPU(ndof), nv * sizeof(uint32_t), cudaMemcpyDeviceToHost); was originally int32_t

@fwyzard
Copy link

fwyzard commented Oct 28, 2019

Validation summary

Reference release CMSSW_11_0_0_pre7 at 411b633
Development branch CMSSW_11_0_X_Patatrack at 617f9a0
Testing PRs:

Validation plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

scan-136.86452.png
zoom-136.86452.png

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/9a76577bb63975315ef69ad1b88a362a5932bc83/log .

@makortel makortel mentioned this pull request Oct 28, 2019
20 tasks
@fwyzard
Copy link

fwyzard commented Oct 28, 2019

@VinInn could you have a look at this PR ?

The changes should be only technical (moving from the cuda::memory::.. wrappers to the standard cudaMemcpy() etc. functions), but we observe a non negligible change in the TTbra realistic performance for the tracks associate to the PV:

  reference-10824.5 development-10824.5 development-10824.52 testing-10824.52
Number of TrackingParticles (after cuts)   4605 4950 5017
Number of matched TrackingParticles   2346 2757 2790
Number of tracks   3410 4371 4416
Number of true tracks   3025 3860 3905
Number of fake tracks   385 511 511
Number of pileup tracks   0 0 0
Number of duplicate tracks   44 0 0

while there doesnt seem to be any change in the overall tracks:

  reference-10824.5 development-10824.5 development-10824.52 testing-10824.52
Efficiency 0.5128 0.5252 0.5818 0.5818
Number of TrackingParticles (after cuts) 5530 5320 5320 5320
Number of matched TrackingParticles 2836 2794 3095 3095
Fake rate 0.0472 0.0479 0.0212 0.0212
Duplicate rate 0.0150 0.0152 0.0003 0.0003
Number of tracks 32648 32656 39763 39763
Number of true tracks 31108 31093 38921 38920
Number of fake tracks 1540 1563 842 843
Number of pileup tracks 27279 27270 34468 34467
Number of duplicate tracks 491 495 12 12

@fwyzard fwyzard requested a review from VinInn October 28, 2019 16:42
}

template <typename T>
inline void copyAsync(cudautils::host::unique_ptr<T>& dst,
const cudautils::device::unique_ptr<T>& src,
cudaStream_t stream) {
static_assert(std::is_array<T>::value == false, "For array types, use the other overload with the size parameter");
cuda::memory::async::copy(dst.get(), src.get(), sizeof(T), stream);
cudaCheck(cudaMemcpyAsync(dst.get(), src.get(), sizeof(T), cudaMemcpyHostToDevice, stream));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is device2host

Copy link

@makortel makortel Oct 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And "Calling cudaMemcpyAsync() with dst and src pointers that do not match the direction of the copy results in an undefined behavior." (*), so specifying the direction explicitly is actually harmful?

(*) https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I think we agreed to remove all explicit directions.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling cudaMemcpyAsync() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.

I thought it was supposed to crash...

}

template <typename T>
inline void copyAsync(cudautils::host::unique_ptr<T[]>& dst,
const cudautils::device::unique_ptr<T[]>& src,
size_t nelements,
cudaStream_t stream) {
cuda::memory::async::copy(dst.get(), src.get(), nelements * sizeof(T), stream);
cudaCheck(cudaMemcpyAsync(dst.get(), src.get(), nelements * sizeof(T), cudaMemcpyHostToDevice, stream));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@fwyzard
Copy link

fwyzard commented Oct 28, 2019

Validation summary

Reference release CMSSW_11_0_0_pre7 at 411b633
Development branch CMSSW_11_0_X_Patatrack at 617f9a0
Testing PRs:

Validation plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

Throughput plots

/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53

scan-136.86452.png
zoom-136.86452.png

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • reference release, workflow 10824.5
  • development release, workflow 10824.5
  • development release, workflow 10824.51
  • development release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • development release, workflow 136.86452
  • testing release, workflow 10824.5
  • testing release, workflow 10824.51
  • testing release, workflow 10824.52
    • ✔️ step3.py: log
    • ✔️ profile.py: log
    • ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
    • ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
  • testing release, workflow 136.86452

Logs

The full log is available at https://patatrack.web.cern.ch/patatrack/validation/pulls/4b8bedbe5a9102199e1d28c223042e43fcda503d/log .

@fwyzard
Copy link

fwyzard commented Oct 28, 2019

OK, now it looks better.
The same summary comparison now give identical results:

  reference-10824.5 development-10824.5 development-10824.52 testing-10824.52
Number of TrackingParticles (after cuts)   4605 5017 5017
Number of matched TrackingParticles   2346 2790 2790
Number of tracks   3410 4416 4416
Number of true tracks   3025 3905 3905
Number of fake tracks   385 511 511
Number of pileup tracks   0 0 0
Number of duplicate tracks   44 0 0

and all the others show identical or almost identical results.

@fwyzard fwyzard merged commit 6bfe94f into cms-patatrack:CMSSW_11_0_X_Patatrack Oct 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants