Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pixel Local GPU reco crashes on missing detector input (no Pixel FED) #34496

Closed
czangela opened this issue Jul 15, 2021 · 12 comments
Closed

Pixel Local GPU reco crashes on missing detector input (no Pixel FED) #34496

czangela opened this issue Jul 15, 2021 · 12 comments

Comments

@czangela
Copy link
Contributor

czangela commented Jul 15, 2021

1. Description

Similar to #34197.

The idea here was to remove all FED channels above 1199*, and run the reconstruction on this skimmed raw data.

This was run on the release CMSSW_12_0_0_pre3, and machine cmg-gpu1080.cern.ch.

[*] where 1200 is the minimum FED number for the silicon pixel detector.

2. Crash

The reconstruction crashes with a segmentation fault:

Current Modules:

Module: SiPixelRecHitFromCUDA:siPixelRecHitsPreSplitting@cuda (crashed)
Module: none

A fatal system signal has occurred: segmentation violation
Segmentation fault (core dumped)

Full log:
crash.log

3. Reproduce - Short version

From https://aczirkos.web.cern.ch/aczirkos/pixel_crash_test/ run on the provided dataset:

foo@bar$ cmsRun step3.py

4. Reproduce - Long version

0. SSH to GPU equipped machine

Don't forget to be nice and

export CUDA_VISIBLE_DEVICES=P,Q

Where P, Q, etc. are the numbers of the visible GPUs, which you can view with nvidia-smi.

Init release area CMSSW_12_0_0_pre3.

1. generate configs and run

Use pixelTrackingOnly workflow:

136.885502_RunHLTPhy2018D+RunHLTPhy2018D+HLTDR2_2018+RECODR2_2018reHLT_Patatrack_PixelOnlyGPU+HARVEST2018_pixelTrackingOnly

runTheMatrix.py --what gpu -e -l 136.885502 --ibeos --maxSteps=3

2. modifiy step3_RAW2DIGI_RECO_DQM.py

-> add a new module named rawDataCollector to the beginning of the Schedule -> modules after this will see and use this collection

nonpixelfeds = list(range(0,1200))

process.rawDataCollector = cms.EDProducer( "EvFFEDSelector",
   inputTag = cms.InputTag( "tmpRawDataCollectorTag" ),
   fedList = cms.vuint32(nonpixelfeds)
)

Path, sequence, task definition:

process.rawDataCollector_task = cms.Task(process.rawDataCollector)
process.rawDataCollector_sequence = cms.Sequence(process.rawDataCollector_task)
# Path and EndPath definitions
process.rawDataCollector_step=cms.Path(process.rawDataCollector)

Add to schedule

# Schedule definition
process.schedule = cms.Schedule(process.rawDataCollector_step,process.raw2digi_step,process.reconstruction_step,process.dqmoffline_step,process.dqmofflineOnPAT_step,process.RECOoutput_step,process.MINIAODoutput_step,process.DQMoutput_step)
#process.schedule = cms.Schedule(process.raw2digi_step,process.reconstruction_step,process.dqmoffline_step,process.dqmofflineOnPAT_step,process.RECOoutput_step,process.MINIAODoutput_step,process.DQMoutput_step)

3. sed and replace rawDataCollector InputTags

foo@bar$ edmConfigDump step3_RAW2DIGI_RECO_DQM.py | sed 's/"rawDataCollector"/"fedSkimmedData"/g' | sed 's/"tmpRawDataCollectorTag"/"rawDataCollector"/g' step3_sed.py | sed 's/process.rawDataCollector/process.fedSkimmedData/g' > step3.py

run again

foo@bar$ cmsRun step3.py 
@cmsbuild
Copy link
Contributor

A new Issue was created by @czangela .

@Dr15Jones, @perrotta, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign reconstruction, heterogeneous

@makortel
Copy link
Contributor

FYI @cms-sw/trk-dpg-l2 @VinInn

@cmsbuild
Copy link
Contributor

New categories assigned: heterogeneous,reconstruction

@slava77,@fwyzard,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@VinInn
Copy link
Contributor

VinInn commented Jul 15, 2021

Interesting as it WAS protected for
0 == nHits_
but just by eye now is buggy (check acquire and then produce)
so it is obvious that produce will crash

@mmusich
Copy link
Contributor

mmusich commented Jul 15, 2021

with this

diff --git a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc
index f2f1497b4ba..5861a0be734 100644
--- a/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc
+++ b/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc
@@ -74,7 +74,7 @@ void SiPixelRecHitFromCUDA::acquire(edm::Event const& iEvent,
 
   nHits_ = inputData.nHits();
 
-  LogDebug("SiPixelRecHitFromCUDA") << "converting " << nHits_ << " Hits";
+  LogDebug("SiPixelRecHitFromCUDA") << "converting " << nHits_ << " Hits" << std::endl;
 
   if (0 == nHits_)
     return;
@@ -83,18 +83,21 @@ void SiPixelRecHitFromCUDA::acquire(edm::Event const& iEvent,
 }
 
 void SiPixelRecHitFromCUDA::produce(edm::Event& iEvent, edm::EventSetup const& es) {
+
   // allocate a buffer for the indices of the clusters
   auto hmsp = std::make_unique<uint32_t[]>(gpuClustering::maxNumModules + 1);
-  std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + gpuClustering::maxNumModules + 1, hmsp.get());
-  // wrap the buffer in a HostProduct, and move it to the Event, without reallocating the buffer or affecting hitsModuleStart
-  iEvent.emplace(hostPutToken_, std::move(hmsp));
 
   SiPixelRecHitCollection output;
+  output.reserve(gpuClustering::maxNumModules, nHits_);
   if (0 == nHits_) {
     iEvent.emplace(rechitsPutToken_, std::move(output));
+    iEvent.emplace(hostPutToken_, std::move(hmsp));
     return;
   }
-  output.reserve(gpuClustering::maxNumModules, nHits_);
+
+  std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + gpuClustering::maxNumModules + 1, hmsp.get());
+  // wrap the buffer in a HostProduct, and move it to the Event, without reallocating the buffer or affecting hitsModuleStart
+  iEvent.emplace(hostPutToken_, std::move(hmsp));
 
   auto xl = store32_.get();
   auto yl = xl + nHits_;

it runs for me.
Acceptable?

@VinInn
Copy link
Contributor

VinInn commented Jul 15, 2021

one can leave the reserve after the return; (very very minor).
I think should be ok.

still: was tested (long long ago) with nHits_==0; and the history has been erased by the file renaming. So no way to understand when and why was changed.

@VinInn
Copy link
Contributor

VinInn commented Jul 15, 2021

ok found a version in CMSSW_11_1_0_pre8_Patatrack when was named SiPixelRecHitFromSOA.cc and the code is the same.
No clue "how" was tested then... (maybe before introducing "hmsp")

@VinInn
Copy link
Contributor

VinInn commented Jul 15, 2021

maybe easier to just protect the copy

if(hitsModuleStart_) std::copy(hitsModuleStart_.get(), hitsModuleStart_.get() + gpuClustering::maxNumModules + 1, hmsp.get());

@makortel
Copy link
Contributor

+heterogeneous

PRs to master and 11_3_X have been merged

@slava77
Copy link
Contributor

slava77 commented Jul 22, 2021

+reconstruction

#34501 and #34503 were merged

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants