New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DeepAK8 tagger integration #23768
DeepAK8 tagger integration #23768
Conversation
The code-checks are being triggered in jenkins. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-23768/5441 |
A new Pull Request was created by @hqucms (Huilin Qu) for master. It involves the following packages: DataFormats/BTauReco The following packages do not have a category, yet: PhysicsTools/MXNet @perrotta, @monttj, @cmsbuild, @slava77, @gpetruc, @arizzi can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
@cmsbuild please tests with cms-sw/cmsdist#4185 |
@cmsbuild please test with cms-sw/cmsdist#4185 trying again |
The tests are being triggered in jenkins. |
you have a typo.. please tests vs please test
… On Jul 11, 2018, at 9:58 AM, Slava Krutelyov ***@***.***> wrote:
@cmsbuild please test with cms-sw/cmsdist#4185
trying again
@smuzaffar , was there some downtime in jenkins, or do I have some typo in the test request?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
-1 Tested at: 9160779 You can see the results of the tests here: I found follow errors while testing this PR Failed tests: RelVals AddOn
The relvals timed out after 2 hours. runTheMatrix-results/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step3_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log136.85 step3 runTheMatrix-results/136.85_RunEGamma2018A+RunEGamma2018A+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM/step3_RunEGamma2018A+RunEGamma2018A+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM.log20434.0 step1 runTheMatrix-results/20434.0_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D19_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D19+RecoFullGlobal_2023D19+HARVESTFullGlobal_2023D19/step1_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D19_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D19+RecoFullGlobal_2023D19+HARVESTFullGlobal_2023D19.log21234.0 step1 runTheMatrix-results/21234.0_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D21_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D21+RecoFullGlobal_2023D21+HARVESTFullGlobal_2023D21/step1_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D21_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D21+RecoFullGlobal_2023D21+HARVESTFullGlobal_2023D21.log
I found errors in the following addon tests: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is also important to test the CPU usage of this module compared to other reco/analysis modules, and understand if the network can be simplified or if other optimizations can be made.
// convert inputs | ||
make_inputs(taginfo); | ||
// run prediction and get outputs | ||
outputs = predictor_->predict(data_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I ran a simplified version of this producer (https://github.com/TreeMaker/TreeMaker/blob/c3be78637ae6d4ba4692cd01b68a1b149f005a38/Utils/src/DeepAK8Producer.cc, using the GitLab version) over 2018 prompt data (just for testing), I occasionally ran into "Error running forward"
exceptions. This happened when I ran using 4 threads, but not when I reran the same event using 1 thread, so I suspect it is a data race or other thread-safety issue in the mxnet library.
The GitLab version used mxnet 1.1.0, while the version added to CMSSW is 1.2.0 (cms-sw/cmsdist#4167). It is possible the data race was fixed in the newer version. However, this should be tested carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kpedro88 This is very useful to know. When I was testing this PR, I tried running over the TTBar RelVal samples with 8 threads for 9000 events and did not see any error. Then, looking at the MXNet changelog from 1.1.0 to 1.2.0, I indeed noticed this one:
Fixed race condition for CPUSharedStorageManager->Free and launched workers at iter init stage to avoid frequent relaunch (apache/mxnet#10096).
So I suspect this is the cause for what you saw, and moving to 1.2.0 should solve it. Of course, more tests and feedback are more than welcome :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sometimes took more than 10K events before I saw an exception when running with 4 threads. @slava77 told me he might try to run it on KNL, in which case we will definitely find out if there are still data races in the 1.2.0 release of mxnet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then I can probably try to run with more events.
@slava77 told me he might try to run it on KNL, in which case we will definitely find out if there are still data races in the 1.2.0 release of mxnet.
This would be very interesting to see :)
|
||
} | ||
|
||
if (debug_){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all debug outputs should be replaced with LogDebug
'BTagProbabilityToDiscriminator', | ||
discriminators = cms.VPSet( | ||
cms.PSet( | ||
name = cms.string('TvsQCD'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming these are the "binarized" scores provided in the GitLab version, it would be nice to add the separate ZvsQCD, ZbbvsQCD, HbbvsQCD, H4qvsQCD discriminators that were provided there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kpedro88 These scores are added now: https://github.com/cms-sw/cmssw/pull/23768/files#diff-0b5067ac36f218c17fd59ded2a4272b6R39.
I also hope this can be backported to 94X for analysis use (and maybe 101X for prompt data studies, though not as essential). |
@kpedro88 I think that we should test it before in master, then if there is a use case for 94X we can consider it |
@fabiocos sure, it should be tested in master first (and needs extensive testing, IMO). But I know of several 2016+2017 analyses that want to use this, so it's preferable to have a 94X release that includes it, eventually. |
@cmsbuild please test with cms-sw/cmsdist#4185 |
This should be fixed by cms-data/RecoBTag-Combined#16. |
@cmsbuild please test with cms-sw/cmsdist#4317 |
The tests are being triggered in jenkins. |
@slava77 And it also looks reasonable to me:
|
Comparison job queued. |
Comparison is ready Comparison Summary:
|
+1
|
+1 the python additions look compatible with the recent updates |
merge |
Introduction
This PR is to integrate the DeepAK8 tagger into CMSSW. The DeepAK8 tagger is a multi-class tagger for identifying boosted hadronic top, W, Z, Higgs using AK8 jets. It uses low-level inputs (jet constituent particles and secondary vertices) and customized deep neural networks, and have shown significant improvement in performance compared to traditional approaches. Two versions of DeepAK8 have been developed: the nominal version aims at achieving the best possible performance but sculpts the mass distribution in background jets, while the mass-decorrelated version aims at minimizing the mass sculpting while keeping the performance as much as possible. Both versions are included in this PR. More details about the DeepAK8 tagger are summarized in twiki, slides, CMS-DP-2017-049, and NIPS paper.
Prerequisites
MXNet (as an CMSSW external): cms-sw/cmsdist#4167
DNN model files: cms-data/RecoBTag-Combined#15
Implementation
The implementation in this PR is based on the b-tagging framework. Similar to DeepFlavour and DeepDoubleB, the DeepAK8 tagger (named
pfDeepBoostedJetTags
andpfMassDecorrelatedDeepBoostedJetTags
in this PR) is also trained with MiniAOD inputs (e.g.,pat::PackedCandidate
), so we follow the same strategy and add this tagger to MiniAOD by updating b-tagging onslimmedJetsAK8
. We tried to set up the code to run on RECO inputs (e.g.,reco::PFCandidate
) as well but that part does not work now and is not used anywhere in this PR. This may be revisited in the future.An overview of the changes:
DataFormats/BTauReco
:FeaturesTagInfo
to a separate file, since it is the base class for DeepFlavour, DeepDoubleB and DeepBoostedJet nowRecoBTag/DeepBoostedJet
:RecoBTag/TensorFlow
:PhysicsTools/MXNet
:RecoBTag/Configuration
,PhysicsTools/PatAlgos
:DataFormats/Candidate/interface/CompositePtrCandidate.h
,DataFormats/PatCandidates/interface/Jet.h
:Validation
We have compared the discriminator values from this CMSSW implementation to the results obtained from the training framework using a TTBar RelVal sample. As shown below, the CMSSW implementation (running on RECO->MiniAOD) reproduces the results of the training framework (MiniAOD->standalone MXNet) very well.