DeepAK8 tagger integration #23768

hqucms · 2018-07-09T11:41:43Z

Introduction

This PR integrates the DeepAK8 tagger into CMSSW. DeepAK8 is a multi-class tagger for identifying boosted, hadronically decaying top quarks and W, Z, and Higgs bosons using AK8 jets. It uses low-level inputs (jet constituent particles and secondary vertices) and customized deep neural networks, and has shown significant performance improvements over traditional approaches. Two versions of DeepAK8 have been developed: the nominal version aims at the best possible performance but sculpts the mass distribution of background jets, while the mass-decorrelated version aims at minimizing the mass sculpting while preserving as much of the performance as possible. Both versions are included in this PR. More details about the DeepAK8 tagger are summarized in the twiki, the slides, CMS-DP-2017-049, and the NIPS paper.

Prerequisites

MXNet (as a CMSSW external): cms-sw/cmsdist#4167
DNN model files: cms-data/RecoBTag-Combined#15

Implementation

The implementation in this PR is based on the b-tagging framework. Like DeepFlavour and DeepDoubleB, the DeepAK8 tagger (named pfDeepBoostedJetTags and pfMassDecorrelatedDeepBoostedJetTags in this PR) is trained with MiniAOD inputs (e.g., pat::PackedCandidate), so we follow the same strategy and add this tagger to MiniAOD by updating the b-tagging on slimmedJetsAK8. We also tried to set up the code to run on RECO inputs (e.g., reco::PFCandidate), but that part does not currently work and is not used anywhere in this PR. This may be revisited in the future.

An overview of the changes:

DataFormats/BTauReco:

new classes for the features and taginfo
moved FeaturesTagInfo to a separate file, since it is now the base class for DeepFlavour, DeepDoubleB and DeepBoostedJet

RecoBTag/DeepBoostedJet:

producers for the TagInfo and the Tag results
corresponding cfi/cff

RecoBTag/TensorFlow:

small refactor to allow some functions to be reused in DeepBoostedJet

PhysicsTools/MXNet:

convenience wrapper around MXNet, based on the C prediction API, plus unit tests

RecoBTag/Configuration, PhysicsTools/PatAlgos:

enable DeepBoostedJet in b-tagging framework
add DeepBoostedJet to MiniAOD

DataFormats/Candidate/interface/CompositePtrCandidate.h,
DataFormats/PatCandidates/interface/Jet.h:

override clearDaughters in pat::Jet to reset the daughter cache too
This is needed because of the changes introduced by "Simplify jet constituent access in MiniAOD" #22914. See hqucms@c04145d#commitcomment-29634182 for a detailed explanation.
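The stale-cache problem behind the clearDaughters override can be illustrated with a generic sketch. It is written in Python, and the class and method names (CompositeCandidateSketch, JetSketch, daughters, clear_daughters) are hypothetical; it only assumes that the jet keeps a lazily built cache of daughters next to the stored daughter references, and does not reproduce the actual CompositePtrCandidate or pat::Jet code.

```python
class CompositeCandidateSketch:
    """Hypothetical stand-in for a composite candidate with a lazily filled daughter cache."""

    def __init__(self):
        self._daughters = []   # the stored daughter references
        self._cache = None     # resolved daughters, built on first access

    def add_daughter(self, daughter):
        self._daughters.append(daughter)
        self._cache = None     # adding a daughter invalidates the cache

    def clear_daughters(self):
        # Base behaviour: clears the stored references but NOT the cache.
        self._daughters.clear()

    def daughters(self):
        if self._cache is None:
            # Copy, so the cache is independent of the stored list.
            self._cache = list(self._daughters)
        return self._cache


class JetSketch(CompositeCandidateSketch):
    """Hypothetical analogue of the pat::Jet override: also resets the cache."""

    def clear_daughters(self):
        super().clear_daughters()
        self._cache = None
```

With the base class, daughters() keeps returning the old list after clear_daughters(); with the override it returns an empty list, and daughters added afterwards show up correctly.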

Validation

We have compared the discriminator values from this CMSSW implementation to the results obtained from the training framework using a TTBar RelVal sample. As shown in the comparison plots, the CMSSW implementation (running on RECO->MiniAOD) reproduces the results of the training framework (MiniAOD->standalone MXNet) very well.
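A jet-by-jet comparison of this kind can be summarized with two numbers: the largest absolute difference and the fraction of jets that agree within a tolerance. The sketch below is a hypothetical helper (compare_scores and the default tolerance are assumptions, not part of this PR) and assumes both score lists are already matched jet by jet in the same order.

```python
def compare_scores(cmssw, reference, tol=1e-3):
    """Compare per-jet discriminator values from two implementations.

    Returns (max absolute difference, fraction of jets agreeing within tol).
    Assumes both sequences are matched jet by jet in the same order.
    """
    if len(cmssw) != len(reference):
        raise ValueError("score lists must be matched jet by jet")
    if not cmssw:
        raise ValueError("no jets to compare")
    diffs = [abs(a - b) for a, b in zip(cmssw, reference)]
    n_ok = sum(1 for d in diffs if d <= tol)
    return max(diffs), n_ok / len(diffs)
```

A histogram of the differences is more informative than the maximum alone, but the maximum is a quick check for outliers such as mismatched jets or wrong input ordering.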

cmsbuild · 2018-07-09T11:43:04Z

The code-checks are being triggered in jenkins.

cmsbuild · 2018-07-09T11:55:24Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-23768/5441

cmsbuild · 2018-07-09T11:55:43Z

A new Pull Request was created by @hqucms (Huilin Qu) for master.

It involves the following packages:

DataFormats/BTauReco
DataFormats/Candidate
DataFormats/PatCandidates
PhysicsTools/MXNet
PhysicsTools/PatAlgos
RecoBTag/Configuration
RecoBTag/DeepBoostedJet
RecoBTag/TensorFlow

The following packages do not have a category, yet:

PhysicsTools/MXNet
RecoBTag/DeepBoostedJet
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@perrotta, @monttj, @cmsbuild, @slava77, @gpetruc, @arizzi can you please review it and eventually sign? Thanks.
@TaiSakuma, @gouskos, @rappoccio, @HeinerTholen, @seemasharmafnal, @mmarionncern, @imarches, @ahinzmann, @smoortga, @acaudron, @jdolen, @drkovalskyi, @ferencek, @rovere, @jdamgov, @nhanvtran, @gkasieczka, @schoef, @clelange, @JyothsnaKomaragiri, @mverzett, @cbernet, @gpetruc, @mariadalfonso, @pvmulder this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

slava77 · 2018-07-11T05:33:36Z

@cmsbuild please tests with cms-sw/cmsdist#4185

slava77 · 2018-07-11T06:58:00Z

@cmsbuild please test with cms-sw/cmsdist#4185

trying again
@smuzaffar, was there some downtime in jenkins, or do I have a typo in the test request?

cmsbuild · 2018-07-11T06:58:22Z

The tests are being triggered in jenkins.
Using externals from cms-sw/cmsdist#4185
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/29070/console

davidlange6 · 2018-07-11T07:05:06Z

you have a typo: "please tests" vs "please test"


cmsbuild · 2018-07-11T16:13:04Z

-1

Tested at: 9160779

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-23768/29070/summary.html

I found the following errors while testing this PR

Failed tests: RelVals AddOn

RelVals:

The relvals timed out after 2 hours.
When I ran the RelVals I found an error in the following workflows:
1000.0 step3

runTheMatrix-results/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step3_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log

136.85 step3

runTheMatrix-results/136.85_RunEGamma2018A+RunEGamma2018A+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM/step3_RunEGamma2018A+RunEGamma2018A+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM.log

20434.0 step1

runTheMatrix-results/20434.0_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D19_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D19+RecoFullGlobal_2023D19+HARVESTFullGlobal_2023D19/step1_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D19_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D19+RecoFullGlobal_2023D19+HARVESTFullGlobal_2023D19.log

21234.0 step1

runTheMatrix-results/21234.0_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D21_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D21+RecoFullGlobal_2023D21+HARVESTFullGlobal_2023D21/step1_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D21_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D21+RecoFullGlobal_2023D21+HARVESTFullGlobal_2023D21.log

AddOn:

I found errors in the following addon tests:

kpedro88

I think it is also important to test the CPU usage of this module compared to other reco/analysis modules, and understand if the network can be simplified or if other optimizations can be made.

kpedro88 · 2018-07-12T08:53:50Z

RecoBTag/DeepBoostedJet/plugins/DeepBoostedJetTagsProducer.cc

+      // convert inputs
+      make_inputs(taginfo);
+      // run prediction and get outputs
+      outputs = predictor_->predict(data_);


When I ran a simplified version of this producer (https://github.com/TreeMaker/TreeMaker/blob/c3be78637ae6d4ba4692cd01b68a1b149f005a38/Utils/src/DeepAK8Producer.cc, using the GitLab version) over 2018 prompt data (just for testing), I occasionally ran into "Error running forward" exceptions. This happened when I ran using 4 threads, but not when I reran the same event using 1 thread, so I suspect it is a data race or other thread-safety issue in the mxnet library.

The GitLab version used mxnet 1.1.0, while the version added to CMSSW is 1.2.0 (cms-sw/cmsdist#4167). It is possible the data race was fixed in the newer version. However, this should be tested carefully.

@kpedro88 This is very useful to know. When I was testing this PR, I tried running over the TTBar RelVal samples with 8 threads for 9000 events and did not see any error. Then, looking at the MXNet changelog from 1.1.0 to 1.2.0, I indeed noticed this one:

Fixed race condition for CPUSharedStorageManager->Free and launched workers at iter init stage to avoid frequent relaunch (apache/mxnet#10096).

So I suspect this is the cause for what you saw, and moving to 1.2.0 should solve it. Of course, more tests and feedback are more than welcome :)

It sometimes took more than 10K events before I saw an exception when running with 4 threads. @slava77 told me he might try to run it on KNL, in which case we will definitely find out if there are still data races in the 1.2.0 release of mxnet.

OK, then I can probably try to run with more events.

@slava77 told me he might try to run it on KNL, in which case we will definitely find out if there are still data races in the 1.2.0 release of mxnet.

This would be very interesting to see :)
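The test described above (the same events give an exception with 4 threads but not with 1 thread) can be automated by running a predictor serially and then concurrently and flagging inputs whose results differ or raise. The sketch below is generic Python; find_thread_mismatches is a hypothetical helper and predict stands in for the MXNet-backed prediction. Because races are probabilistic, a clean result only means no race was observed, which is why many events (more than 10K in the case above) may be needed.

```python
from concurrent.futures import ThreadPoolExecutor


def find_thread_mismatches(predict, inputs, n_threads=4):
    """Return indices whose threaded result differs from the serial one or raised."""
    serial = [predict(x) for x in inputs]  # single-thread reference results

    def safe_predict(x):
        try:
            return predict(x)
        except Exception as exc:  # e.g. an "Error running forward"-style failure
            return exc

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        threaded = list(pool.map(safe_predict, inputs))  # map preserves input order

    return [
        i
        for i, (s, t) in enumerate(zip(serial, threaded))
        if isinstance(t, Exception) or s != t
    ]
```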

kpedro88 · 2018-07-12T08:54:21Z

RecoBTag/DeepBoostedJet/plugins/DeepBoostedJetTagsProducer.cc

+
+  }
+
+  if (debug_){


all debug outputs should be replaced with LogDebug

kpedro88 · 2018-07-12T08:59:01Z

RecoBTag/DeepBoostedJet/python/pfMassDecorrelatedDeepBoostedDiscriminatorsJetTags_cfi.py

+   'BTagProbabilityToDiscriminator',
+   discriminators = cms.VPSet(
+      cms.PSet(
+         name = cms.string('TvsQCD'),


assuming these are the "binarized" scores provided in the GitLab version, it would be nice to add the separate ZvsQCD, ZbbvsQCD, HbbvsQCD, H4qvsQCD discriminators that were provided there

@kpedro88 These scores are added now: https://github.com/cms-sw/cmssw/pull/23768/files#diff-0b5067ac36f218c17fd59ded2a4272b6R39.
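The cfi above uses BTagProbabilityToDiscriminator to build binarized "X vs QCD" scores; the exact formula it applies is not spelled out in this thread. A common construction, assumed here, divides the summed probabilities of the signal classes by the summed probabilities of the signal plus QCD classes. The class names and the default value for a zero denominator are hypothetical and do not correspond to the actual output labels of pfDeepBoostedJetTags.

```python
def prob_to_discriminator(probs, numerator, denominator, default=-1.0):
    """Binarized score: sum of numerator class probabilities over sum of denominator ones.

    `probs` maps (hypothetical) class names to per-jet probabilities.
    `default` is returned when the denominator is zero (an assumed convention).
    """
    num = sum(probs[name] for name in numerator)
    den = sum(probs[name] for name in denominator)
    return num / den if den > 0 else default


# Illustrative class names only.
TOP = ["top_bcq", "top_bqq"]
QCD = ["qcd_bb", "qcd_others"]

jet = {"top_bcq": 0.3, "top_bqq": 0.3, "w_cq": 0.1, "qcd_bb": 0.1, "qcd_others": 0.2}
t_vs_qcd = prob_to_discriminator(jet, numerator=TOP, denominator=TOP + QCD)
# 0.6 / 0.9, i.e. about 0.667
```

Because other signal classes (here w_cq) are left out of the denominator, a jet that looks like a W does not dilute its TvsQCD score; it only competes against the QCD hypothesis.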

kpedro88 · 2018-07-12T09:02:39Z

I also hope this can be backported to 94X for analysis use (and maybe 101X for prompt data studies, though not as essential).

fabiocos · 2018-07-12T09:05:07Z

@kpedro88 I think that we should test it in master first; then, if there is a use case for 94X, we can consider it

kpedro88 · 2018-07-12T09:07:30Z

@fabiocos sure, it should be tested in master first (and needs extensive testing, IMO). But I know of several 2016+2017 analyses that want to use this, so it's preferable to have a 94X release that includes it, eventually.

slava77 · 2018-07-12T12:48:34Z

@cmsbuild please test with cms-sw/cmsdist#4185

hqucms · 2018-09-07T14:10:37Z

Looking in the test outputs, e.g. 136.8311, we have new printouts

[17:56:14] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[17:56:14] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[17:56:14] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.1.0. Attempting to upgrade...
[17:56:14] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
these better be suppressed by using up-to-date inputs.

This should be fixed by cms-data/RecoBTag-Combined#16.

slava77 · 2018-09-07T14:40:13Z

@cmsbuild please test with cms-sw/cmsdist#4317

cmsbuild · 2018-09-07T14:42:09Z

The tests are being triggered in jenkins.
Using externals from cms-sw/cmsdist#4317
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/30296/console

hqucms · 2018-09-07T17:39:08Z

@slava77
I tested the Phase2 workflows 20034.0 and 20034.11 and they seem to run fine. I also took a look at the timing on a PU200 ttbar sample with:
cmsDriver.py step1 --filein 'dbs:/RelValTTbar_14TeV/CMSSW_10_2_0-PU25ns_102X_upgrade2023_realistic_v7_2023D29PU200-v1/GEN-SIM-RECO' -n 100 --fileout file:output_step1.root --mc --eventcontent AODSIM,MINIAODSIM --runUnscheduled --datatier AODSIM,MINIAODSIM --conditions auto:phase2_realistic --beamspot HLLHC14TeV --customise_commands "process.AODSIMoutput.outputCommands.append('keep recoTrackExtras_generalTracks_*_*')" --step PAT --nThreads 8 --geometry Extended2023D29 --era Phase2_timing --python_filename phase2.py --no_exec

And it also looks reasonable to me:

TimeReport   0.000059     0.000059     0.000059  pfDeepBoostedDiscriminatorsJetTagsSlimmedAK8DeepTags
TimeReport   0.000337     0.000337     0.000337  pfDeepBoostedJetTagInfosSlimmedAK8DeepTags
TimeReport   0.006716     0.006716     0.006716  pfDeepBoostedJetTagsSlimmedAK8DeepTags
TimeReport   0.000062     0.000062     0.000062  pfMassDecorrelatedDeepBoostedDiscriminatorsJetTagsSlimmedAK8DeepTags
TimeReport   0.007515     0.007515     0.007515  pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags
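Adding up the five DeepAK8-related modules gives the total cost per event. The sketch assumes, without checking the Timing service documentation, that the first numeric column is the per-event time in seconds; under that assumption the five modules together take about 0.0147 s (roughly 15 ms) per event in this test. total_time_per_event is a hypothetical helper.

```python
def total_time_per_event(report_lines, name_filter=""):
    """Sum the first numeric column of 'TimeReport' module lines.

    Assumes each line looks like 'TimeReport  <t1> <t2> <t3>  <module label>'
    and that the first number is the per-event time in seconds.
    """
    total = 0.0
    for line in report_lines:
        fields = line.split()
        if len(fields) == 5 and fields[0] == "TimeReport" and name_filter in fields[4]:
            total += float(fields[1])
    return total


report = """\
TimeReport   0.000059     0.000059     0.000059  pfDeepBoostedDiscriminatorsJetTagsSlimmedAK8DeepTags
TimeReport   0.000337     0.000337     0.000337  pfDeepBoostedJetTagInfosSlimmedAK8DeepTags
TimeReport   0.006716     0.006716     0.006716  pfDeepBoostedJetTagsSlimmedAK8DeepTags
TimeReport   0.000062     0.000062     0.000062  pfMassDecorrelatedDeepBoostedDiscriminatorsJetTagsSlimmedAK8DeepTags
TimeReport   0.007515     0.007515     0.007515  pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags
""".splitlines()

print(f"{total_time_per_event(report):.6f} s/event")  # prints 0.014689 s/event
```

The numbers also show that nearly all of the time is spent in the two network-evaluation modules (pf*DeepBoostedJetTags), not in building the TagInfo or the binarized discriminators.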

cmsbuild · 2018-09-07T20:06:58Z

+1
Tested at: 1acac95
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-23768/30296/summary.html

cmsbuild · 2018-09-07T20:07:06Z

Comparison job queued.

cmsbuild · 2018-09-07T21:55:14Z

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-23768/30296/summary.html

Comparison Summary:

No significant changes to the logs found
Reco comparison results: 500 differences found in the comparisons
DQMHistoTests: Total files compared: 32
DQMHistoTests: Total histograms compared: 3143975
DQMHistoTests: Total failures: 2
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3143776
DQMHistoTests: Total skipped: 197
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
Checked 133 log files, 14 edm output root files, 32 DQM output files
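The per-category counts in this summary add up to the total number of histograms compared: 3143776 + 2 + 0 + 197 = 3143975. The sketch below parses the "DQMHistoTests: Total ..." lines and performs that check; the line format is inferred from the summary shown here, not from cms-bot documentation, and both helper names are hypothetical.

```python
PREFIX = "DQMHistoTests: Total "


def parse_dqm_totals(lines):
    """Map each 'DQMHistoTests: Total <key>: <n>' line to {key: n}."""
    counts = {}
    for line in lines:
        if line.startswith(PREFIX):
            key, _, value = line[len(PREFIX):].rpartition(":")
            counts[key.strip()] = int(value)
    return counts


def categories_add_up(counts):
    """Check that successes + failures + nulls + skipped equals the number compared."""
    parts = ("successes", "failures", "nulls", "skipped")
    return sum(counts[p] for p in parts) == counts["histograms compared"]


summary = """\
DQMHistoTests: Total files compared: 32
DQMHistoTests: Total histograms compared: 3143975
DQMHistoTests: Total failures: 2
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3143776
DQMHistoTests: Total skipped: 197
DQMHistoTests: Total Missing objects: 0
""".splitlines()

counts = parse_dqm_totals(summary)
```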

slava77 · 2018-09-10T23:22:20Z

+1

for #23768 1acac95

code changes are in line with the PR description and the follow up review. This PR is expected to modify the miniAOD content in saved AK8 jet discriminants (embedded in the jets): 46 new discriminants were added, compared to 10 previously available.
jenkins tests pass and comparisons with the baseline show differences in the AK8 jet discriminants size.
[partly based on https://github.com/cms-sw/cmssw/pull/23768#issuecomment-417055124] local tests show that the cost increase from running the new taggers is acceptable:
- disk size is up by less than 1% (1% is observed in 1000 event test, but it should get a bit better with more events and compression)
- CPU use has increased by about 3% for miniAOD-only jobs (~0.3% of total reco time)
- RSS size increased by 23 MB, of which 11 MB is expected to scale up with the number of threads
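The two CPU figures can be cross-checked. Assuming, as an interpretation not stated explicitly above, that both percentages describe the same absolute increase in CPU time, they imply that the miniAOD step accounts for roughly 10% of the total reco time in these tests:

```latex
\Delta T_{\mathrm{CPU}} \approx 0.03\,T_{\mathrm{miniAOD}} \approx 0.003\,T_{\mathrm{reco}}
\quad\Longrightarrow\quad
\frac{T_{\mathrm{miniAOD}}}{T_{\mathrm{reco}}} \approx \frac{0.003}{0.03} = 0.1
```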

fabiocos · 2018-09-11T12:04:01Z

+1

the python additions look compatible with the recent updates
a *params data file is added in a test area, but since it is only 480 bytes it should be OK

fabiocos · 2018-09-11T12:04:21Z

merge

cmsbuild added this to the CMSSW_10_2_X milestone Jul 9, 2018

cmsbuild added analysis-pending code-checks-pending comparison-pending new-package-pending orp-pending pending-signatures reconstruction-pending tests-pending labels Jul 9, 2018

cmsbuild added code-checks-approved and removed code-checks-pending labels Jul 9, 2018

slava77 mentioned this pull request Jul 11, 2018

add mxnet and update data-RecoBTag-Combined.spec (merge of #4167 and #4176) needed for cmssw tests cms-sw/cmsdist#4185

Closed

cmsbuild added requires-external tests-started and removed tests-pending labels Jul 11, 2018

cmsbuild added tests-rejected and removed tests-started labels Jul 11, 2018

kpedro88 reviewed Jul 12, 2018


hqucms mentioned this pull request Sep 7, 2018

Update DeepBoostedJet json to match the MXNet version. cms-data/RecoBTag-Combined#16

Merged

cmsbuild mentioned this pull request Sep 7, 2018

Update DeepBoostedJet json to match the MXNet version cms-sw/cmsdist#4317

Merged

cmsbuild added requires-external tests-started and removed tests-pending labels Sep 7, 2018

cmsbuild added tests-approved and removed tests-started labels Sep 7, 2018

cmsbuild added comparison-available and removed comparison-pending labels Sep 7, 2018

slava77 mentioned this pull request Sep 10, 2018

categories for new packages PhysicsTools/MXNet, RecoBTag/FeatureTools, RecoBTag/MXNet cms-sw/cms-bot#1027

Merged

cmsbuild added reconstruction-approved and removed reconstruction-pending new-package-pending labels Sep 10, 2018

cmsbuild added orp-approved and removed orp-pending labels Sep 11, 2018

cmsbuild merged commit e6ee4a8 into cms-sw:master Sep 11, 2018

This was referenced Sep 11, 2018

[102X][Backport] DeepAK8 tagger integration #24505

Merged

[94X][Backport] DeepAK8 tagger integration #24506

Merged

slava77 mentioned this pull request Sep 12, 2018

(minor) updates in reco comparisons cms-sw/cms-bot#1029

Merged

andrzejnovak mentioned this pull request Oct 18, 2018

DeepDoubleCvL and DeepDoubleCvB tagger integration #24918

Merged
