DeepAK8 tagger integration #23768

Merged
merged 29 commits into from Sep 11, 2018

@hqucms hqucms commented Jul 9, 2018

Introduction

This PR integrates the DeepAK8 tagger into CMSSW. The DeepAK8 tagger is a multi-class tagger for identifying boosted, hadronically decaying top quarks and W, Z, and Higgs bosons using AK8 jets. It uses low-level inputs (jet constituent particles and secondary vertices) and customized deep neural networks, and has shown significant improvement in performance compared to traditional approaches. Two versions of DeepAK8 have been developed: the nominal version aims for the best possible performance but sculpts the mass distribution of background jets, while the mass-decorrelated version aims to minimize mass sculpting while preserving as much performance as possible. Both versions are included in this PR. More details about the DeepAK8 tagger are summarized in the twiki, slides, CMS-DP-2017-049, and the NIPS paper.

Prerequisites

MXNet (as a CMSSW external): cms-sw/cmsdist#4167
DNN model files: cms-data/RecoBTag-Combined#15

Implementation

The implementation in this PR is based on the b-tagging framework. Like DeepFlavour and DeepDoubleB, the DeepAK8 tagger (named pfDeepBoostedJetTags and pfMassDecorrelatedDeepBoostedJetTags in this PR) is trained with MiniAOD inputs (e.g., pat::PackedCandidate), so we follow the same strategy and add this tagger to MiniAOD by updating b-tagging on slimmedJetsAK8. We also tried to set up the code to run on RECO inputs (e.g., reco::PFCandidate), but that part does not work yet and is not used anywhere in this PR. This may be revisited in the future.

An overview of the changes:

DataFormats/BTauReco:

  • new classes for the features and taginfo
  • moved FeaturesTagInfo to a separate file, since it is the base class for DeepFlavour, DeepDoubleB and DeepBoostedJet now

RecoBTag/DeepBoostedJet:

  • producers for the TagInfo and the Tag results
  • corresponding cfi/cff

RecoBTag/TensorFlow:

  • small refactor to allow some functions to be reused in DeepBoostedJet

PhysicsTools/MXNet:

  • convenience wrapper of MXNet based on the C prediction API and unit tests

RecoBTag/Configuration, PhysicsTools/PatAlgos:

  • enable DeepBoostedJet in b-tagging framework
  • add DeepBoostedJet to MiniAOD

DataFormats/Candidate/interface/CompositePtrCandidate.h,
DataFormats/PatCandidates/interface/Jet.h:

Validation

We have compared the discriminator values from this CMSSW implementation to the results obtained from the training framework using a TTBar RelVal sample. As shown below, the CMSSW implementation (running on RECO->MiniAOD) reproduces the results of the training framework (MiniAOD->standalone MXNet) very well.

[Comparison plots, nominal version: comp_pfdeepboosteddiscriminatorsjettags_tvsqcd, comp_pfdeepboosteddiscriminatorsjettags_wvsqcd, comp_pfdeepboostedjettags_probhbb]

[Comparison plots, mass-decorrelated version: comp_pfmassdecorrelateddeepboosteddiscriminatorsjettags_tvsqcd, comp_pfmassdecorrelateddeepboosteddiscriminatorsjettags_wvsqcd, comp_pfmassdecorrelateddeepboostedjettags_probhbb]
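
A numerical version of this kind of cross-check could look like the sketch below. It is only an illustration: the function name `compare_scores`, the tolerance `abs_tol`, and the score values are assumptions made for the example and are not taken from the validation in this PR.

```python
def compare_scores(reference, candidate, abs_tol=1e-5):
    """Return (max_abs_diff, mismatched_indices) for two equal-length score lists."""
    if len(reference) != len(candidate):
        raise ValueError("score lists must have the same length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    # Indices of jets whose two scores differ by more than the tolerance.
    mismatches = [i for i, d in enumerate(diffs) if d > abs_tol]
    return (max(diffs) if diffs else 0.0), mismatches


# Toy numbers for illustration only; not results from this PR.
training_scores = [0.12, 0.87, 0.55]
cmssw_scores = [0.12, 0.870001, 0.55]
max_diff, mismatched = compare_scores(training_scores, cmssw_scores)
```

With a per-jet comparison like this, any jet whose two discriminator values disagree beyond the chosen tolerance is reported by index, which makes it easier to trace the disagreement back to a specific input.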

cmsbuild commented Jul 9, 2018

The code-checks are being triggered in jenkins.

cmsbuild commented Jul 9, 2018

cmsbuild commented Jul 9, 2018

A new Pull Request was created by @hqucms (Huilin Qu) for master.

It involves the following packages:

DataFormats/BTauReco
DataFormats/Candidate
DataFormats/PatCandidates
PhysicsTools/MXNet
PhysicsTools/PatAlgos
RecoBTag/Configuration
RecoBTag/DeepBoostedJet
RecoBTag/TensorFlow

The following packages do not have a category, yet:

PhysicsTools/MXNet
RecoBTag/DeepBoostedJet
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@perrotta, @monttj, @cmsbuild, @slava77, @gpetruc, @arizzi can you please review it and eventually sign? Thanks.
@TaiSakuma, @gouskos, @rappoccio, @HeinerTholen, @seemasharmafnal, @mmarionncern, @imarches, @ahinzmann, @smoortga, @acaudron, @jdolen, @drkovalskyi, @ferencek, @rovere, @jdamgov, @nhanvtran, @gkasieczka, @schoef, @clelange, @JyothsnaKomaragiri, @mverzett, @cbernet, @gpetruc, @mariadalfonso, @pvmulder this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

slava77 commented Jul 11, 2018

@cmsbuild please tests with cms-sw/cmsdist#4185

slava77 commented Jul 11, 2018

@cmsbuild please test with cms-sw/cmsdist#4185

trying again
@smuzaffar, was there some downtime in jenkins, or do I have some typo in the test request?

cmsbuild commented Jul 11, 2018

The tests are being triggered in jenkins.
Using externals from cms-sw/cmsdist#4185
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/29070/console

davidlange6 commented Jul 11, 2018 via email

@cmsbuild

-1

Tested at: 9160779

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-23768/29070/summary.html

I found the following errors while testing this PR

Failed tests: RelVals AddOn

  • RelVals:

The relvals timed out after 2 hours.
When I ran the RelVals I found an error in the following workflows:
1000.0 step3

runTheMatrix-results/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step3_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log

136.85 step3
runTheMatrix-results/136.85_RunEGamma2018A+RunEGamma2018A+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM/step3_RunEGamma2018A+RunEGamma2018A+HLTDR2_2018+RECODR2_2018reHLT_skimEGamma_Prompt_L1TEgDQM+HARVEST2018_L1TEgDQM.log

20434.0 step1
runTheMatrix-results/20434.0_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D19_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D19+RecoFullGlobal_2023D19+HARVESTFullGlobal_2023D19/step1_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D19_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D19+RecoFullGlobal_2023D19+HARVESTFullGlobal_2023D19.log

21234.0 step1
runTheMatrix-results/21234.0_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D21_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D21+RecoFullGlobal_2023D21+HARVESTFullGlobal_2023D21/step1_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D21_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D21+RecoFullGlobal_2023D21+HARVESTFullGlobal_2023D21.log

  • AddOn:

I found errors in the following addon tests:

@kpedro88 kpedro88 left a comment

I think it is also important to test the CPU usage of this module compared to other reco/analysis modules, and understand if the network can be simplified or if other optimizations can be made.

// convert inputs
make_inputs(taginfo);
// run prediction and get outputs
outputs = predictor_->predict(data_);
Contributor

When I ran a simplified version of this producer (https://github.com/TreeMaker/TreeMaker/blob/c3be78637ae6d4ba4692cd01b68a1b149f005a38/Utils/src/DeepAK8Producer.cc, using the GitLab version) over 2018 prompt data (just for testing), I occasionally ran into "Error running forward" exceptions. This happened when I ran using 4 threads, but not when I reran the same event using 1 thread, so I suspect it is a data race or other thread-safety issue in the mxnet library.

The GitLab version used mxnet 1.1.0, while the version added to CMSSW is 1.2.0 (cms-sw/cmsdist#4167). It is possible the data race was fixed in the newer version. However, this should be tested carefully.

Contributor Author

@kpedro88 This is very useful to know. When I was testing this PR, I tried running over the TTBar RelVal samples with 8 threads for 9000 events and did not see any error. Then, looking at the MXNet changelog from 1.1.0 to 1.2.0, I indeed noticed this one:

Fixed race condition for CPUSharedStorageManager->Free and launched workers at iter init stage to avoid frequent relaunch (apache/mxnet#10096).

So I suspect this is the cause of what you saw, and moving to 1.2.0 should solve it. Of course, more tests and feedback are more than welcome :)

Contributor

It sometimes took more than 10K events before I saw an exception when running with 4 threads. @slava77 told me he might try to run it on KNL, in which case we will definitely find out if there are still data races in the 1.2.0 release of mxnet.

Contributor Author

OK, then I can probably try to run with more events.

> @slava77 told me he might try to run it on KNL, in which case we will definitely find out if there are still data races in the 1.2.0 release of mxnet.

This would be very interesting to see :)
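
The reproduction strategy discussed in this thread (run the same events with one thread and with several, then look for differences or exceptions) can be sketched as a small harness. This is a minimal sketch under stated assumptions: `predict` is a hypothetical pure-Python stand-in and not the MXNet API, and the event contents are made up. It only illustrates the comparison logic; a real test would call the actual predictor, and a race inside a native library would show up here either as a disagreement or as an exception propagated from the worker threads.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(features):
    """Hypothetical stand-in for a model call; a real test would call the predictor."""
    total = sum(features)
    return [f / total for f in features] if total else list(features)


def run_serial(events):
    return [predict(e) for e in events]


def run_threaded(events, n_threads):
    # Executor.map preserves input order, so results line up with run_serial.
    # Any exception raised in a worker is re-raised when the results are consumed.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(predict, events))


def find_disagreements(events, n_threads=4):
    """Return indices of events whose threaded output differs from the serial output."""
    serial = run_serial(events)
    threaded = run_threaded(events, n_threads)
    return [i for i, (s, t) in enumerate(zip(serial, threaded)) if s != t]


# Made-up inputs for illustration only.
events = [[float(i + 1), 2.0, 3.0] for i in range(1000)]
disagreements = find_disagreements(events, n_threads=4)
```

Since the failures reported above sometimes appeared only after more than 10K events, a harness like this would need to run over a comparably large sample before an empty result is meaningful.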


}

if (debug_){
Contributor

all debug outputs should be replaced with LogDebug

'BTagProbabilityToDiscriminator',
discriminators = cms.VPSet(
cms.PSet(
name = cms.string('TvsQCD'),
Contributor

assuming these are the "binarized" scores provided in the GitLab version, it would be nice to add the separate ZvsQCD, ZbbvsQCD, HbbvsQCD, H4qvsQCD discriminators that were provided there
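
As a rough illustration of what a "binarized" score means here, the sketch below assumes it is the ratio of the summed signal-class probabilities to the summed signal-plus-QCD probabilities. The function name `binarized_score`, the class names, the probability values, and the zero-denominator fallback are all assumptions made for the example; they are not the tagger's actual output list or the behaviour of BTagProbabilityToDiscriminator.

```python
def binarized_score(probs, signal_keys, background_keys):
    """Ratio of summed signal probabilities to summed signal-plus-background probabilities."""
    numerator = sum(probs[k] for k in signal_keys)
    denominator = numerator + sum(probs[k] for k in background_keys)
    # Returning 0.0 for an empty denominator is a choice made for this sketch.
    return numerator / denominator if denominator > 0 else 0.0


# Illustrative class names and values; not the tagger's actual outputs.
probs = {"top": 0.6, "W": 0.1, "Z": 0.1, "QCD": 0.2}
t_vs_qcd = binarized_score(probs, ["top"], ["QCD"])
z_vs_qcd = binarized_score(probs, ["Z"], ["QCD"])
```

Under this assumption, adding a ZvsQCD or HbbvsQCD discriminator only requires choosing a different set of signal classes for the numerator; the QCD classes in the denominator stay the same.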

@kpedro88

I also hope this can be backported to 94X for analysis use (and maybe 101X for prompt data studies, though not as essential).

@fabiocos

@kpedro88 I think that we should test it before in master, then if there is a use case for 94X we can consider it

@kpedro88

@fabiocos sure, it should be tested in master first (and needs extensive testing, IMO). But I know of several 2016+2017 analyses that want to use this, so it's preferable to have a 94X release that includes it, eventually.

slava77 commented Jul 12, 2018

@cmsbuild please test with cms-sw/cmsdist#4185

hqucms commented Sep 7, 2018

> Looking in the test outputs, e.g. 136.8311, we have new printouts
>
> [17:56:14] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
> [17:56:14] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
> [17:56:14] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.1.0. Attempting to upgrade...
> [17:56:14] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
>
> these better be suppressed by using up-to-date inputs.

This should be fixed by cms-data/RecoBTag-Combined#16.

slava77 commented Sep 7, 2018

@cmsbuild please test with cms-sw/cmsdist#4317

cmsbuild commented Sep 7, 2018

The tests are being triggered in jenkins.
Using externals from cms-sw/cmsdist#4317
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/30296/console

hqucms commented Sep 7, 2018

@slava77
I tested the Phase2 workflows 20034.0 and 20034.11, and they seem to run fine. I also took a look at the timing on a PU200 ttbar sample with:
cmsDriver.py step1 --filein 'dbs:/RelValTTbar_14TeV/CMSSW_10_2_0-PU25ns_102X_upgrade2023_realistic_v7_2023D29PU200-v1/GEN-SIM-RECO' -n 100 --fileout file:output_step1.root --mc --eventcontent AODSIM,MINIAODSIM --runUnscheduled --datatier AODSIM,MINIAODSIM --conditions auto:phase2_realistic --beamspot HLLHC14TeV --customise_commands "process.AODSIMoutput.outputCommands.append('keep recoTrackExtras_generalTracks_*_*')" --step PAT --nThreads 8 --geometry Extended2023D29 --era Phase2_timing --python_filename phase2.py --no_exec

And it also looks reasonable to me:

TimeReport   0.000059     0.000059     0.000059  pfDeepBoostedDiscriminatorsJetTagsSlimmedAK8DeepTags
TimeReport   0.000337     0.000337     0.000337  pfDeepBoostedJetTagInfosSlimmedAK8DeepTags
TimeReport   0.006716     0.006716     0.006716  pfDeepBoostedJetTagsSlimmedAK8DeepTags
TimeReport   0.000062     0.000062     0.000062  pfMassDecorrelatedDeepBoostedDiscriminatorsJetTagsSlimmedAK8DeepTags
TimeReport   0.007515     0.007515     0.007515  pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags
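
To add up the cost of the new modules from such a printout, a small parser is enough. The sketch below is hedged: the function name `parse_time_report` is hypothetical, and it assumes that each line has exactly three numeric columns followed by the module label and that the first column is the quantity of interest.

```python
def parse_time_report(text):
    """Map module label -> first numeric column of each 'TimeReport' line."""
    times = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect: 'TimeReport', three numbers, then the module label.
        if len(parts) == 5 and parts[0] == "TimeReport":
            times[parts[4]] = float(parts[1])
    return times


report = """\
TimeReport   0.000059     0.000059     0.000059  pfDeepBoostedDiscriminatorsJetTagsSlimmedAK8DeepTags
TimeReport   0.000337     0.000337     0.000337  pfDeepBoostedJetTagInfosSlimmedAK8DeepTags
TimeReport   0.006716     0.006716     0.006716  pfDeepBoostedJetTagsSlimmedAK8DeepTags
TimeReport   0.000062     0.000062     0.000062  pfMassDecorrelatedDeepBoostedDiscriminatorsJetTagsSlimmedAK8DeepTags
TimeReport   0.007515     0.007515     0.007515  pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags
"""
times = parse_time_report(report)
total = sum(times.values())
```

For the five lines quoted above, the first-column values add up to about 0.0147, with the two JetTags modules (the ones that run the network) accounting for nearly all of it.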

cmsbuild commented Sep 7, 2018

cmsbuild commented Sep 7, 2018

Comparison job queued.

cmsbuild commented Sep 7, 2018

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-23768/30296/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 500 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 3143975
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3143776
  • DQMHistoTests: Total skipped: 197
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
  • Checked 133 log files, 14 edm output root files, 32 DQM output files

slava77 commented Sep 10, 2018

+1

for #23768 1acac95

  • code changes are in line with the PR description and the follow up review. This PR is expected to modify the miniAOD content in saved AK8 jet discriminants (embedded in the jets): 46 new discriminants were added, compared to 10 previously available.
  • jenkins tests pass and comparisons with the baseline show differences in the AK8 jet discriminants size.
  • [partly based on https://github.com/cms-sw/cmssw/pull/23768#issuecomment-417055124] local tests show the cost increase from running the new taggers is acceptable:
    • disk size is up by less than 1% (1% is observed in 1000 event test, but it should get a bit better with more events and compression)
    • CPU use has increased by about 3% for miniAOD-only jobs (~0.3% of total reco time)
    • RSS size increased by 23 MB, of which 11 MB is expected to scale up with the number of threads

@fabiocos

+1

the python additions look compatible with the recent updates
a data file *params is added in a test area, but it is only 480 bytes large, it can be ok

@fabiocos

merge
