Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add DeepJet inputs #39

Merged
merged 12 commits into from
Dec 22, 2021
Merged

feat: add DeepJet inputs #39

merged 12 commits into from
Dec 22, 2021

Conversation

AnnikaStein
Copy link
Contributor

  • Add custom producer to get information for 25 cpfs, 25 npfs and 4 SVs (similar to JetConstituentTableProducer), but store these in the already existing Jet table (flat, no new collections) - if requested via add_DeepJet and keep_inputs
  • Include full output nodes for DeepJet (if requested via add_DeepJet) - I guess Keep full output nodes for DeepCSV and DeepJet #35 is related
  • Second new option add_DeepJet_noclip, which adds more variables on top of add_DeepJet: uses raw quantities without any preprocessing by catch_infs_and_bound (variables are not clipped / scaled)

@andrzejnovak

@AnnikaStein
Copy link
Contributor Author

AnnikaStein commented Dec 12, 2021

  • Access DeepJet inputs via TagInfoCollection in a new, custom producer, similar to JetConstituentTableProducer.cc, but using the logic of DeepFlavourONNXJetTagsProducer.cc
  • No more recalculation of features necessary (but could of course revert to the old way and compare - my personal checks showed they are in good agreement) - see this notebook for details (and scroll all the way down, I'm sorry)
  • Minor clean up in other files

Note: size and descriptions for 5000 events when DeepJet inputs (and output probabilities) are added. (Cf. old size and descriptions from recalculating variables. See also the same number of events, without DeepJet here: size and description.)

python/addBTV.py Outdated
@@ -6,7 +6,7 @@
from PhysicsTools.PatAlgos.tools.helpers import addToProcessAndTask, getPatAlgosToolsTask


def update_jets_AK4(process):
def update_jets_AK4(process, add_DeepJet):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would shuffle to logic around a bit here. Have the DeepCSV/DeepJet output nodes in as the default (~10 branches anyway) and keep the DeepCSV/DeepJet inputs as toggles

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. And would you prefer one individual toggle per type of tagger or use only the keepInputs toggle for enabling all tagger inputs at the same time? Don't know if there are use cases where one only needs e.g. DeepCSV but not DeepJet, or DDX...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking as we move on we might want to be dropping DeepCSV, also not all studies will need all taggers. I think my preference would be passing a list with up to ['DeepCSV', 'DeepJet', 'DDX'] and pattern match it after, but having it as 3 bools is not that bad either

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so that would make 2^3=8 possible combinations to be added in pfnano_cff.py for each MC or Data configuration, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you go with bools, I'd say it's not necessary that all combinations are part of the cff, can be added as/if needed

python/addBTV.py Outdated
float,
doc="DeepJet gluon tag probability",
precision=10),
# anstein: could add discriminators here (or just leave it as it is, only output nodes with individual probabilities)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add the discriminators, it's just 3 branches anyway, B, Cvl, CvB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw that these discriminators are already included via nano (declared inside jets_cff.py):

'Jet_btagDeepFlavB',
'Jet_btagDeepFlavC',  # this one only until 106x
'Jet_btagDeepFlavCvB',
'Jet_btagDeepFlavCvL',
'Jet_btagDeepFlavQG',

So I can probably keep it as it is? (btagDeepFlavC will be duplicated, depending on the era (<= 106x), discriminators are always in, not dependent on the era, individual probabilities I add manually, not dependent on the era)

See
https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/NanoAOD/python/jets_cff.py#L233-L239
for the discriminators

and
https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/NanoAOD/python/jets_cff.py#L268
for btagDeepFlavC

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 indeed, in fact if you could check there aren't any duplicate branches in the current output fill overall that would be great

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def duplicates(mykeys):
    seen = set()
    for x in mykeys:
        if x in seen: 
            print(x)
        seen.add(x)
    return

duplicates(keys)

used for the configuration PFnano_customizeMC_add_DeepJet gives me this output

FatJet_btagDDBvLV2
FatJet_btagDDCvBV2
FatJet_btagDDCvLV2
Jet_btagDeepFlavC
FatJet_nBHadrons
FatJet_nCHadrons
SubJet_nBHadrons
SubJet_nCHadrons

Jet_btagDeepFlavC was expected, as explained above, the others are not related to what I added or modified in this PR I guess.

const edm::EDGetTokenT<TagInfoCollection> tag_info_src_;
// anstein: could be used later to manually change number of features and constituents per feature
/*
constexpr static unsigned n_features_global_ = 15;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better to use these, than to hardcode the number later on in each entry



// ============================================================== Cpfs ===================================================================
input_name = "DeepJet_Cpfcan_puppiw_" + s;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these can be formated better? Defining and overwriting description and inpu_name for each entry is redundant

Copy link
Collaborator

@andrzejnovak andrzejnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, looks good to me. I left a couple of comments, mostly to clean up commented out code, but then we can merge


edm::EDGetTokenT<edm::View<T>> jet_token_;

// anstein: from ONNX producer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// anstein: from ONNX producer

typedef std::vector<reco::DeepFlavourTagInfo> TagInfoCollection;
const edm::EDGetTokenT<TagInfoCollection> tag_info_src_;

// anstein: number of constituents will be used, number of features per constituent not necessary
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// anstein: number of constituents will be used, number of features per constituent not necessary

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove commented out stuff here

for (unsigned int p = 0; p < n_cpf_; p++) {
auto s = std::to_string(p);

djTable->addColumn<float>("DeepJet_Cpfcan_puppiw_" + s, Cpfcan_puppiw_nCpf[p], "charged candidate PUPPI weight of the " + s + ". cpf", nanoaod::FlatTable::FloatColumn, 10);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
djTable->addColumn<float>("DeepJet_Cpfcan_puppiw_" + s, Cpfcan_puppiw_nCpf[p], "charged candidate PUPPI weight of the " + s + ". cpf", nanoaod::FlatTable::FloatColumn, 10);
djTable->addColumn<float>("DeepJet_Cpfcan_puppiw_" + s,
Cpfcan_puppiw_nCpf[p],
"charged candidate PUPPI weight of the " + s + ". cpf",
nanoaod::FlatTable::FloatColumn, 10);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we format like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we can! :) will change all affected lines and split them to improve readability.

djTable->addColumn<float>("DeepJet_sv_d3dsig_" + s, sv_d3dsig_nSV[p], "3D impact parameter (flight distance) significance of the " + s + ". SV", nanoaod::FlatTable::FloatColumn, 10);
djTable->addColumn<float>("DeepJet_sv_costhetasvpv_" + s, sv_costhetasvpv_nSV[p], "cosine of the angle between the " + s + ". SV flight direction and the direction of the " + s + ". SV momentum", nanoaod::FlatTable::FloatColumn, 10);
/*
// anstein: only relevant if also included in the tag info, not yet, maybe in future versions of the tagger
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep these, as they might be relevant later, but as a rule most comments could/should be removed

@AnnikaStein
Copy link
Contributor Author

All requested changes should be in now, but as I'm only Author, not Collaborator, I think someone else has to merge this PR.

@andrzejnovak andrzejnovak changed the title Add DeepJet inputs feat: add DeepJet inputs Dec 22, 2021
@andrzejnovak andrzejnovak merged commit bbb8657 into cms-jet:master Dec 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants