
Add framework class to enforce schemas for ML feature lists #29818

Open
kpedro88 opened this issue May 12, 2020 · 10 comments

Comments

@kpedro88
Contributor

Originally raised in #29799 (comment).

It would be useful to have a data structure like "edm::featureMap" that enforces a schema for input features for ML algorithms. The schema would include feature names and feature order. This class could have interfaces to output in the formats needed by different ML frameworks (TensorFlow, MXNet, PyTorch, etc.).

Having this would prevent trivial errors in ML output caused by features being provided in the wrong order, etc.
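
One possible shape for such a class, as a rough sketch only (the name `FeatureMap` and its interface are illustrative, not an existing CMSSW type): features are registered once in a fixed order, can only be filled by name, and the ordered vector is what gets handed to the inference engine.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical schema-enforcing feature map: names and their order are
// fixed at construction; values can only be set by name, so a feature
// can never silently land in the wrong slot.
class FeatureMap {
public:
  explicit FeatureMap(const std::vector<std::string>& names) {
    for (unsigned i = 0; i < names.size(); ++i)
      indices_[names[i]] = i;
    values_.resize(names.size(), 0.f);
  }
  void set(const std::string& name, float value) {
    auto it = indices_.find(name);
    if (it == indices_.end())
      throw std::runtime_error("Unknown feature: " + name);
    values_[it->second] = value;
  }
  // The flat, ordered form expected by an inference engine
  // (per-framework exporters could be added on top of this).
  const std::vector<float>& ordered() const { return values_; }

private:
  std::unordered_map<std::string, unsigned> indices_;
  std::vector<float> values_;
};
```

Filling features in any order then yields the same, schema-defined layout, and a misspelled feature name fails loudly instead of corrupting the input.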

@kpedro88
Contributor Author

assign core,reconstruction

@cmsbuild
Contributor

New categories assigned: core,reconstruction

@Dr15Jones,@smuzaffar,@slava77,@perrotta,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @kpedro88 Kevin Pedro.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77
Contributor

slava77 commented May 12, 2020

Are there any real-life examples from industry?
Except perhaps for some semblance of structure in TMVA, the features are passed as BLOBs; there will in any case be at least one place where a sequence of named variables becomes a presumably correctly aligned BLOB.
(However, my knowledge of the options is rather limited.)

@vlimant , some guidance from ML would be nice.
Thank you.

@vlimant
Contributor

vlimant commented May 13, 2020

Thanks for including me @slava77; adding @gkasieczka too.
I support @kpedro88's idea of having such a feature map as a way to enforce the formatting and alignment of the blobs passed down to the ML engine.

@kpedro88
Contributor Author

To solve this problem in my analysis (which uses a BDT), I implemented the following procedure. Code snippets (starting from the CMSSW GBRTree):

class BDTree {
public:
  BDTree() {
    // xml parsing omitted
    const auto& variables = method.child("Variables");
    for (const auto& v : variables.children("Variable")) {
      feature_indices_[v.attribute("Expression").as_string()] =
          std::make_pair(unsigned(v.attribute("VarIndex").as_int()), false);
    }
    features_.resize(feature_indices_.size(), 0.);
  }
  float* SetVariable(const std::string& vname) {
    if (feature_indices_.find(vname) == feature_indices_.end())
      throw std::runtime_error("Unknown variable: " + vname);
    feature_indices_[vname].second = true;
    return &features_[feature_indices_[vname].first];
  }

private:
  std::vector<GBRTree> trees_;
  std::unordered_map<std::string, std::pair<unsigned, bool>> feature_indices_;
  std::vector<float> features_;
};

class BDTVar {
public:
  BDTVar(std::string name) : name_(std::move(name)) {}
  virtual ~BDTVar() = default;
  void SetVariable(BDTree* bdtree) { branch_ = bdtree->SetVariable(name_); }
  virtual void Fill() {}

protected:
  std::string name_;
  float* branch_ = nullptr;
};

Individual features are objects that derive from BDTVar, managed by a factory; each overrides Fill() to specify how to assign its value. Each feature writes through a pointer into the shared feature vector, so the feature order specified at initialization is automatically preserved.

This isn't necessarily the path we want to follow for CMSSW, but it offers one example of how to reduce the opportunities for errors.
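
To illustrate the pattern, here is a self-contained, hypothetical subclass in the spirit of the snippet above (`JetPtVar` and its setters are invented for this example; `SetBranch` stands in for the `SetVariable(BDTree*)` wiring):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal stand-in for the BDTVar base from the snippet above.
class BDTVar {
public:
  explicit BDTVar(std::string name) : name_(std::move(name)) {}
  virtual ~BDTVar() = default;
  virtual void Fill() {}

protected:
  std::string name_;
  float* branch_ = nullptr;  // points into the shared feature vector
};

// Hypothetical concrete feature: Fill() copies the event quantity
// into the feature slot the managing object assigned to this name.
class JetPtVar : public BDTVar {
public:
  JetPtVar() : BDTVar("jet_pt") {}
  void SetBranch(float* b) { branch_ = b; }        // done once, at setup
  void SetSource(const float* pt) { pt_ = pt; }    // per-event input
  void Fill() override { *branch_ = *pt_; }

private:
  const float* pt_ = nullptr;
};
```

The factory would own one such object per feature and simply call Fill() on all of them each event; no per-event bookkeeping of feature order is needed.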

@slava77
Contributor

slava77 commented Jul 25, 2020

@riga @mialiu149

please take a note

@slava77
Contributor

slava77 commented Nov 5, 2020

@riga @mialiu149

please take a note

I'm curious if there were some activities related to this issue in the ML group

@riga
Contributor

riga commented Nov 13, 2020

I'm curious if there were some activities related to this issue in the ML group

Not yet, but seeing this now, I think we could clearly benefit from a common data structure.

Kevin's approach is already quite nice and clear, and I like the idea that features keep a pointer to the value to fill. I can imagine cases, however, where the input data (e.g. a tensor) might change, for instance between events or when a batch size > 1 is used. To avoid having to loop over features to reset these pointers every time, they could store

  • a pointer-to-pointer to the beginning of the tensor data (assuming it's contiguous in memory), which needs to be reset only once by a managing object (the BDTree in Kevin's example),
  • and a feature-specific, constant offset based on the feature order.
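
A minimal sketch of that pointer-to-pointer plus constant-offset idea (names like `Feature` are illustrative, not a proposed API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each feature holds a pointer to the tensor base pointer and a fixed
// offset given by the feature order. Rebinding to a new tensor (new
// event, larger batch) only requires updating the base pointer once,
// with no loop over features.
class Feature {
public:
  Feature(float** base, std::size_t offset) : base_(base), offset_(offset) {}
  void set(float value) { (*base_)[offset_] = value; }

private:
  float** base_;        // pointer to the (possibly re-allocated) tensor data
  std::size_t offset_;  // constant position given by the feature order
};
```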

Also, we could think about the handling of default values. When a feature is "missing" (say "jet4_pt" in an event with only 3 jets), some models expect a very specific value to denote that it is missing (NaN, -1, ...). It would be convenient to catch this in the same data structure (BDTree::SetVariable above). This might already go in the direction of describing the model itself rather than its features, but I've seen quite a few issues with this in the past, so it might be worth solving at the same time.
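
One way such default handling could look, as a hypothetical sketch (`DefaultedFeatures` and its methods are invented for illustration): each feature registers a model-specific sentinel, and a per-event reset restores the sentinels so unfilled features automatically carry the "missing" value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Features registered with a default (sentinel) value; reset() is called
// at the start of each event, so any feature not set afterwards keeps
// its model-specific "missing" marker (NaN, -1, ...).
class DefaultedFeatures {
public:
  void add(const std::string& name, float def) {
    indices_[name] = values_.size();
    defaults_.push_back(def);
    values_.push_back(def);
  }
  void set(const std::string& name, float v) { values_[indices_.at(name)] = v; }
  void reset() { values_ = defaults_; }
  const std::vector<float>& ordered() const { return values_; }

private:
  std::unordered_map<std::string, std::size_t> indices_;
  std::vector<float> defaults_, values_;
};
```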

We'll try to come up with a draft.

@jpata
Contributor

jpata commented May 18, 2022

@riga @yongbinfeng is this planned to be implemented, or should we close this issue?
