
Add framework class to enforce schemas for ML feature lists #29818

Open
kpedro88 opened this issue May 12, 2020 · 10 comments

Comments

@kpedro88
Contributor

Originally raised in #29799 (comment).

It would be useful to have a data structure like "edm::featureMap" that enforces a schema for input features for ML algorithms. The schema would include feature names and feature order. This class could have interfaces to output in the formats needed by different ML frameworks (TensorFlow, MXNet, PyTorch, etc.).

Having this would prevent trivial errors in ML output caused by features being provided in the wrong order, etc.
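
One possible shape for such a class, as a rough sketch only (the name `FeatureMap` and its interface are illustrative, not an existing CMSSW type): features are registered once in a fixed order, can only be filled by name, and the ordered vector is what gets handed to the inference engine.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical schema-enforcing feature map: names and their order are
// fixed at construction; values can only be set by name, so a feature
// can never silently land in the wrong slot.
class FeatureMap {
public:
  explicit FeatureMap(const std::vector<std::string>& names) {
    for (unsigned i = 0; i < names.size(); ++i)
      indices_[names[i]] = i;
    values_.resize(names.size(), 0.f);
  }
  void set(const std::string& name, float value) {
    auto it = indices_.find(name);
    if (it == indices_.end())
      throw std::runtime_error("Unknown feature: " + name);
    values_[it->second] = value;
  }
  // The flat, ordered form expected by an inference engine
  // (per-framework exporters could be added on top of this).
  const std::vector<float>& ordered() const { return values_; }

private:
  std::unordered_map<std::string, unsigned> indices_;
  std::vector<float> values_;
};
```

Filling features in any order then yields the same, schema-defined layout, and a misspelled feature name fails loudly instead of corrupting the input.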

@kpedro88
Contributor Author

assign core,reconstruction

@cmsbuild
Contributor

New categories assigned: core,reconstruction

@Dr15Jones,@smuzaffar,@slava77,@perrotta,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @kpedro88 Kevin Pedro.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77
Contributor

slava77 commented May 12, 2020

Are there any real-life examples from industry?
Except perhaps for some semblance of structure in TMVA, the features are passed as BLOBs; there will in any case be at least one place where a sequence of named variables becomes a presumably correctly aligned BLOB.
(However, my knowledge of the options is rather limited.)

@vlimant , some guidance from ML would be nice.
Thank you.

@vlimant
Contributor

vlimant commented May 13, 2020

Thanks for including me @slava77; adding @gkasieczka too.
I support @kpedro88's idea of having such a feature map as a way to enforce the formatting and alignment of the blobs passed down to the ML engine.

@kpedro88
Contributor Author

To solve this problem in my analysis (which uses a BDT), I implemented the following procedure. Code snippets (starting from the CMSSW GBRTree):

class BDTree {
public:
  BDTree() {
    // xml parsing omitted
    const auto& variables = method.child("Variables");
    for (const auto& v : variables.children("Variable")) {
      feature_indices_[v.attribute("Expression").as_string()] =
          std::make_pair(unsigned(v.attribute("VarIndex").as_int()), false);
    }
    features_.resize(feature_indices_.size(), 0.);
  }
  float* SetVariable(const std::string& vname) {
    if (feature_indices_.find(vname) == feature_indices_.end())
      throw std::runtime_error("Unknown variable: " + vname);
    feature_indices_[vname].second = true;
    return &features_[feature_indices_[vname].first];
  }

private:
  std::vector<GBRTree> trees_;
  std::unordered_map<std::string, std::pair<unsigned, bool>> feature_indices_;
  std::vector<float> features_;
};

class BDTVar {
public:
  BDTVar(std::string name) : name_(std::move(name)) {}
  virtual ~BDTVar() = default;
  void SetVariable(BDTree* bdtree) { branch_ = bdtree->SetVariable(name_); }
  virtual void Fill() {}

protected:
  std::string name_;
  float* branch_ = nullptr;
};

Individual features are objects that derive from BDTVar, managed by a factory; each overrides Fill() to specify how to assign its value. Each feature writes through a pointer into the shared feature vector, so the feature order specified at initialization is automatically preserved.

This isn't necessarily the path we want to follow for CMSSW, but it offers one example of how to reduce the opportunities for errors.
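
To illustrate the pattern, here is a self-contained, hypothetical subclass in the spirit of the snippet above (`JetPtVar` and its setters are invented for this example; `SetBranch` stands in for the `SetVariable(BDTree*)` wiring):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal stand-in for the BDTVar base from the snippet above.
class BDTVar {
public:
  explicit BDTVar(std::string name) : name_(std::move(name)) {}
  virtual ~BDTVar() = default;
  virtual void Fill() {}

protected:
  std::string name_;
  float* branch_ = nullptr;  // points into the shared feature vector
};

// Hypothetical concrete feature: Fill() copies the event quantity
// into the feature slot the managing object assigned to this name.
class JetPtVar : public BDTVar {
public:
  JetPtVar() : BDTVar("jet_pt") {}
  void SetBranch(float* b) { branch_ = b; }        // done once, at setup
  void SetSource(const float* pt) { pt_ = pt; }    // per-event input
  void Fill() override { *branch_ = *pt_; }

private:
  const float* pt_ = nullptr;
};
```

The factory would own one such object per feature and simply call Fill() on all of them each event; no per-event bookkeeping of feature order is needed.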

@slava77
Contributor

slava77 commented Jul 25, 2020

@riga @mialiu149

please take a note

@slava77
Contributor

slava77 commented Nov 5, 2020

@riga @mialiu149

please take a note

I'm curious if there were some activities related to this issue in the ML group

@riga
Contributor

riga commented Nov 13, 2020

I'm curious if there were some activities related to this issue in the ML group

Not yet, but seeing this now, I think we could clearly benefit from a common data structure.

Kevin's approach is already quite nice and clear, and I like the idea that features keep a pointer to the value to fill. I can imagine cases, however, where the input data (e.g. a tensor) might change, for instance between events or when a batch size > 1 is used. To avoid having to loop over features to reset these pointers every time, they could store

  • a pointer-to-pointer to the beginning of the tensor data (assuming it's contiguous in memory), which needs to be reset only once by a managing object (the BDTree in Kevin's example),
  • and a feature-specific, constant offset based on the feature order.
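
A minimal sketch of that pointer-to-pointer plus constant-offset idea (names like `Feature` are illustrative, not a proposed API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each feature holds a pointer to the tensor base pointer and a fixed
// offset given by the feature order. Rebinding to a new tensor (new
// event, larger batch) only requires updating the base pointer once,
// with no loop over features.
class Feature {
public:
  Feature(float** base, std::size_t offset) : base_(base), offset_(offset) {}
  void set(float value) { (*base_)[offset_] = value; }

private:
  float** base_;        // pointer to the (possibly re-allocated) tensor data
  std::size_t offset_;  // constant position given by the feature order
};
```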

Also, we could think about the handling of default values. When a feature is "missing" (say "jet4_pt" in an event with only 3 jets), some models expect a very specific value to denote that it is missing (NaN, -1, ...). It would be convenient to catch this in the same data structure (BDTree::SetVariable above). This might already go in the direction of describing the model itself rather than its features, but I've seen quite a few issues with this in the past, so it might be worth solving at the same time.
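
One way such default handling could look, as a hypothetical sketch (`DefaultedFeatures` and its methods are invented for illustration): each feature registers a model-specific sentinel, and a per-event reset restores the sentinels so unfilled features automatically carry the "missing" value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Features registered with a default (sentinel) value; reset() is called
// at the start of each event, so any feature not set afterwards keeps
// its model-specific "missing" marker (NaN, -1, ...).
class DefaultedFeatures {
public:
  void add(const std::string& name, float def) {
    indices_[name] = values_.size();
    defaults_.push_back(def);
    values_.push_back(def);
  }
  void set(const std::string& name, float v) { values_[indices_.at(name)] = v; }
  void reset() { values_ = defaults_; }
  const std::vector<float>& ordered() const { return values_; }

private:
  std::unordered_map<std::string, std::size_t> indices_;
  std::vector<float> defaults_, values_;
};
```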

We'll try to come up with a draft.

@jpata
Contributor

jpata commented May 18, 2022

@riga @yongbinfeng is this planned to be implemented, or should we close this issue?
