Add framework class to enforce schemas for ML feature lists #29818
Comments
assign core,reconstruction |
New categories assigned: core,reconstruction @Dr15Jones,@smuzaffar,@slava77,@perrotta,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @kpedro88 Kevin Pedro. @Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
are there any real life examples from the "industry"? @vlimant , some guidance from ML would be nice. |
thanks for including me @slava77 , adding @gkasieczka too |
To solve this problem in my analysis (which uses a BDT), I implemented a procedure to do this. Code snippets (starting from the CMSSW GBRTree):

    class BDTree {
    public:
      BDTree() {
        // xml parsing omitted
        const auto& variables = method.child("Variables");
        for (const auto& v : variables.children("Variable")) {
          // feature name -> (index in the input vector, "has been set" flag)
          feature_indices_[v.attribute("Expression").as_string()] = std::make_pair(unsigned(v.attribute("VarIndex").as_int()), false);
        }
        features_.resize(feature_indices_.size(), 0.);
      }
      // returns the address where the value of the named feature must be written
      float* SetVariable(const std::string& vname) {
        auto it = feature_indices_.find(vname);
        if (it == feature_indices_.end()) throw std::runtime_error("Unknown variable: " + vname);
        it->second.second = true;
        return &features_[it->second.first];
      }
    private:
      std::vector<GBRTree> trees_;
      std::unordered_map<std::string, std::pair<unsigned, bool>> feature_indices_;
      std::vector<float> features_;
    };

    class BDTVar {
    public:
      BDTVar(std::string name) : name_(name) {}
      // look up this feature by name and remember where to write its value
      void SetVariable(BDTree* bdtree) { branch_ = bdtree->SetVariable(name_); }
      virtual void Fill() {}
    protected:
      std::string name_;
      float* branch_ = nullptr;
    };

Individual features are objects that derive from This isn't necessarily the path we want to follow for CMSSW, but it offers one example of how to reduce the opportunities for errors. |
please take a note |
I'm curious if there were some activities related to this issue in the ML group |
Not yet, but seeing this now, I think we could clearly benefit from a common data structure. Kevin's approach is already quite nice and clear, and I like the idea that features keep a pointer to the value to fill. I can imagine cases, however, where the input data (e.g. a tensor) might change, for instance between events or when a batch size > 1 is used. To avoid having to loop over features to reset these pointers every time, they could store
Also, we could think about the handling of default values. When a feature is "missing" - say "jet4_pt" in an event with only 3 jets - some models expect a very specific value to denote that this feature is missing (nan, -1, ...). It would be convenient to catch this in the same data structure ( We'll try to come up with a draft. |
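The default-value handling described above could be sketched roughly as follows. This is only an illustration under assumptions: `FeatureSpec`, `missingValue`, and `fillWithDefaults` are hypothetical names, not part of CMSSW or any draft mentioned in this thread, and the per-feature default is assumed to be a single float.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical schema entry: a feature name plus the value the model
// expects when that feature is missing (e.g. -1 or NaN).
struct FeatureSpec {
  std::string name;
  float missingValue;
};

// Build the input vector in schema order. Features absent from
// `available` are replaced by their declared default.
inline std::vector<float> fillWithDefaults(
    const std::vector<FeatureSpec>& schema,
    const std::unordered_map<std::string, float>& available) {
  std::vector<float> out;
  out.reserve(schema.size());
  for (const auto& spec : schema) {
    auto it = available.find(spec.name);
    out.push_back(it != available.end() ? it->second : spec.missingValue);
  }
  return out;
}
```

With a schema containing "jet3_pt" and "jet4_pt" and an event that only provides "jet3_pt", the second slot would receive the declared default instead of being left uninitialized.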
@riga @yongbinfeng is this planned to be implemented, or should we close this issue? |
Originally raised in #29799 (comment).
It would be useful to have a data structure like "edm::featureMap" that enforces a schema for input features for ML algorithms. The schema would include feature names and feature order. This class could have interfaces to output in the formats needed by different ML frameworks (TensorFlow, MXNet, PyTorch, etc.).
Having this would prevent trivial errors in ML output caused by features being provided in the wrong order, etc.
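As a rough illustration of the idea, a schema-enforcing container might look like the sketch below. The class name `FeatureMap` and its methods `set`, `ordered`, and `names` are assumptions made for this example; no such class exists in `edm` today, and the framework-specific output interfaces are left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch, not an existing CMSSW class.
class FeatureMap {
public:
  // The schema is the ordered list of feature names the model was trained with.
  explicit FeatureMap(const std::vector<std::string>& schema)
      : names_(schema), values_(schema.size(), 0.f), filled_(schema.size(), false) {
    for (std::size_t i = 0; i < names_.size(); ++i) {
      if (!index_.emplace(names_[i], i).second)
        throw std::invalid_argument("Duplicate feature in schema: " + names_[i]);
    }
  }

  // Set a feature by name; names outside the schema are rejected.
  void set(const std::string& name, float value) {
    auto it = index_.find(name);
    if (it == index_.end())
      throw std::out_of_range("Unknown feature: " + name);
    values_[it->second] = value;
    filled_[it->second] = true;
  }

  // Return the values in schema order; fail if any feature was never set.
  const std::vector<float>& ordered() const {
    for (std::size_t i = 0; i < filled_.size(); ++i) {
      if (!filled_[i])
        throw std::logic_error("Feature not filled: " + names_[i]);
    }
    return values_;
  }

  const std::vector<std::string>& names() const { return names_; }

private:
  std::vector<std::string> names_;
  std::vector<float> values_;
  std::vector<bool> filled_;
  std::unordered_map<std::string, std::size_t> index_;
};
```

Because values are written by name and read back in schema order, the order in which the calling code fills features no longer matters, and a misspelled or forgotten feature raises an error instead of silently shifting the inputs.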