Allow dynamic categorical lists #33
Comments
@jzanzig what do you think of an interface like this?
Temporal leakage is worth considering, but maybe not as much as I'm thinking. Even if that query returns choices that didn't exist yet as of the requested as_of_dates, they will get columns but just won't end up with any data. So maybe that's fine?
@andreanr also mentioned supporting getting the feature names from lookup tables as well. For instance, using the above interface, you could either
Neither of these is optimal, so solving this in a better way could be worth it. @andreanr do you think the work you've done on this problem could be used here?
Created dssg/collate#67 for the problem I outlined in the previous comment. It seems more appropriate to solve that problem there than here. For the other use cases (i.e. denormalized tables, categoricals which actually are ids, and cases where joining to look up a value is not too bad), I think we can move ahead here with a simpler interface like the one outlined above. I'm still unsure if there are any leakage problems with creating a list of choices this way.
I'm not sure I completely understand the difference between the way collate is dealing with categoricals and this way that would introduce information leakage. Are you saying that there's a possibility that the select(distinct) query would return some possible values of color_id that wouldn't be present in that specific training or test set, so that when the features themselves are generated the feature will be 0 for all observations? If so, that's fine (it's not leaking information to have a feature that is 0 for all observations, if that's the correct value), and I think the interface (specifying a choice_query) is good.
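To make the all-zero case concrete, here is a minimal sketch. The helper `indicator_columns`, the `color_id_<choice>` column naming, and the sample values are hypothetical and only illustrate the argument; this is not how collate builds its features.

```python
# Hypothetical illustration; none of these names come from the project.
def indicator_columns(values, choices):
    """Return {column_name: [0 or 1 per value]} with one column per choice."""
    return {
        f"color_id_{choice}": [1 if value == choice else 0 for value in values]
        for choice in choices
    }


# Choices as a distinct query over the whole table might return them.
choices = [1, 2, 3]
# Values observed as of an earlier date; color_id 3 did not exist yet.
observed = [1, 2, 2, 1]

columns = indicator_columns(observed, choices)
# The column for the not-yet-existing choice is present but all zeros.
print(columns["color_id_3"])
```

The extra column carries no signal about the future: every observation has the same value in it, which is the correct value as of that date.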
Looks like we agree on both points:
Furthermore, based on Matt's comments above it looks like this feature may make it into collate, so we could get to piggyback on that here.
Going to implement v1 of this here instead of waiting for collate.
- Add new feature categorical option, 'choice_query', which runs a query to determine a list of Collate `choices`
- Cache results of choice query in FeatureGenerator

- Add new feature categorical option, 'choice_query', which runs a query to determine a list of Collate `choices`
- Cache results of choice query in FeatureGenerator
- Add example of dynamic categorical to example_experiment_config.yaml
Dynamic Categorical Choices [Resolves #33]
Pass matrix directory to metta [Resolves #33]
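The commit notes above mention caching the results of the choice query in FeatureGenerator. One way that could look, as a rough sketch only: the class `ChoiceCache`, its method `choices_for`, and the stand-in `fake_run_query` are all hypothetical names, not the project's actual code.

```python
# Hypothetical sketch of caching choice query results by query text.
class ChoiceCache:
    def __init__(self, run_query):
        # run_query: callable taking a SQL string and returning a list of choices
        self._run_query = run_query
        self._cache = {}

    def choices_for(self, choice_query):
        """Run each distinct choice_query at most once and reuse the result."""
        if choice_query not in self._cache:
            self._cache[choice_query] = self._run_query(choice_query)
        return self._cache[choice_query]


# Stand-in for a database call so the sketch runs without a database.
calls = []


def fake_run_query(query):
    calls.append(query)
    return ["red", "blue"]


cache = ChoiceCache(fake_run_query)
first = cache.choices_for("select distinct color from events")
second = cache.choices_for("select distinct color from events")
print(len(calls))  # the query only ran once
```

Keying the cache on the query text means the same choices are reused across every as_of_date, so all matrices end up with the same set of columns.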
The FeatureGenerator should have some way of dynamically computing choices for categoricals. One possible interface could be to have a `choice_query` option, for use in place of a `choices` option.
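A minimal sketch of what resolving a `choice_query` into a list of choices might involve, using an in-memory SQLite database so it runs on its own. The function `choices_from_query`, the `events` table, and the shape of `feature_config` are assumptions for illustration; the real FeatureGenerator configuration and database access may differ.

```python
import sqlite3


# Hypothetical helper; not part of the project's API.
def choices_from_query(connection, choice_query):
    """Run choice_query and return the first column of each row as a list."""
    cursor = connection.execute(choice_query)
    return [row[0] for row in cursor.fetchall()]


# Toy table standing in for the source data.
connection = sqlite3.connect(":memory:")
connection.execute("create table events (entity_id integer, color_id integer)")
connection.executemany(
    "insert into events values (?, ?)",
    [(1, 10), (2, 20), (3, 10)],
)

# Assumed config shape: a categorical with choice_query instead of choices.
feature_config = {
    "column": "color_id",
    "choice_query": "select distinct color_id from events order by color_id",
}

choices = choices_from_query(connection, feature_config["choice_query"])
print(choices)
```

The resulting list can then be used wherever a hand-written `choices` list would otherwise be supplied.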