Allow dynamic categorical lists #33

Closed
thcrock opened this issue Mar 1, 2017 · 6 comments

@thcrock
Contributor

thcrock commented Mar 1, 2017

The FeatureGenerator should have some way of dynamically computing choices for categoricals. One possible interface could be to have a choice_query option, for use in place of a choices option.

@thcrock thcrock added this to the v0.1 milestone Mar 1, 2017
@thcrock
Contributor Author

thcrock commented Mar 2, 2017

@jzanzig what do you think of an interface like this?

        categoricals:
            -
                column: 'color_id'
                choice_query: 'select distinct(color_id) from myschema.colors'
                metrics: ['sum']

Temporal leakage is worth considering, but maybe not as much as I'm thinking. Even if that query returns choices that didn't exist yet as of the requested as_of_dates, those choices will get columns that simply won't contain any data. So maybe that's fine?
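
To make the proposed option concrete, here is a minimal sketch of how a `choice_query` entry might be resolved into a list of choices. It is not the project's implementation: `resolve_choices` is a hypothetical helper, and an in-memory SQLite table named `colors` stands in for `myschema.colors`.

```python
import sqlite3


def resolve_choices(conn, categorical):
    """Return the choices for one categorical config entry.

    Hypothetical helper: uses the static 'choices' list when present,
    otherwise runs 'choice_query' and takes the first column of each row.
    """
    if "choices" in categorical:
        return list(categorical["choices"])
    rows = conn.execute(categorical["choice_query"]).fetchall()
    return [row[0] for row in rows]


# Stand-in for myschema.colors (assumption: SQLite instead of the real database).
conn = sqlite3.connect(":memory:")
conn.execute("create table colors (color_id integer)")
conn.executemany("insert into colors values (?)", [(1,), (2,), (2,), (3,)])

categorical = {
    "column": "color_id",
    "choice_query": "select distinct color_id from colors order by color_id",
    "metrics": ["sum"],
}

choices = resolve_choices(conn, categorical)
print(choices)  # [1, 2, 3]
```

An entry with a static `choices` list would pass through unchanged, so both options could share one code path.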

@thcrock
Contributor Author

thcrock commented Mar 3, 2017

@andreanr also mentioned supporting getting the feature names from lookup tables as well. For instance, using the above interface, you could either

  1. Get feature names that have color ids in them instead of the more helpful color names
  2. Use a join query in your spacetime aggregation's from_obj, and use the joined name instead of the id. A bunch of joins may make the queries take longer.
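
The two options can be compared with a small sketch. Everything here is assumed for illustration: the `events` and `colors` tables, the `color_name` column, and the `<column>_<choice>_sum` naming pattern are hypothetical and may not match the names the library actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table colors (color_id integer, color_name text);
    create table events (entity_id integer, color_id integer);
    insert into colors values (1, 'red'), (2, 'blue');
    insert into events values (10, 1), (10, 2), (11, 2);
""")

# Option 1: choices are raw ids, so feature names carry ids.
id_choices = [
    row[0]
    for row in conn.execute(
        "select distinct color_id from colors order by color_id"
    )
]
id_features = [f"color_id_{choice}_sum" for choice in id_choices]

# Option 2: join to the lookup table in from_obj and use the joined name.
from_obj = "events join colors using (color_id)"
name_choices = [
    row[0]
    for row in conn.execute(
        f"select distinct color_name from {from_obj} order by color_name"
    )
]
name_features = [f"color_name_{choice}_sum" for choice in name_choices]

print(id_features)    # ['color_id_1_sum', 'color_id_2_sum']
print(name_features)  # ['color_name_blue_sum', 'color_name_red_sum']
```

Option 2 gives more readable names, but every aggregation query then pays for the join.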

Neither of these is optimal, so solving this in a better way could be worth it. @andreanr do you think the work you've done on this problem could be used here?

@thcrock
Contributor Author

thcrock commented Mar 3, 2017

Created dssg/collate#67 for the problem I outlined in the previous comment. It seems more appropriate to solve that problem there than here.

For the other use cases (i.e., denormalized tables, categoricals which actually are ids, and cases where joining to look up a value is not too bad), I think we can move ahead here with a simpler interface like the one outlined above. I'm still unsure if there are any leakage problems with creating a list of choices this way.

@jzanzig

jzanzig commented Mar 6, 2017

I'm not sure I completely understand how this approach would introduce information leakage compared with the way collate already deals with categoricals. Are you saying that the select distinct query could return some values of color_id that aren't present in a specific training or test set, so that when the features themselves are generated, the feature will be 0 for all observations? If so, that's fine (it's not leaking information to have a feature that is 0 for all observations, if that's the correct value), and I think the interface (specifying a choice_query) is good.

@thcrock
Contributor Author

thcrock commented Mar 6, 2017

Looks like we agree on both points:

  1. That the current workarounds for dynamic categoricals behave no differently with regard to information leakage than this proposal; they just take more work
  2. That this doesn't actually introduce information leakage (I was pretty sure of this but wanted to confirm)
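
The second point can be checked on toy data. In this sketch (an assumed `events` table in SQLite, a hypothetical `features_as_of` helper, and made-up column names), color 3 first appears after the as-of date. It still gets a column, but that column is 0 for every entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table events (entity_id integer, color_id integer, event_date text);
    insert into events values
        (10, 1, '2016-01-05'),
        (10, 2, '2016-02-01'),
        (11, 2, '2016-03-10'),
        (11, 3, '2017-01-15');
""")

# The choice query sees every color, including one first seen in 2017.
choices = [
    row[0]
    for row in conn.execute(
        "select distinct color_id from events order by color_id"
    )
]


def features_as_of(conn, choices, as_of_date):
    """Count events per entity and choice, using only rows before as_of_date."""
    select_cols = ", ".join(
        f"sum(case when color_id = {int(c)} then 1 else 0 end) "
        f"as color_id_{int(c)}_sum"
        for c in choices
    )
    query = (
        f"select entity_id, {select_cols} from events "
        "where event_date < ? group by entity_id order by entity_id"
    )
    return conn.execute(query, (as_of_date,)).fetchall()


rows = features_as_of(conn, choices, "2016-06-01")
print(rows)  # [(10, 1, 1, 0), (11, 0, 1, 0)]
```

The last column (color 3) is all zeros because the date filter, not the choice list, decides which rows contribute to the feature values.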

Furthermore, based on Matt's comments above it looks like this feature may make it into collate, so we could get to piggyback on that here.

@thcrock
Contributor Author

thcrock commented Mar 16, 2017

Going to implement v1 of this here instead of waiting for collate.

@thcrock thcrock self-assigned this Mar 16, 2017
thcrock added a commit that referenced this issue Mar 16, 2017
- Add new feature categorical option, 'choice_query', which runs a query to determine a list of Collate `choices`
- Cache results of choice query in FeatureGenerator
thcrock added a commit that referenced this issue Mar 16, 2017
- Add new feature categorical option, 'choice_query', which runs a query to determine a list of Collate `choices`
- Cache results of choice query in FeatureGenerator
- Add example of dynamic categorical to example_experiment_config.yaml
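
One way the caching mentioned in these commit messages might look is sketched below. This is an assumption, not the committed code: `ChoiceQueryCache` is a hypothetical class that keys cached results on the query text, and SQLite stands in for the real database.

```python
import sqlite3


class ChoiceQueryCache:
    """Hypothetical cache: runs each distinct choice query at most once."""

    def __init__(self, conn):
        self.conn = conn
        self._cache = {}
        self.queries_run = 0  # counter kept only to make the caching visible

    def choices_for(self, choice_query):
        if choice_query not in self._cache:
            self.queries_run += 1
            rows = self.conn.execute(choice_query).fetchall()
            self._cache[choice_query] = [row[0] for row in rows]
        return self._cache[choice_query]


conn = sqlite3.connect(":memory:")
conn.execute("create table colors (color_id integer)")
conn.executemany("insert into colors values (?)", [(1,), (2,), (3,)])

cache = ChoiceQueryCache(conn)
query = "select distinct color_id from colors order by color_id"
first = cache.choices_for(query)
second = cache.choices_for(query)  # served from the cache, no second query
print(first, cache.queries_run)  # [1, 2, 3] 1
```

Caching matters when the same categorical is generated for many as-of dates, since the choice list does not depend on the date.
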
@thcrock thcrock removed this from the v0.2 milestone Mar 22, 2017
ecsalomon added a commit that referenced this issue Mar 28, 2017
Dynamic Categorical Choices [Resolves #33]
jesteria pushed a commit that referenced this issue Nov 29, 2017
Pass matrix directory to metta [Resolves #33]