-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only Build Features for Cohort [Resolves #513] #567
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't really comment beyond a couple superficial things -- up to @saleiro to be more profound -- but, in case this helps....
By default, only build/save features for the given cohort. This is to optimize speed for new users (and visit-based problems) to only what is necessary. For other users, the `save_all_features` flag is introduced. If this is set to True, the old behavior that builds features independently of the cohort is utilized. To make the use of this flag in combination with the replace Flag safe, the feature table "needs features?" check now sees if there are any cohort rows not present in the imputed table and rebuilds if anything is missing.
a852981
to
891dbe5
Compare
Codecov Report
@@ Coverage Diff @@
## master #567 +/- ##
==========================================
- Coverage 83.13% 83.07% -0.06%
==========================================
Files 83 83
Lines 4755 4780 +25
==========================================
+ Hits 3953 3971 +18
- Misses 802 809 +7
Continue to review full report at Codecov.
|
JOIN {cohort_table} ON ( | ||
{cohort_table.entity_id} = {from_obj.entity_id} | ||
AND {cohort_table.date} = {as_of_time} | ||
) | ||
WHERE {knowledge_date_column} >= {as_of_time} - interval {interval} | ||
GROUP BY {group} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I tested with explain analyze and everything looks good!
@@ -110,11 +110,16 @@ a column or SQL expression representing a numeric quantity present in the `from_ | |||
of aggregate functions we want to use. The aggregate function is applied to the quantity. | |||
* Each `group` is a column applied to the GROUP BY clause. Generally this is 'entity_id', but higher-level groupings | |||
(for instance, 'zip_code') can be used as long as they can be rolled up to 'entity_id'. | |||
* By default the query is joined with the cohort table (see 'state table' above) to remove unnecessary rows. If `save_all_features` is passed to the Experiment this is not done. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, save_all_features makes me think of columns and not rows... Maybe something like 'features_ignore_cohort'. or 'features_besides_cohort', or 'features_for_all' :) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. My preference is features_ignore_cohort
, I'll change that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, improved a lot my user experience, and the triage db disk requirements.
By default, only build/save features for the given cohort. This is to
optimize speed for new users (and visit-based problems) to only what is
necessary.
For other users, the
save_all_features
flag is introduced. If this isset to True, the old behavior that builds features independently of the
cohort is utilized.
To make the use of this flag in combination with the replace Flag safe,
the feature table "needs features?" check now sees if there are any
cohort rows not present in the imputed table and rebuilds if anything is
missing.