Drop variables in a batch in normalize_entity
#533
Codecov Report

@@            Coverage Diff             @@
##           master     #533      +/-   ##
==========================================
+ Coverage   96.25%   96.26%   +<.01%
==========================================
  Files         114      114
  Lines        9245     9253       +8
==========================================
+ Hits         8899     8907       +8
  Misses        346      346

Continue to review full report at Codecov.
Another significant part of the run time is this sort call: https://github.com/Featuretools/featuretools/blob/master/featuretools/entityset/entityset.py#L744 It's unclear to me why this would be necessary, and removing it doesn't break any of the tests.
featuretools/entityset/entity.py (Outdated)

-        self.df.drop(variable_id, axis=1, inplace=True)
-        v = self._get_variable(variable_id)
-        self.variables.remove(v)
+        self.delete_variables([variable_id])
Can you add a test case for this method?
👍 And do you think we need to keep delete_variable?
Let's just remove it. It's unused, and delete_variables([variable_id]) achieves the same result.
The reason we do the sort is that we assume dataframes are sorted by their time index. This assumption is important so a primitive like cumulative sum can assume it is getting values in sorted order. That being said, the sorting here might just be defensive coding, and unnecessary if we can verify that the dataframe is sorted prior to this call or after it.
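One way to keep the defensive behavior without always paying for a sort is to check monotonicity first. This is an illustrative sketch, not the featuretools internals; the column name "time" and the DataFrame are made up for the example. Checking is O(n) with no copy, while sort_values is O(n log n) and copies the frame.

```python
import pandas as pd

# Hypothetical dataframe with a time index column named "time".
df = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    "value": [1, 2, 3],
})

# Only sort when the sortedness assumption actually fails.
if not df["time"].is_monotonic_increasing:
    df = df.sort_values("time")
```

If the dataframe is already sorted (the common case when data is appended in time order), the sort is skipped entirely.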
Looks good.
Dropping the additional_variables from the original dataframe takes a significant portion of the time of normalize_entity, and this increases with the number of additional_variables. For example, with 10 variables about 40% of the time is spent in df.drop, compared to 10% for 1 variable.

Currently we call df.drop once for each variable. The number of variables passed to df.drop does not affect its running time (in my benchmarking), so dropping them all at once is O(1) instead of O(n).

Entity.delete_variable is no longer used, but I wasn't sure if it is considered public.
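The per-call cost comes from pandas copying the frame on each non-inplace drop, so the loop pays that copy once per variable while a batched drop pays it once total. A minimal sketch of the two approaches (the column names and sizes are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the entity's dataframe.
df = pd.DataFrame(
    np.random.rand(100_000, 12),
    columns=[f"var_{i}" for i in range(12)],
)
additional_variables = [f"var_{i}" for i in range(2, 12)]

# Before: one df.drop call per variable -> one full-frame copy each time.
slow = df
for variable_id in additional_variables:
    slow = slow.drop(variable_id, axis=1)

# After: a single df.drop with the whole list -> one copy, regardless
# of how many variables are removed.
fast = df.drop(additional_variables, axis=1)

assert slow.equals(fast)
```

Both paths produce an identical result; only the number of intermediate copies differs.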