Skip to content

Drop variables in a batch in normalize_entity #533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 8, 2019

Conversation

CJStadler
Copy link
Contributor

@CJStadler CJStadler commented May 7, 2019

Dropping the additional_variables from the original dataframe takes a significant portion of the time of normalize_entity, and this increases with the number of additional_variables. For example, with 10 variables about 40% of the time is spent in df.drop, compared to 10% for 1 variable.

Currently we call df.drop once for each variable. The number of variables passed to df.drop does not affect its running time (in my benchmarking), so dropping them all at once is O(1) instead of O(n).

Entity.delete_variable is no longer used, but I wasn't sure if it is considered public.

Dropping the "additional_variables" from the original dataframe takes a
significant portion of the time, and this increases with the number of
"additional_variables".

Currently we call df.drop once for each variables. The number of
variables passed to df.drop does not affect the running time (in my
benchmarking), so dropping them all at once is O(1) instead of O(n).
@codecov
Copy link

codecov bot commented May 7, 2019

Codecov Report

Merging #533 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #533      +/-   ##
==========================================
+ Coverage   96.25%   96.26%   +<.01%     
==========================================
  Files         114      114              
  Lines        9245     9253       +8     
==========================================
+ Hits         8899     8907       +8     
  Misses        346      346
Impacted Files Coverage Δ
featuretools/tests/entityset_tests/test_entity.py 100% <100%> (ø) ⬆️
featuretools/entityset/entity.py 96.12% <100%> (+0.01%) ⬆️
featuretools/entityset/entityset.py 95.06% <100%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7c1c1a9...d668975. Read the comment docs.

@CJStadler CJStadler requested a review from rwedge May 7, 2019 20:31
@CJStadler
Copy link
Contributor Author

Another significant part of the run time is this sort call: https://github.com/Featuretools/featuretools/blob/master/featuretools/entityset/entityset.py#L744

It's unclear to me why this would be necessary, and removing it doesn't break any of the tests.

self.df.drop(variable_id, axis=1, inplace=True)
v = self._get_variable(variable_id)
self.variables.remove(v)
self.delete_variables([variable_id])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case for this method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 And do you think we need to keep delete_variable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just remove it. It's unused and delete_variables([variable_id]) achieves the same result

@kmax12
Copy link
Contributor

kmax12 commented May 8, 2019

Another significant part of the run time is this sort call: https://github.com/Featuretools/featuretools/blob/master/featuretools/entityset/entityset.py#L744

It's unclear to me why this would be necessary, and removing it doesn't break any of the tests.

the reason we do the sort is because we have an assumption that dataframes are sorted by their time index. This assumption is import so a primitive like cumulative sum can assume it is getting values in sorted order.

That being said, the sorting here might just be defensive coding and unnecessary if we can verify the dataframe is sorted prior to this call or after this call.

@CJStadler CJStadler requested a review from rwedge May 8, 2019 16:31
Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@CJStadler CJStadler merged commit 4b6a5a4 into master May 8, 2019
@CJStadler CJStadler deleted the batch-delete-variables branch May 8, 2019 20:04
@rwedge rwedge mentioned this pull request May 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants