Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop variables in a batch in `normalize_entity` #533

Merged
merged 3 commits into from May 8, 2019

Conversation

Projects
None yet
3 participants
@CJStadler
Copy link
Contributor

commented May 7, 2019

Dropping the additional_variables from the original dataframe takes a significant portion of the time of normalize_entity, and this increases with the number of additional_variables. For example, with 10 variables about 40% of the time is spent in df.drop, compared to 10% for 1 variable.

Currently we call df.drop once for each variable. The number of variables passed to df.drop does not affect its running time (in my benchmarking), so dropping them all at once is O(1) instead of O(n).

Entity.delete_variable is no longer used, but I wasn't sure if it is considered public.

Delete variables in a batch in normalize_entity
Dropping the "additional_variables" from the original dataframe takes a
significant portion of the time, and this increases with the number of
"additional_variables".

Currently we call df.drop once for each variables. The number of
variables passed to df.drop does not affect the running time (in my
benchmarking), so dropping them all at once is O(1) instead of O(n).
@codecov

This comment has been minimized.

Copy link

commented May 7, 2019

Codecov Report

Merging #533 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #533      +/-   ##
==========================================
+ Coverage   96.25%   96.26%   +<.01%     
==========================================
  Files         114      114              
  Lines        9245     9253       +8     
==========================================
+ Hits         8899     8907       +8     
  Misses        346      346
Impacted Files Coverage Δ
featuretools/tests/entityset_tests/test_entity.py 100% <100%> (ø) ⬆️
featuretools/entityset/entity.py 96.12% <100%> (+0.01%) ⬆️
featuretools/entityset/entityset.py 95.06% <100%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7c1c1a9...d668975. Read the comment docs.

@CJStadler CJStadler requested a review from rwedge May 7, 2019

@CJStadler

This comment has been minimized.

Copy link
Contributor Author

commented May 8, 2019

Another significant part of the run time is this sort call: https://github.com/Featuretools/featuretools/blob/master/featuretools/entityset/entityset.py#L744

It's unclear to me why this would be necessary, and removing it doesn't break any of the tests.

self.df.drop(variable_id, axis=1, inplace=True)
v = self._get_variable(variable_id)
self.variables.remove(v)
self.delete_variables([variable_id])

This comment has been minimized.

Copy link
@rwedge

rwedge May 8, 2019

Contributor

Can you add a test case for this method

This comment has been minimized.

Copy link
@CJStadler

CJStadler May 8, 2019

Author Contributor

👍 And do you think we need to keep delete_variable?

This comment has been minimized.

Copy link
@rwedge

rwedge May 8, 2019

Contributor

Let's just remove it. It's unused and delete_variables([variable_id]) achieves the same result

@kmax12

This comment has been minimized.

Copy link
Member

commented May 8, 2019

Another significant part of the run time is this sort call: https://github.com/Featuretools/featuretools/blob/master/featuretools/entityset/entityset.py#L744

It's unclear to me why this would be necessary, and removing it doesn't break any of the tests.

the reason we do the sort is because we have an assumption that dataframes are sorted by their time index. This assumption is import so a primitive like cumulative sum can assume it is getting values in sorted order.

That being said, the sorting here might just be defensive coding and unnecessary if we can verify the dataframe is sorted prior to this call or after this call.

@CJStadler CJStadler requested a review from rwedge May 8, 2019

@rwedge

rwedge approved these changes May 8, 2019

Copy link
Contributor

left a comment

Looks good

@CJStadler CJStadler merged commit 4b6a5a4 into master May 8, 2019

4 checks passed

codecov/patch 100% of diff hit (target 96.25%)
Details
codecov/project 96.26% (+<.01%) compared to 7c1c1a9
Details
license/cla Contributor License Agreement is signed.
Details
test_all_python_versions Workflow: test_all_python_versions
Details

@CJStadler CJStadler deleted the batch-delete-variables branch May 8, 2019

@rwedge rwedge referenced this pull request May 17, 2019

Merged

v0.8.0 #548

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.