Speedup groupby transform calculations #609
…tools/featuretools into update-groupby-transform
Codecov Report
@@            Coverage Diff            @@
##           master    #609     +/-   ##
========================================
+ Coverage   97.42%   97.42%   +<.01%
========================================
  Files         118      118
  Lines        9526     9532      +6
========================================
+ Hits         9281     9287      +6
  Misses        245      245
I think the code looks good. Jupyter notebooks can be tricky to version control so I was wondering what approach you wanted to take for the benchmarks folder? When would we commit changes to the notebook?
For now, what if we save notebooks with their output cleared? That will make them easier to diff. I just updated the current benchmark notebook with its output cleared.
That sounds reasonable.
* change to dictionary
* add groupbytransformfeature to top level api
* by features
* final implementation
* linting
* add to change log
* Update changelog.rst
* handle null group key
* check for feature vals
* Update feature_set_calculator.py
* clear output
This PR updates the logic for how we calculate group by transform features.
Previously we iterated over the groups first and then the features, updating the resulting frame greedily as we went along.
Now we iterate over the features, then the groups. For each feature, we accumulate all the values across all groups and then update the frame just once.
Basically, this reduces the number of update calls from num features x num groups to just num features.
Also added a benchmarks folder to hold the code I used to test these changes
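To make the difference in loop order concrete, here is a minimal sketch in plain Python. It is not the code from this PR: `CountingFrame`, `group_rows`, the two loop functions, and the toy `cum_sum` / `cum_count` features are all hypothetical names made up for illustration, and the "frame" is just a dictionary of columns that counts how many times it is updated.

```python
from collections import defaultdict


class CountingFrame:
    """Hypothetical stand-in for a result frame that counts update calls."""

    def __init__(self, index):
        # Each column maps row id -> value, starting out as None.
        self.columns = defaultdict(lambda: dict.fromkeys(index))
        self.update_calls = 0

    def update(self, column, values):
        # Write a batch of {row_id: value} pairs into one column.
        self.columns[column].update(values)
        self.update_calls += 1


def group_rows(rows, key):
    """Map each group key to the list of (row_id, row) pairs in that group."""
    groups = defaultdict(list)
    for row_id, row in rows.items():
        groups[row[key]].append((row_id, row))
    return groups


def groups_then_features(rows, key, features):
    # Old order: update the frame once per (group, feature) pair.
    frame = CountingFrame(rows)
    for members in group_rows(rows, key).values():
        for name, func in features.items():
            frame.update(name, func(members))
    return frame


def features_then_groups(rows, key, features):
    # New order: accumulate values across all groups, then update once per feature.
    frame = CountingFrame(rows)
    groups = group_rows(rows, key)
    for name, func in features.items():
        accumulated = {}
        for members in groups.values():
            accumulated.update(func(members))
        frame.update(name, accumulated)
    return frame


def cum_sum(members):
    # Toy transform: running total of "amount" within one group.
    total, out = 0, {}
    for row_id, row in members:
        total += row["amount"]
        out[row_id] = total
    return out


def cum_count(members):
    # Toy transform: 1-based position of each row within its group.
    return {row_id: i for i, (row_id, _) in enumerate(members, start=1)}


rows = {
    0: {"customer": "a", "amount": 10},
    1: {"customer": "b", "amount": 5},
    2: {"customer": "a", "amount": 7},
    3: {"customer": "c", "amount": 1},
}
features = {"cum_sum": cum_sum, "cum_count": cum_count}

old = groups_then_features(rows, "customer", features)
new = features_then_groups(rows, "customer", features)
print(old.update_calls, new.update_calls)
```

With 3 groups and 2 features, the first function updates the frame 6 times and the second only twice, while both end up with the same column values. The saving grows with the number of groups, since each update on a real frame carries fixed overhead regardless of how many rows it touches.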