Speedup groupby transform calculations #609

kmax12 merged 17 commits into master on Jun 24, 2019


2 participants

commented Jun 19, 2019

This PR updates the logic for how we calculate groupby transform features.

Previously we iterated over the groups first and then the features, updating the resulting frame greedily as we went along.

Now we iterate over the features first, then the groups. For each feature, we accumulate all the values across all groups and then update the frame just once.

Basically, this reduces the number of update calls from (num features x num groups) to just num features.
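To make the change in loop order concrete, here is a minimal sketch in plain Python. It is not the Featuretools implementation: the `Frame` class, the `rows` data, and the `cum_sum` and `diff` functions are hypothetical stand-ins chosen only to count how often the result frame gets updated under each ordering.

```python
from collections import defaultdict

# Hypothetical toy data: each row has an id, a group key, and a value.
rows = [
    {"id": 0, "group": "a", "value": 1},
    {"id": 1, "group": "b", "value": 10},
    {"id": 2, "group": "a", "value": 2},
    {"id": 3, "group": "b", "value": 20},
    {"id": 4, "group": "a", "value": 3},
]


def cum_sum(values):
    """Running total within one group."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


def diff(values):
    """Difference from the previous value within one group."""
    return [None] + [b - a for a, b in zip(values, values[1:])]


# Hypothetical "features": a name mapped to a per-group transform.
features = {"cum_sum": cum_sum, "diff": diff}


class Frame:
    """Toy result frame that counts how many times it is updated."""

    def __init__(self, ids, columns):
        self.data = {c: {i: None for i in ids} for c in columns}
        self.update_calls = 0

    def update(self, column, values_by_id):
        self.update_calls += 1
        self.data[column].update(values_by_id)


def group_rows(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row["group"]].append(row)
    return groups


def groups_then_features(rows, features):
    """Old order: one frame update per (group, feature) pair."""
    frame = Frame([r["id"] for r in rows], features)
    for group in group_rows(rows).values():
        ids = [r["id"] for r in group]
        values = [r["value"] for r in group]
        for name, func in features.items():
            frame.update(name, dict(zip(ids, func(values))))
    return frame


def features_then_groups(rows, features):
    """New order: accumulate across groups, one frame update per feature."""
    frame = Frame([r["id"] for r in rows], features)
    groups = group_rows(rows)
    for name, func in features.items():
        accumulated = {}
        for group in groups.values():
            ids = [r["id"] for r in group]
            values = [r["value"] for r in group]
            accumulated.update(zip(ids, func(values)))
        frame.update(name, accumulated)
    return frame


old = groups_then_features(rows, features)
new = features_then_groups(rows, features)
print(old.update_calls, new.update_calls)
```

Both orderings fill the frame with the same values; only the number of update calls differs. In a real DataFrame each update carries overhead of its own, which is presumably where the savings come from.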

I also added a benchmarks folder to hold the code I used to test these changes.

kmax12 added some commits Jun 19, 2019



commented Jun 19, 2019

Codecov Report

Merging #609 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #609      +/-   ##
+ Coverage   97.42%   97.42%   +<.01%     
  Files         118      118              
  Lines        9526     9532       +6     
+ Hits         9281     9287       +6     
  Misses        245      245
Impacted Files                   Coverage Δ
...s/computational_backends/     98.1% <100%> (+0.03%) ⬆️
featuretools/                    66.66% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f6ebadc...e690733.

@kmax12 kmax12 requested a review from rwedge Jun 19, 2019


left a comment

I think the code looks good. Jupyter notebooks can be tricky to version control so I was wondering what approach you wanted to take for the benchmarks folder? When would we commit changes to the notebook?


Member Author

commented Jun 20, 2019

For now, what if we save notebooks with their output cleared? That will make them easier to diff. I just updated the current benchmark notebook so its output is cleared.
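One way to clear output before committing, sketched with only the standard library. This assumes the notebook is stored in the nbformat v4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the `clear_outputs` helper and the `example` notebook are hypothetical, and the thread does not say which tool was actually used.

```python
import copy
import json


def clear_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleared = copy.deepcopy(notebook)
    for cell in cleared.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleared


# Hypothetical in-memory notebook standing in for a file on disk.
example = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
}

cleared = clear_outputs(example)
# Serialize with a stable indent so successive commits diff cleanly.
serialized = json.dumps(cleared, indent=1)
```

For a real file, the same function could be applied to the result of `json.load` and the output written back with `json.dump`.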



commented Jun 20, 2019

That sounds reasonable

@kmax12 kmax12 merged commit 50f335f into master Jun 24, 2019

4 checks passed

codecov/patch 100% of diff hit (target 97.42%)
codecov/project 97.42% (+<.01%) compared to f6ebadc
license/cla Contributor License Agreement is signed.
test_all_python_versions Workflow: test_all_python_versions

@rwedge rwedge referenced this pull request Jul 3, 2019


v0.9.1 #640

johnnyheineken pushed a commit to johnnyheineken/featuretools that referenced this pull request Jul 7, 2019

Speedup groupby transform calculations (Featuretools#609)
* change to dictionary

* add groupbytransformfeature to top level api

* by features

* final implementation

* linting

* add to change log

* Update changelog.rst

* handle null group key

* check for feature vals

* Update

* clear output