Speedup groupby transform calculations #609

Merged 17 commits on Jun 24, 2019

Contributor

@kmax12 kmax12 commented Jun 19, 2019

This PR updates the logic for how we calculate group by transform features.

Previously we iterated over the groups first and then the features, updating the resulting frame greedily as we went along.

Now we iterate over the features first, then the groups. For each feature, we accumulate the values across all groups and then update the frame just once.

This reduces the number of update calls from num features x num groups to just num features.
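
To make the change in loop order concrete, here is a minimal sketch in plain Python. It is not the featuretools implementation: `Frame`, `cumulative_sum`, `cumulative_count`, and the sample data are all hypothetical stand-ins, and the toy `Frame.update` only counts calls to show where the savings come from.

```python
from collections import defaultdict


class Frame:
    """Hypothetical stand-in for a DataFrame: column name -> {row id: value}."""

    def __init__(self):
        self.columns = defaultdict(dict)
        self.update_calls = 0

    def update(self, column, values):
        # Each call models one (relatively expensive) frame update.
        self.update_calls += 1
        self.columns[column].update(values)


def cumulative_sum(rows):
    """Running total of the values within one group."""
    total, out = 0, {}
    for row_id, value in rows:
        total += value
        out[row_id] = total
    return out


def cumulative_count(rows):
    """1-based position of each row within its group."""
    return {row_id: i for i, (row_id, _) in enumerate(rows, start=1)}


# Hypothetical group-by transform features.
FEATURES = {"cum_sum": cumulative_sum, "cum_count": cumulative_count}


def group_rows(data):
    """Split (row id, group key, value) records into per-group row lists."""
    groups = defaultdict(list)
    for row_id, key, value in data:
        groups[key].append((row_id, value))
    return groups


def groups_then_features(data):
    """Old order: update the frame once per (group, feature) pair."""
    frame = Frame()
    for rows in group_rows(data).values():
        for name, func in FEATURES.items():
            frame.update(name, func(rows))
    return frame


def features_then_groups(data):
    """New order: accumulate across groups, then update once per feature."""
    frame = Frame()
    groups = group_rows(data)
    for name, func in FEATURES.items():
        accumulated = {}
        for rows in groups.values():
            accumulated.update(func(rows))
        frame.update(name, accumulated)
    return frame


# Three groups ("a", "b", "c") and two features.
DATA = [(0, "a", 10), (1, "b", 5), (2, "a", 3), (3, "c", 7), (4, "b", 1)]
old = groups_then_features(DATA)
new = features_then_groups(DATA)
print(old.update_calls, new.update_calls)
```

Both orderings produce the same column values; only the number of `update` calls differs, which is where the speedup is expected to come from when each update is costly.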

Also added a benchmarks folder to hold the code I used to test these changes.

@codecov

codecov bot commented Jun 19, 2019

Codecov Report

Merging #609 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #609      +/-   ##
==========================================
+ Coverage   97.42%   97.42%   +<.01%     
==========================================
  Files         118      118              
  Lines        9526     9532       +6     
==========================================
+ Hits         9281     9287       +6     
  Misses        245      245

Impacted Files                                          Coverage Δ
...s/computational_backends/feature_set_calculator.py   98.1% <100%> (+0.03%) ⬆️
featuretools/__init__.py                                66.66% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6ebadc...e690733.

@kmax12 kmax12 requested a review from rwedge June 19, 2019 22:03
Contributor

@rwedge rwedge left a comment

I think the code looks good. Jupyter notebooks can be tricky to version control, so I was wondering what approach you wanted to take for the benchmarks folder. When would we commit changes to the notebook?

@kmax12
Contributor Author

kmax12 commented Jun 20, 2019

For now, what if we save notebooks with their output cleared? That will make them easier to diff. I just updated the current benchmark notebook with its output cleared.
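
One way to clear output before committing is sketched below. It uses only the standard library and assumes the notebook is in the nbformat 4 JSON layout, where code cells carry `outputs` and `execution_count` fields. The function names and any file path passed to them are hypothetical, not part of this PR.

```python
import json


def clear_outputs(notebook):
    """Drop outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_notebook_file(path):
    """Rewrite the notebook at `path` (a hypothetical file) with cleared output."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    clear_outputs(notebook)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)
        f.write("\n")


# Small in-memory notebook used only to demonstrate clear_outputs.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1.2s\n"]}],
            "source": ["run_benchmark()"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Benchmarks"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}
cleared = clear_outputs(example)
```

Because timing numbers and execution counters no longer change between runs, diffs of the committed notebook should show only source edits.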

@rwedge
Contributor

rwedge commented Jun 20, 2019

That sounds reasonable

rwedge previously approved these changes Jun 20, 2019
@kmax12 kmax12 merged commit 50f335f into master Jun 24, 2019
@rwedge rwedge mentioned this pull request Jul 3, 2019
johnnyheineken pushed a commit to johnnyheineken/featuretools that referenced this pull request Jul 7, 2019
* change to dictionary

* add groupbytransformfeature to top level api

* by features

* final implementation

* linting

* add to change log

* Update changelog.rst

* handle null group key

* check for feature vals

* Update feature_set_calculator.py

* clear output
@rwedge rwedge deleted the update-groupby-transform branch February 19, 2021 22:28