Improved chunking when calculating feature matrices #121

Merged
merged 38 commits into master from chunking on Mar 28, 2018

5 participants
@rwedge
Contributor

rwedge commented Mar 26, 2018

Instead of calculating all rows of a feature matrix that share a cutoff time together, Featuretools now splits the feature matrix rows into chunks and calculates each chunk separately. When forming chunks, it prioritizes keeping rows with the same cutoff time in the same chunk.
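
As a rough illustration of the idea (not the code in this PR), the sketch below orders rows by cutoff time and then slices them into fixed-size chunks, so rows sharing a cutoff time stay adjacent and are split only when a chunk fills up. The function name `chunk_rows` and its arguments are hypothetical.

```python
from collections import defaultdict


def chunk_rows(cutoff_times, chunk_size):
    """Hypothetical helper: split row positions into chunks of at most
    `chunk_size` rows, keeping rows with equal cutoff times adjacent."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")

    # Group row positions by their cutoff time.
    groups = defaultdict(list)
    for position, cutoff in enumerate(cutoff_times):
        groups[cutoff].append(position)

    chunks = []
    current = []
    # Walk the groups in cutoff-time order, filling one chunk at a time.
    for cutoff in sorted(groups):
        for position in groups[cutoff]:
            current.append(position)
            if len(current) == chunk_size:
                chunks.append(current)
                current = []
    # Emit whatever is left over as a final, smaller chunk.
    if current:
        chunks.append(current)
    return chunks


example_cutoffs = [3, 1, 3, 2, 1, 3, 3]
example_chunks = chunk_rows(example_cutoffs, chunk_size=3)
```

A group larger than the chunk size is spread over consecutive chunks, and small groups share a chunk with their neighbors in cutoff-time order.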

ft.dfs and ft.calculate_feature_matrix now have a chunk_size parameter to allow for custom chunk sizes. chunk_size accepts positive integers for explicit chunk sizes or floats between 0 and 1 for percentage-based chunk sizes. The old 'group all rows that share a cutoff time' method can be used by setting the string "cutoff time" as the chunk size.
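
One way to read the accepted values is sketched below. This is an assumption-laden illustration, not the Featuretools implementation: the helper name `resolve_chunk_size`, the rounding rule for floats, and the use of `None` to signal per-cutoff-time grouping are all hypothetical. A `chunk_size` of `None` is left out because its default behavior is not described here.

```python
import math


def resolve_chunk_size(chunk_size, num_rows):
    """Hypothetical helper: turn a chunk_size argument into a row count.

    Returns None when rows should be grouped per cutoff time instead.
    """
    if chunk_size == "cutoff time":
        # Old behavior: one group per cutoff time, no fixed row count.
        return None
    if isinstance(chunk_size, bool):
        raise TypeError("chunk_size must not be a bool")
    if isinstance(chunk_size, int):
        if chunk_size <= 0:
            raise ValueError("integer chunk_size must be positive")
        return chunk_size
    if isinstance(chunk_size, float):
        if not 0 < chunk_size < 1:
            raise ValueError("float chunk_size must be between 0 and 1")
        # Assumed rounding: round up, and never return fewer than 1 row.
        return max(1, math.ceil(chunk_size * num_rows))
    raise TypeError("unsupported chunk_size: %r" % (chunk_size,))


rows_for_int = resolve_chunk_size(100, num_rows=10)
rows_for_float = resolve_chunk_size(0.25, num_rows=10)
rows_for_cutoff = resolve_chunk_size("cutoff time", num_rows=10)
```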

There is also new information in the documentation about using chunking and other parameters to improve the performance of Featuretools.

@kmax12 kmax12 changed the title from Chunking to Improved chunking when calculating feature matrices Mar 26, 2018

@codecov-io

codecov-io commented Mar 26, 2018

Codecov Report

Merging #121 into master will increase coverage by 0.19%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #121      +/-   ##
==========================================
+ Coverage   88.27%   88.47%   +0.19%     
==========================================
  Files          73       73              
  Lines        7457     7558     +101     
==========================================
+ Hits         6583     6687     +104     
+ Misses        874      871       -3

| Impacted Files | Coverage Δ |
|---|---|
| featuretools/synthesis/dfs.py | 100% <ø> (ø) ⬆️ |
| featuretools/computational_backends/api.py | 100% <ø> (ø) ⬆️ |
| ...computational_backends/calculate_feature_matrix.py | 98.59% <100%> (+1.7%) ⬆️ |
| ...utational_backend/test_calculate_feature_matrix.py | 98.93% <100%> (+0.13%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6f1b813...97cfd5a.

@@ -82,6 +82,18 @@ def calculate_feature_matrix(features, cutoff_time=None, instance_ids=None,
profile (bool, optional): Enables profiling if True.
chunk_size (int or float or None or "cutoff time"): Instead of

@kmax12

kmax12 Mar 26, 2018

Member

because we have the usage guide, let's just make this description

Number of rows of output feature matrix to calculate at a time. If passed an integer greater than 0, will try to use that many rows per chunk. If passed a float value between 0 and 1, sets the chunk size to that percentage of all instances. If passed the string "cutoff time", rows are split per cutoff time.

chunks = []
for group_name in groups:

@kmax12

kmax12 Mar 26, 2018

Member

I can understand the logic in this function, but can you add some brief comments explaining it?

else:
for chunk in iterator:
chunks.append(chunk)
pbar_string = ("Elapsed: {elapsed} | Remaining: {remaining} | "
"Progress: {l_bar}{bar}|| "

@kmax12

kmax12 Mar 26, 2018

Member

Can we get rid of the || after the progress bar? The output looks better without it.

@kmax12

Member

kmax12 commented Mar 28, 2018

Good work! Merging in

@kmax12 kmax12 merged commit d60c664 into master Mar 28, 2018

1 of 2 checks passed

ci/circleci CircleCI is running your tests
license/cla Contributor License Agreement is signed.

@rwedge rwedge referenced this pull request Apr 13, 2018

Merged

Release v0.1.20 #131

rwedge added a commit that referenced this pull request Apr 13, 2018

Release v0.1.20 (#131)
**v0.1.20** Apr 13, 2018
* Improved chunking when calculating feature matrices  (#121)
* Primitives as strings in DFS parameters (#129)
* Integer time index bugfixes (#128)
* Add make_temporal_cutoffs utility function (#126)
* Show all entities, switch shape display to row/col (#124)
* fixed num characters nan fix (#118)
* modify ignore_variables docstring (#117)

@kmax12 kmax12 deleted the chunking branch Jun 11, 2018