Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Give more frequent progress bar updates and update chunk size behavior #631

Merged
merged 39 commits into from Aug 2, 2019

Conversation

@kmax12
Copy link
Member

commented Jul 1, 2019

This PR changes how Featuretools handles chunking cut off times for calculation.

Summary of changes

  • Cutoff time are first grouped by their time value, then broken up so no group is bigger than chunk_size. By default, there is no limit to the chunk_size, unless calculating in parallel in which case chunk size is 1/n_jobs.
  • Only provides time remaining estimate if there is more than one group of cutoff time (i.e can make an estimate based on rows being completed since columns is a bad estimate)
  • Finer grain updates to the progress bar by updating as features are calculated for a given chunk.
@codecov

This comment has been minimized.

Copy link

commented Jul 1, 2019

Codecov Report

Merging #631 into master will increase coverage by 0.01%.
The diff coverage is 99.32%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #631      +/-   ##
==========================================
+ Coverage    97.5%   97.51%   +0.01%     
==========================================
  Files         118      118              
  Lines       10028    10035       +7     
==========================================
+ Hits         9778     9786       +8     
+ Misses        250      249       -1
Impacted Files Coverage Δ
featuretools/computational_backends/utils.py 93.8% <100%> (-1.98%) ⬇️
...utational_backend/test_calculate_feature_matrix.py 99.34% <100%> (-0.02%) ⬇️
featuretools/computational_backends/api.py 100% <100%> (ø) ⬆️
...mputational_backend/test_feature_set_calculator.py 100% <100%> (ø) ⬆️
...computational_backends/calculate_feature_matrix.py 98.88% <100%> (+0.98%) ⬆️
...s/computational_backends/feature_set_calculator.py 98.2% <96.29%> (-0.22%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d00799e...0df8c4b. Read the comment docs.

kmax12 added 12 commits Jul 1, 2019

@kmax12 kmax12 changed the title [WIP] Update chunk size behavior and give more frequent progress bar updates [WIP] Give more frequent progress bar updates and update chunk size behavior Jul 10, 2019

kmax12 added 9 commits Jul 10, 2019

@kmax12 kmax12 changed the title [WIP] Give more frequent progress bar updates and update chunk size behavior Give more frequent progress bar updates and update chunk size behavior Jul 11, 2019

@@ -208,7 +116,7 @@ def n_jobs_to_workers(n_jobs):
return workers


def create_client_and_cluster(n_jobs, num_tasks, dask_kwargs, entityset_size):
def create_client_and_cluster(n_jobs, dask_kwargs, entityset_size):

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 11, 2019

Author Member

had to remove num_tasks from here because we use the result of this call to decide how many tasks to create.

warning_string += " Not enough cpu cores ({}).".format(cpu_workers)

if num_tasks < n_jobs:
chunk_warning = " Not enough chunks ({}), consider reducing the chunk size"

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 11, 2019

Author Member

this warning was moved into calculate feature matrix since I remove num_tasks from this function

Returns:
pd.DataFrame : Pandas DataFrame of calculated feature values.
Indexed by instance_ids. Columns in same order as features
passed in.
"""
assert len(instance_ids) > 0, "0 instance ids provided"

if progress_callback is None:
# do nothing for the progress call back if not provided
def progress_callback(*args):

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 11, 2019

Author Member

not sure if this is the best way to handle the no progress_callback scenario. it seemed better than checking everywhere it was called

@@ -500,6 +534,7 @@ def _calculate_agg_features(self, features, frame, df_trie):
features = [f for f in features if f.get_name()
not in frame.columns]
if not len(features):
progress_callback(len(features) / float(self.num_features))

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 11, 2019

Author Member

not immediately sure how to write a test case that hits this situation

def test_warning_not_enough_chunks(es, capsys):
property_feature = IdentityFeature(es['log']['value']) > 10

# this doesn't test that output is correct

This comment has been minimized.

Copy link
@kmax12

kmax12 Jul 11, 2019

Author Member

tried to use capsys as described here, but only the progress bar was in the output, not the written messages

kmax12 added 3 commits Jul 11, 2019
kmax12 and others added 7 commits Jul 11, 2019

@kmax12 kmax12 requested a review from rwedge Jul 23, 2019

kmax12 and others added 3 commits Jul 30, 2019
kmax12 and others added 3 commits Aug 1, 2019
@rwedge
rwedge approved these changes Aug 2, 2019
Copy link
Contributor

left a comment

Looks good!

@kmax12 kmax12 merged commit 815c46b into master Aug 2, 2019

4 checks passed

codecov/patch 99.32% of diff hit (target 97.5%)
Details
codecov/project 97.51% (+0.01%) compared to d00799e
Details
license/cla Contributor License Agreement is signed.
Details
test_all_python_versions Workflow: test_all_python_versions
Details
@rwedge rwedge referenced this pull request Aug 19, 2019

@kmax12 kmax12 deleted the update-progress-bar branch Sep 11, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.