
Improve performance of the impute function. #135

Merged

merged 26 commits from F-A:master into blue-yonder:master on Feb 1, 2017

Conversation

@F-A (Contributor) commented Jan 6, 2017

Improve performance of the functions:

  • utilities.dataframe_function.get_range_values_per_column(df)
  • utilities.dataframe_function.impute(df_impute)
    More specifically: apply the impute function directly on numpy array to improve computation time.

Now the impute function runs in 109 ms (60 samples, 14256 features, i.e. columns).

Note: I did not improve the performance of impute_dataframe_range(...) since it would have been too much of a hassle to implement all the checks in that function, e.g. in case the min/max/median values of each column are not present. In our case we call get_range_values_per_column just before, so these checks are not necessary.
So I just reimplemented impute_dataframe_range directly in the impute function. This is less modular. Maybe you could pack this code into a new impute_dataframe_range function.

Solves issue #123.
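The numpy-based approach described above can be sketched roughly as follows. This is a minimal illustration, not tsfresh's actual implementation; the function name `impute_numpy` and its column-statistics logic are hypothetical:

```python
import numpy as np
import pandas as pd

def impute_numpy(df):
    """Replace NaN with the column median, +inf with the column max and
    -inf with the column min, working on the underlying numpy array.
    Hypothetical sketch of the approach described in this PR."""
    arr = df.to_numpy(dtype=np.float64)
    masked = np.ma.masked_invalid(arr)               # masks NaN and +/-inf
    col_median = np.ma.median(masked, axis=0).filled(0)
    col_max = masked.max(axis=0).filled(0)
    col_min = masked.min(axis=0).filled(0)
    # look up the per-column replacement value for each bad position
    for bad, repl in [(np.isnan(arr), col_median),
                      (np.isposinf(arr), col_max),
                      (np.isneginf(arr), col_min)]:
        arr[bad] = repl[np.nonzero(bad)[1]]
    return pd.DataFrame(arr, index=df.index, columns=df.columns)
```

Working on the raw float array avoids pandas' per-column overhead, which is where the speedup reported above plausibly comes from.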

@coveralls commented Jan 6, 2017

Coverage Status

Coverage increased (+0.04%) to 96.727% when pulling c3205e9 on F-A:master into 32a53ca on blue-yonder:master.

@MaxBenChrist (Collaborator) left a comment


Great @F-A , thank you for helping to make tsfresh faster.

df_impute.values[ indices ] = newValues

# Ensure a type of "np.float64"
df_impute.astype(np.float64)

you have to set copy=False here


otherwise it will just return a copy
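For context on the copy=False remark: `astype` returns a new object rather than converting in place, so the result has to be assigned back (or `copy=False` passed so pandas can avoid an unnecessary copy when no conversion is needed). A small illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})          # int64 column

# astype returns a new object; discarding the result leaves df unchanged
df.astype(np.float64)
assert df["a"].dtype != np.float64

# the result has to be assigned back for the conversion to take effect
df = df.astype(np.float64)
assert df["a"].dtype == np.float64
```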

@MaxBenChrist (Collaborator)

Note: I did not improve the performance of impute_dataframe_range(...) since it would have been too much of a hassle to implement all the checks in that function, e.g. in case the min/max/median values of each column are not present.

I think the only check that is missing in your version is if there are no finite values at all

            if len(finite_values_in_column) == 0:
                _logger.warning(
                    "The replacement column {} did not have any finite values. Filling with zeros.".format(column_name))
                df_impute[column_name] = [0] * len(column)
                continue

@F-A (Contributor, Author) commented Jan 6, 2017

The check for the case where there are no finite values is done in the get_range_values_per_column function, lines 173-175. But it does not log anything.

@MaxBenChrist (Collaborator) commented Jan 6, 2017

Ah yes, you are right. Couldn't you do something like

if sum(masked.mask.sum(axis=0) == masked.data.shape[0]) > 0:
    # print logging message

just before line 173?
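The suggested check can be demonstrated on a small masked array. In the toy data below (an assumed example, not from the PR), the second column has no finite values at all, so every entry in it is masked:

```python
import numpy as np

data = np.array([[1.0, np.nan],
                 [2.0, np.inf]])
masked = np.ma.masked_invalid(data)   # masks NaN and +/-inf

# a column has no finite values when all of its entries are masked
fully_masked = masked.mask.sum(axis=0) == masked.data.shape[0]
if fully_masked.any():
    print("columns without finite values:", np.nonzero(fully_masked)[0])
```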

@coveralls

Coverage increased (+0.04%) to 96.727% when pulling 84cfa62 on F-A:master into 32a53ca on blue-yonder:master.


@F-A (Contributor, Author) commented Jan 6, 2017

The log message was previously emitted directly in the impute function. In this function I do not create the masked array.
I could add the warning in get_range_values_per_column, where I have the masked array, but that makes less sense...

@jneuff (Collaborator) commented Jan 6, 2017

@F-A Nice work!

The impute function is provided as a default for users who don't know which impute function to choose (we currently have two: impute_dataframe_range and impute_zero). Semantically, your proposal is an improvement of the impute_dataframe_range function. So that's the function it should replace, and impute should remain just a wrapper.

@jneuff (Collaborator) commented Jan 6, 2017

Ah, and while we're at it: I'd prefer the impute function to return the altered dataframe. Currently we're inconsistent about this.

@MaxBenChrist (Collaborator) commented Jan 6, 2017

Ah, and while we're at it: I'd prefer the impute function to return the altered dataframe. Currently we're inconsistent about this.

The problem is that you create (temporary) copies of the dataframe if you impute that way. For big dataframes this could result in out-of-memory errors. I think we should do all imputing in place.

@F-A (Contributor, Author) commented Jan 6, 2017

Rewriting impute_dataframe_range with its current behavior would take me too much time: it assumes the max/min/median dicts given as parameters can be incomplete, which results in a lot of column-wise checks. The impute function computes these dictionaries directly, so we know they are complete, which makes the implementation easier.

As I mentioned earlier, I do not have much time to devote to tsfresh. I just wanted to test it on my data, and I fixed this issue to do so. This PR corresponds to my fix.
You can accept it if you want and improve the functions impute_dataframe_range and impute_zero in the same spirit later on. But this would be too much for me right now, sorry.

@MaxBenChrist (Collaborator) commented Jan 6, 2017

Thank you. No worries, we will take over from here.

@F-A (Contributor, Author) commented Jan 7, 2017 via email

@MaxBenChrist (Collaborator)

Yes of course. I'll grant you access to it when I am in front of a computer.

Perfect, let me know when you arranged it.

@jneuff (Collaborator) commented Jan 8, 2017

The problem is that you create (temporary) copies of the dataframe if you impute that way. For big dataframes this could result in out-of-memory errors. I think we should do all imputing in place.

We could do it in place and return the dataframe. That way, we won't have to change the function if we later decide to make our data flow more explicit as some kind of pipeline.

@F-A No worries. Thanks for contributing!

I just noticed that we currently have quite some code duplication between impute_dataframe_range and get_range_values_per_column. We should simplify that anyway. This PR seems like a good opportunity.
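The impute-in-place-and-return pattern discussed in this thread might look like the following. This is a hypothetical sketch, not tsfresh's actual API; `impute_inplace` is an assumed name, and it uses the column median for all non-finite values purely for brevity:

```python
import numpy as np
import pandas as pd

def impute_inplace(df):
    """Impute non-finite values on the passed DataFrame, then return the
    same object so the call still composes in a pipeline-style data flow.
    Hypothetical sketch; not tsfresh's actual function."""
    for col in df.columns:
        finite = df[col][np.isfinite(df[col])]
        fill = finite.median() if len(finite) else 0.0
        df[col] = df[col].where(np.isfinite(df[col]), fill)
    return df

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
result = impute_inplace(df)
assert result is df   # the same object comes back, no extra DataFrame copy
```

Returning the mutated frame keeps callers flexible without allocating a second full-size DataFrame, which addresses both the memory concern and the pipeline concern raised above.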

@F-A (Contributor, Author) commented Jan 8, 2017

OK, I added both of you as collaborators on my fork. You should be able to push now.

@MaxBenChrist (Collaborator)

Thanks, let's use this opportunity to refactor the module. I will have a look at it during the next days.

@nils-braun (Collaborator)

@MaxBenChrist Is there news on this branch and PR? Or do you want to tell us your plans for refactoring - so we can help out? ;-)

@MaxBenChrist (Collaborator) commented Jan 23, 2017

I do not have a fixed plan yet, but here are some aspects I want to tackle:

  • remove code duplication between impute_dataframe_range and get_range_values_per_column
  • maybe remove impute_dataframe_zero
  • check if we still need get_range_values_per_column if we use the masks
  • ...

Maybe I will find some time during the next days to do this.

@MaxBenChrist (Collaborator)

I finally found some hours to spend on this. I finished refactoring the dataframe_functions module.

However, I found that the usage of the impute functions in the transformers is not right. I will try to fix that as well.

@MaxBenChrist (Collaborator) commented Jan 31, 2017

Well, I finished everything. If either @nils-braun or @jneuff gives his go, we can merge this thing.

Edit: There are still some issues with Python 3.5; I will fix them.

@coveralls

Coverage increased (+0.09%) to 96.815% when pulling b2ceb90 on F-A:master into 31ee428 on blue-yonder:master.

@coveralls

Coverage increased (+0.09%) to 96.815% when pulling 2601cb8 on F-A:master into 31ee428 on blue-yonder:master.

@coveralls commented Feb 1, 2017

Coverage increased (+0.09%) to 96.815% when pulling 9cfa13f on F-A:master into 31ee428 on blue-yonder:master.

@coveralls commented Feb 1, 2017

Coverage increased (+0.09%) to 96.815% when pulling 8f65f5b on F-A:master into 31ee428 on blue-yonder:master.


@coveralls

Coverage increased (+0.09%) to 96.815% when pulling 49d5121 on F-A:master into 31ee428 on blue-yonder:master.


@jneuff (Collaborator) commented Feb 1, 2017

Thanks @F-A and @MaxBenChrist !

@jneuff merged commit e3dea11 into blue-yonder:master on Feb 1, 2017
@MaxBenChrist (Collaborator)

Awesome! We are on the way to making tsfresh blazing fast ;)

@F-A (Contributor, Author) commented Feb 2, 2017

F-A commented Feb 2, 2017

Nice refactoring! I'm glad my original investigations helped you :)
