Improve performance of the impute function. #135
Great @F-A , thank you for helping to make tsfresh faster.
df_impute.values[ indices ] = newValues

# Ensure a type of "np.float64"
df_impute.astype(np.float64)
you have to set copy=False here
otherwise it will just return a copy
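A small illustration of the point under review, as a sketch rather than the PR's code: `DataFrame.astype` returns a new object and never converts the frame it is called on, so a bare call has no effect on `df_impute`, and the result has to be assigned. The data and the name `df_converted` are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a single integer column.
df_impute = pd.DataFrame({"a": [1, 2, 3]})

# Calling astype and discarding the result leaves df_impute untouched.
df_impute.astype(np.float64)
dtype_after_discarded_call = df_impute["a"].dtype

# Assigning the result is what makes the float64 dtype visible to later code.
df_converted = df_impute.astype(np.float64)
```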
I think the only check that is missing in your version is the case where there are no finite values at all:
if len(finite_values_in_column) == 0:
    _logger.warning(
        "The replacement column {} did not have any finite values. Filling with zeros.".format(column_name))
    df_impute[column_name] = [0] * len(column)
    continue
The check for the case where there are no finite values is done in the "get_range_values_per_column" function (lines 173-175), but it does not log anything.
Ah yes, you are right. Couldn't you do something like
if sum(masked.mask.sum(axis=0) == masked.data.shape[0]) > 0:
    # print logging message
just before line 173?
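The suggested check can be sketched on a plain NumPy masked array as follows. This is only an illustration under assumptions: the data is made up, and the logger name and message wording are borrowed from the snippet quoted earlier in this thread, not from the actual tsfresh code at line 173.

```python
import logging

import numpy as np

_logger = logging.getLogger(__name__)

# Illustrative data: the second column has no finite values at all.
data = np.array([[1.0, np.nan],
                 [2.0, np.inf],
                 [np.nan, -np.inf]])

# masked_invalid masks NaN as well as +inf and -inf entries.
masked = np.ma.masked_invalid(data)

# A column is fully masked when its mask count equals the number of rows.
mask = np.ma.getmaskarray(masked)
fully_masked = mask.sum(axis=0) == data.shape[0]
columns_without_finite_values = np.flatnonzero(fully_masked).tolist()

for column_index in columns_without_finite_values:
    _logger.warning(
        "The replacement column %d did not have any finite values. "
        "Filling with zeros.", column_index)
```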
The log message was previously done in the impute function directly. In this function I do not create the masked array.
@F-A Nice work! The impute function is provided as a default for users who don't know which impute function to choose (we currently have two: impute_dataframe_range and impute_zero). Semantically, your proposal is an improvement of the impute_dataframe_range function. So that's the function it should replace, and impute should remain just a wrapper.
Ah, and while we're at it: I'd prefer the impute function to return the altered dataframe. Currently we're inconsistent about this.
The problem is that you create (temporary) copies of the dataframe if you impute that way. For big dataframes this could result in out-of-memory errors. I think we should do all imputing in place.
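The two wishes (return the altered data, avoid copies) are not mutually exclusive. A minimal sketch on a NumPy array, where `impute_zero_inplace` is a hypothetical helper and not the tsfresh function: it overwrites the caller's buffer and returns that same object. The only temporary is a boolean mask, not a second float array.

```python
import numpy as np

def impute_zero_inplace(values):
    """Hypothetical helper: overwrite non-finite entries with 0 in place."""
    # The boolean mask is a temporary; no copy of the float data is made.
    values[~np.isfinite(values)] = 0.0
    # Returning the same object allows chaining without copying.
    return values

data = np.array([[1.0, np.nan],
                 [np.inf, 4.0]])
result = impute_zero_inplace(data)
```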
Rewriting impute_dataframe_range with its current behavior would take me too much time: it assumes the max/min and median dicts given as parameters can be incomplete, which results in a lot of column-wise checks. The impute function computes these dictionaries directly, so we know they are complete, which makes the implementation easier. As I mentioned earlier, I do not have much time to devote to tsfresh. I just wanted to test it on my data and I fixed this issue to do so. This PR corresponds to my fix.
Thank you. No worries, we will take over from here.
Yes of course. I'll grant you access to it when I am in front of a computer.
On 6 Jan 2017 23:46, "Maximilian Christ" <notifications@github.com> wrote:
Thank you. no worries, we will take over from here.
May I push some changes to your master branch to update this pr?
Perfect, let me know once you have arranged it.
We could do it in place and return the dataframe. That way, we won't have to change the function if we later decide to make our data flow more explicit as some kind of pipeline. @F-A No worries. Thanks for contributing! I just noticed that we currently have quite some code duplication between
OK, I added both of you as collaborators on my fork. You should be able to push now.
Thanks, let's use this opportunity to refactor the module. I will have a look at it over the next few days.
@MaxBenChrist Is there news on this branch and PR? Or do you want to tell us your plans for refactoring - so we can help out? ;-)
I do not have a fixed plan yet but some aspects I want to tackle
Maybe I will find some time over the next few days to do this.
if dict does not contain value for column return value error
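A sketch of what such a check could look like. The helper name `check_columns_covered` and its parameters are hypothetical and chosen for illustration; they are not taken from the refactored module.

```python
def check_columns_covered(column_names, col_to_value, dict_name):
    """Hypothetical helper: raise if any column lacks a replacement value."""
    missing = [name for name in column_names if name not in col_to_value]
    if missing:
        raise ValueError(
            "{} is missing values for columns: {}".format(dict_name, missing))

# Complete dictionary: passes silently.
check_columns_covered(["a", "b"], {"a": 1.0, "b": 2.0}, "col_to_max")
```

Failing early like this avoids the per-column fallbacks that the older impute_dataframe_range needed for incomplete dictionaries.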
I finally found some hours to spend on this. I finished refactoring the dataframe_functions module. However, I found that the usage of the impute functions in the transformers is not right. I will try to fix that as well.
Well, I finished everything. If either @nils-braun or @jneuff gives the go-ahead, we can merge this thing. Edit: There are still some issues with Python 3.5, I will fix them.
Thanks @F-A and @MaxBenChrist !
Awesome! We are on our way to making tsfresh blazing fast ;)
Nice refactoring! I'm glad my original investigations helped you :)
Improve performance of the functions:
More specifically: apply the impute function directly on the underlying numpy array to improve computation time.
Now the impute function runs in 109 ms (60 samples, 14256 features, i.e. columns).
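For readers who want to see the idea in isolation, here is a minimal sketch of range imputation done directly on a NumPy array. It is not the code of this PR. It assumes the replacement rule used by tsfresh's range imputation (NaN becomes the column median, +inf the column maximum, -inf the column minimum, all taken over finite values) and fills columns without any finite value with zeros. The function name `impute_array_inplace` is hypothetical.

```python
import numpy as np

def impute_array_inplace(values):
    """Hypothetical sketch: column-wise range imputation on a 2-D float array."""
    # Mask every non-finite entry so the statistics only see finite values.
    masked = np.ma.masked_array(values, mask=~np.isfinite(values))

    # Columns without any finite value yield masked statistics; use 0 there.
    col_max = np.ma.filled(masked.max(axis=0), 0.0)
    col_min = np.ma.filled(masked.min(axis=0), 0.0)
    col_median = np.ma.filled(np.ma.median(masked, axis=0), 0.0)

    # Overwrite the offending cells with the statistic of their own column.
    _, cols = np.nonzero(np.isposinf(values))
    values[np.isposinf(values)] = col_max[cols]
    _, cols = np.nonzero(np.isneginf(values))
    values[np.isneginf(values)] = col_min[cols]
    _, cols = np.nonzero(np.isnan(values))
    values[np.isnan(values)] = col_median[cols]
    return values

values = np.array([[1.0, np.nan, np.nan],
                   [np.inf, 2.0, np.inf],
                   [3.0, -np.inf, -np.inf],
                   [np.nan, 4.0, np.nan]])
result = impute_array_inplace(values)
```

All statistics are computed once for the whole array instead of once per pandas column, which is where a speed-up of this kind would come from.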
Note: I did not improve the performance of impute_dataframe_range(...) since it would have been too much of a hassle to implement all the checks in that function, e.g. in case the min/max/median values of some columns are not present. In our case we call get_range_values_per_column just before, so these checks are not necessary.
So I just reimplemented the function impute_dataframe_range directly in the impute function. This is less modular. Maybe you could pack this code into a new impute_dataframe_range function.
Solves issue #123.