
Improve performance of the impute function. #135

Merged

merged 26 commits from F-A:master into blue-yonder:master on Feb 1, 2017

Conversation

@F-A (Contributor) commented Jan 6, 2017

Improve performance of the functions:

  • utilities.dataframe_function.get_range_values_per_column(df)
  • utilities.dataframe_function.impute(df_impute)
    More specifically: apply the impute function directly on numpy array to improve computation time.

Now the impute function runs in 109 ms (60 samples, 14256 features, i.e. columns).

Note: I did not improve the performance of impute_dataframe_range(...) since it would have been too much of a hassle to implement all the checks in that function, e.g. in case the min/max/median values of each column are not present. In our case we call get_range_values_per_column just before, so these checks are not necessary.
So I just reimplemented impute_dataframe_range directly in the impute function. This is less modular. Maybe you could pack this code into a new impute_dataframe_range function.

Solves issue #123.
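The numpy-based approach described above can be sketched roughly as follows. This is a minimal illustration, not tsfresh's actual implementation; the function name `impute_numpy` and its column-statistics logic are hypothetical:

```python
import numpy as np
import pandas as pd

def impute_numpy(df):
    """Replace NaN with the column median, +inf with the column max and
    -inf with the column min, working on the underlying numpy array.
    Hypothetical sketch of the approach described in this PR."""
    arr = df.to_numpy(dtype=np.float64)
    masked = np.ma.masked_invalid(arr)               # masks NaN and +/-inf
    col_median = np.ma.median(masked, axis=0).filled(0)
    col_max = masked.max(axis=0).filled(0)
    col_min = masked.min(axis=0).filled(0)
    # look up the per-column replacement value for each bad position
    for bad, repl in [(np.isnan(arr), col_median),
                      (np.isposinf(arr), col_max),
                      (np.isneginf(arr), col_min)]:
        arr[bad] = repl[np.nonzero(bad)[1]]
    return pd.DataFrame(arr, index=df.index, columns=df.columns)
```

Working on the raw float array avoids pandas' per-column overhead, which is where the speedup reported above plausibly comes from.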

@coveralls commented Jan 6, 2017

Coverage Status

Coverage increased (+0.04%) to 96.727% when pulling c3205e9 on F-A:master into 32a53ca on blue-yonder:master.

@MaxBenChrist (Collaborator) left a comment


Great @F-A , thank you for helping to make tsfresh faster.

df_impute.values[ indices ] = newValues

# Ensure a type of "np.float64"
df_impute.astype(np.float64)

you have to set copy=False here


otherwise it will just return a copy
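For context on the copy=False remark: `astype` returns a new object rather than converting in place, so the result has to be assigned back (or `copy=False` passed so pandas can avoid an unnecessary copy when no conversion is needed). A small illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})          # int64 column

# astype returns a new object; discarding the result leaves df unchanged
df.astype(np.float64)
assert df["a"].dtype != np.float64

# the result has to be assigned back for the conversion to take effect
df = df.astype(np.float64)
assert df["a"].dtype == np.float64
```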

@MaxBenChrist (Collaborator)

Note: I did not improve the performance of impute_dataframe_range(...) since it would have been too much of a hassle to implement all the checks in that function, e.g. in case the min/max/median values of each column are not present.

I think the only check that is missing in your version is if there are no finite values at all

            if len(finite_values_in_column) == 0:
                _logger.warning(
                    "The replacement column {} did not have any finite values. Filling with zeros.".format(column_name))
                df_impute[column_name] = [0] * len(column)
                continue

@F-A (Contributor, Author) commented Jan 6, 2017

The check for the case where there are no finite values is done in the get_range_values_per_column function, lines 173-175. But it does not log anything.

@MaxBenChrist (Collaborator) commented Jan 6, 2017

Ah yes, you are right. Couldn't you do something like

if sum(masked.mask.sum(axis=0) == masked.data.shape[0]) > 0:
    # print logging message

just before line 173?
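The suggested check can be demonstrated on a small masked array. In the toy data below (an assumed example, not from the PR), the second column has no finite values at all, so every entry in it is masked:

```python
import numpy as np

data = np.array([[1.0, np.nan],
                 [2.0, np.inf]])
masked = np.ma.masked_invalid(data)   # masks NaN and +/-inf

# a column has no finite values when all of its entries are masked
fully_masked = masked.mask.sum(axis=0) == masked.data.shape[0]
if fully_masked.any():
    print("columns without finite values:", np.nonzero(fully_masked)[0])
```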

@coveralls

Coverage increased (+0.04%) to 96.727% when pulling 84cfa62 on F-A:master into 32a53ca on blue-yonder:master.


@F-A (Contributor, Author) commented Jan 6, 2017

The log message was previously emitted directly in the impute function. In this function I do not create the masked array.
I could add the warning in get_range_values_per_column, where I have the masked array, but that makes less sense...

@jneuff (Collaborator) commented Jan 6, 2017

@F-A Nice work!

The impute function is provided as a default for users who don't know which impute function to choose (we currently have two: impute_dataframe_range and impute_zero). Semantically, your proposal is an improvement of the impute_dataframe_range function. So that's the function it should replace, and impute should remain just a wrapper.

@jneuff (Collaborator) commented Jan 6, 2017

Ah, and while we're at it: I'd prefer the impute function to return the altered dataframe. Currently we're inconsistent about this.

@MaxBenChrist (Collaborator) commented Jan 6, 2017

Ah, and while we're at it: I'd prefer the impute function to return the altered dataframe. Currently we're inconsistent about this.

The problem is that you create (temporary) copies of the dataframe if you impute that way. For big dataframes this could result in out-of-memory errors. I think we should do all imputing in place.

@F-A (Contributor, Author) commented Jan 6, 2017

Rewriting impute_dataframe_range with its current behavior would take me too much time: it assumes the max/min/median dicts given as parameters can be incomplete, which results in a lot of column-wise checks. The impute function computes these dictionaries directly, so we know they are complete, which makes the implementation easier.

As I mentioned earlier, I do not have much time to devote to tsfresh. I just wanted to test it on my data, and I fixed this issue to do so. This PR corresponds to my fix.
You can accept it if you want and improve the functions impute_dataframe_range and impute_zero in the same spirit later on. But this would be too much for me right now, sorry.

@MaxBenChrist (Collaborator) commented Jan 6, 2017

Thank you. No worries, we will take over from here.

@F-A (Contributor, Author) commented Jan 7, 2017 via email

@MaxBenChrist (Collaborator)

Yes of course. I'll grant you access to it when I am in front of a computer.

Perfect, let me know when you arranged it.

@jneuff (Collaborator) commented Jan 8, 2017

The problem is that you create (temporary) copies of the dataframe if you impute that way. For big dataframes this could result in out-of-memory errors. I think we should do all imputing in place.

We could do it in place and return the dataframe. That way, we won't have to change the function if we later decide to make our data flow more explicit as some kind of pipeline.

@F-A No worries. Thanks for contributing!

I just noticed that we currently have quite some code duplication between impute_dataframe_range and get_range_values_per_column. We should simplify that anyway. This PR seems like a good opportunity.
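The impute-in-place-and-return pattern discussed in this thread might look like the following. This is a hypothetical sketch, not tsfresh's actual API; `impute_inplace` is an assumed name, and it uses the column median for all non-finite values purely for brevity:

```python
import numpy as np
import pandas as pd

def impute_inplace(df):
    """Impute non-finite values on the passed DataFrame, then return the
    same object so the call still composes in a pipeline-style data flow.
    Hypothetical sketch; not tsfresh's actual function."""
    for col in df.columns:
        finite = df[col][np.isfinite(df[col])]
        fill = finite.median() if len(finite) else 0.0
        df[col] = df[col].where(np.isfinite(df[col]), fill)
    return df

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
result = impute_inplace(df)
assert result is df   # the same object comes back, no extra DataFrame copy
```

Returning the mutated frame keeps callers flexible without allocating a second full-size DataFrame, which addresses both the memory concern and the pipeline concern raised above.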

@F-A (Contributor, Author) commented Jan 8, 2017

OK, I added both of you as collaborators on my fork. You should be able to push now.

@MaxBenChrist (Collaborator)

Thanks, let's use this opportunity to refactor the module. I will have a look at it during the next days.

@nils-braun (Collaborator)

@MaxBenChrist Is there news on this branch and PR? Or do you want to tell us your plans for refactoring - so we can help out? ;-)

@MaxBenChrist (Collaborator) commented Jan 23, 2017

I do not have a fixed plan yet, but here are some aspects I want to tackle:

  • remove code duplication between impute_dataframe_range and get_range_values_per_column
  • maybe remove impute_dataframe_zero
  • check if we still need get_range_values_per_column if we use the masks
  • ...

Maybe I will find some time during the next days to do this.

@MaxBenChrist (Collaborator)

I finally found some hours to spend on this. I finished refactoring the dataframe_functions module.

However, I found that the usage of the impute functions in the transformers is not right. I will try to fix that as well.

@MaxBenChrist (Collaborator) commented Jan 31, 2017

Well, I finished everything. If either @nils-braun or @jneuff gives his go, we can merge this thing.

Edit: There are still some issues with Python 3.5; I will fix them.

@coveralls

Coverage increased (+0.09%) to 96.815% when pulling b2ceb90 on F-A:master into 31ee428 on blue-yonder:master.

@coveralls

Coverage increased (+0.09%) to 96.815% when pulling 2601cb8 on F-A:master into 31ee428 on blue-yonder:master.

@coveralls commented Feb 1, 2017

Coverage increased (+0.09%) to 96.815% when pulling 9cfa13f on F-A:master into 31ee428 on blue-yonder:master.

@coveralls commented Feb 1, 2017

Coverage increased (+0.09%) to 96.815% when pulling 8f65f5b on F-A:master into 31ee428 on blue-yonder:master.


@coveralls

Coverage increased (+0.09%) to 96.815% when pulling 49d5121 on F-A:master into 31ee428 on blue-yonder:master.


@jneuff (Collaborator) commented Feb 1, 2017

Thanks @F-A and @MaxBenChrist !

@jneuff merged commit e3dea11 into blue-yonder:master on Feb 1, 2017
@MaxBenChrist (Collaborator)

Awesome! We are on the way to making tsfresh blazing fast ;)

@F-A (Contributor, Author) commented Feb 2, 2017

F-A commented Feb 2, 2017

Nice refactoring! I'm glad my original investigations helped you :)
