ZeroDivisionError when trying to use RelevantFeatureAugmenter #524

nicholasg97 · 2019-03-31T09:13:27Z

I'm getting an error when I attempt to use the RelevantFeatureAugmenter. Using it by itself and using it within a pipeline both produce the same error.

I'm having an issue very similar to the one described here.

My original DataFrame "reframed" consists of id and time columns plus two additional columns: one with my time series lagged by 1 ( 'var1(t-1)' ) and the other with the target time series ( 'var1(t)' ). Thus my data format is flat. My target and feature columns are both numerical with negative, zero, and positive values.

My code is as follows,

reframed_features = reframed.drop(columns=['var1(t)'])
y = reframed['var1(t)']

augmenter = RelevantFeatureAugmenter(column_id='id', column_sort='time')

augmenter.set_timeseries_container(reframed_features)
X_train = pd.DataFrame(index=y.index)

augmenter.fit(X_train, y)

The shape of the y Series is (25742,) and that of X_train is (25742, 0), while reframed_features has a shape of (25742, 3).

The error I'm getting is

File "...\tsfresh\utilities\dataframe_functions.py", line 138 in _normalize_input_to_internal_representation

timeseries_container[column_sort] = np.repeat(sort, (len(timeseries_container) // len(sort)))

ZeroDivisionError: integer division or modulo by zero

When I run the "pipeline_with_two_datasets.ipynb" everything works just fine.

If I add column_kind to the RelevantFeatureAugmenter, I get a ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Could this be because I'm attempting to extract values from only one feature? (I just tested this, and adding another feature doesn't seem to fix it.) Another place where things could be going wrong is that the dataframe I pass to set_timeseries_container is the same length as the X_test I'm attempting to extract features for. If this is the cause, is there no other way to extract the same features on a new data set? That would be a big issue for me, as I'm going to be using new "online" data all of the time. Please forgive me if I'm making a silly error. Thanks.

Your operating system: Windows 10
The version of tsfresh that you are using: 0.11.2


thbuerg · 2019-07-01T19:54:06Z

I experienced the same issue in an extract_features() call. Are there any updates on this?

nils-braun · 2019-11-17T16:26:24Z

Sorry for the long pause. I am not 100% sure I have understood your problem yet. Could you maybe post a small portion of your data, or at least a df.describe, so I can see the format?
What do you mean by flat? Since there is a division by 0, it would mean that your "sort" column has size 0...

Also, as things have changed, does it still occur with the newest version of tsfresh?

sokol11 · 2020-04-11T05:16:20Z

I am encountering this as well with tsfresh==0.15.1. In my case, the extract_relevant_features call works fine, but when I try to use the RelevantFeatureAugmenter the aforementioned error is thrown.

For now, I am looking for a way to avoid using the RelevantFeatureAugmenter class. Is there a way to store the relevant features I discover with extract_relevant_features on my training data set so I can later extract them from the test set? Would I just create a dictionary of the form d = {'feature_name': None} and pass it to the extract_features call, like so: extract_features(X_test, default_fc_parameters = d)?

Thanks!

nils-braun · 2020-04-11T07:46:40Z

Yes there is :-) Have a look at https://tsfresh.readthedocs.io/en/latest/text/feature_extraction_settings.html#a-handy-trick-do-i-really-have-to-create-the-dictionary-by-hand
It explains how to get the settings dictionary from the list of columns.
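To make the idea concrete, here is a minimal pure-Python sketch of how a settings dictionary could be rebuilt from feature column names. It assumes the column naming convention `kind__feature__param_value` and is only a simplified illustration: `settings_from_column_names` is a hypothetical name, and in practice the helper described on the linked page (`tsfresh.feature_extraction.settings.from_columns`) should be used instead.

```python
import ast


def settings_from_column_names(columns):
    """Rebuild a {kind: {feature: params}} mapping from feature column names.

    Simplified illustration only: assumes names of the form
    kind__feature or kind__feature__param_value__param_value...
    """
    settings = {}
    for column in columns:
        kind, feature, *param_parts = column.split("__")
        per_kind = settings.setdefault(kind, {})
        if not param_parts:
            # Feature calculators without parameters map to None.
            per_kind[feature] = None
            continue
        params = {}
        for part in param_parts:
            # "r_0.1" -> name "r", value 0.1
            name, raw_value = part.rsplit("_", 1)
            params[name] = ast.literal_eval(raw_value)
        per_kind.setdefault(feature, []).append(params)
    return settings


relevant_columns = [
    "value__mean",
    "value__large_standard_deviation__r_0.1",
]
print(settings_from_column_names(relevant_columns))
```

Because the result is keyed by kind (here "value"), a dictionary of this shape belongs in the kind_to_fc_parameters argument of extract_features rather than in default_fc_parameters.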

Back to the issue itself: Could someone share a minimal data example? I guess something is non-optimal with the data you feed in.

sokol11 · 2020-04-11T11:10:18Z

@nils-braun Thanks for the fast reply. I am really at a loss as to what could be wrong with my data, especially since the extract_relevant_features call works as expected. My df consists of about 300 time series with over a million observations in each. I guess I can try taking out columns one by one to see if any particular column gives the transformer trouble. Unfortunately, this is not a high priority for me at this time, so I may not get to it anytime soon.

Incidentally, was tsfresh largely built to extract features from financial data? I wonder what additional value the tsfresh feature calculators bring over the traditional technical analysis indicators. Sorry, I know this is off-topic. But I just could not resist asking the question.

nils-braun · 2020-04-27T14:42:49Z

Could any of you (@sokol11 @thbuerg) share some example data so I can debug the issue?

Concerning your second question: I am not really an expert on financial time series. tsfresh was built for an Industry 4.0 application, but it is today also used for financial data (as far as I know).
tsfresh was primarily built for fast exploration and to be used in combination with ML (even though there are other good examples...). If you have good domain knowledge, it is probably easy to beat anything built on top of tsfresh, just because we can only do "the general stuff".
Having said that, if you know of any interesting financial features, we can also add them to tsfresh. They might be handy for other applications as well!

JacquesDonnelly · 2020-05-15T15:11:23Z

I have also run into this issue (on WSL) with:

python 3.8.2
pandas 1.0.3
tsfresh 0.16.0
numpy 1.18.4

Code:

import pandas as pd
import numpy as np
from tsfresh.transformers import FeatureAugmenter
from sklearn.pipeline import Pipeline
from tsfresh.utilities.dataframe_functions import roll_time_series

series_df = pd.DataFrame()
series_df["value"] = np.random.randint(0, 100, 10)
series_df["id"] = 1
series_df["time"] = np.arange(0, 10)
y = pd.Series(np.random.randint(0, 100, 10))
X = pd.DataFrame(index=y.index)

rolled_df = roll_time_series(
    series_df, column_id="id", column_sort="time", max_timeshift=4
)

pipeline = Pipeline(
    [
        (
            "augmenter",
            FeatureAugmenter(
                column_id="id",
                column_sort="time",
                timeseries_container=rolled_df,
                default_fc_parameters={"large_standard_deviation": [{"r": 0.1}]},
            ),
        ),
    ]
)

pipeline.fit_transform(X, y)

and Traceback:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-13-e9dd7746b7ab> in <module>
     30 )
     31 
---> 32 pipeline.fit_transform(X, y)

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    389                 return Xt
    390             if hasattr(last_step, 'fit_transform'):
--> 391                 return last_step.fit_transform(Xt, y, **fit_params)
    392             else:
    393                 return last_step.fit(Xt, y, **fit_params).transform(Xt)

~/.local/lib/python3.8/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    572         else:
    573             # fit method of arity 2 (supervised transformation)
--> 574             return self.fit(X, y, **fit_params).transform(X)
    575 
    576 

~/.local/lib/python3.8/site-packages/tsfresh/transformers/feature_augmenter.py in transform(self, X)
    194         timeseries_container_X = restrict_input_to_index(self.timeseries_container, self.column_id, X.index)
    195 
--> 196         extracted_features = extract_features(timeseries_container_X,
    197                                               default_fc_parameters=self.default_fc_parameters,
    198                                               kind_to_fc_parameters=self.kind_to_fc_parameters,

~/.local/lib/python3.8/site-packages/tsfresh/feature_extraction/extraction.py in extract_features(timeseries_container, default_fc_parameters, kind_to_fc_parameters, column_id, column_sort, column_kind, column_value, chunksize, n_jobs, show_warnings, disable_progressbar, impute_function, profile, profiling_filename, profiling_sorting, distributor)
    129     # See the function normalize_input_to_internal_representation for more information.
    130     df_melt, column_id, column_kind, column_value = \
--> 131         dataframe_functions._normalize_input_to_internal_representation(
    132             timeseries_container=timeseries_container,
    133             column_id=column_id, column_kind=column_kind,

~/.local/lib/python3.8/site-packages/tsfresh/utilities/dataframe_functions.py in _normalize_input_to_internal_representation(timeseries_container, column_id, column_sort, column_kind, column_value)
    340                                        value_name=column_value, var_name=column_kind)
    341         timeseries_container = timeseries_container.set_index(index_name)
--> 342         timeseries_container[column_sort] = np.tile(sort, (len(timeseries_container) // len(sort)))
    343 
    344     # Check kind column

ZeroDivisionError: integer division or modulo by zero

JacquesDonnelly · 2020-05-15T15:49:43Z

I believe the issue is coming from line 194 of transformers.feature_augmenter. The time series container passed to extract_features is given by

timeseries_container_X = restrict_input_to_index(self.timeseries_container, self.column_id, X.index)

So in my case,
self.timeseries_container is rolled_df
self.column_id is "id"
X.index is RangeIndex(start=0, stop=10, step=1)
and the resulting timeseries_container_X is an empty dataframe with index [].

Then it seems the offending line in utilities.dataframe_functions is line 319 with

sort = range(len(timeseries_container))
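The chain of events described above can be reproduced without tsfresh. The following is a simplified pure-Python sketch, not the library's code: `restrict_to_index` and `tile_factor` are hypothetical stand-ins for the restriction step and for the `len(timeseries_container) // len(sort)` expression from the traceback, and the container is modelled as a list of dicts.

```python
def restrict_to_index(rows, column_id, index):
    """Keep only the rows whose id appears in the index (simplified stand-in)."""
    wanted = set(index)
    return [row for row in rows if row[column_id] in wanted]


def tile_factor(rows):
    """Mimic the failing expression: the sort range is built from the container."""
    sort = range(len(rows))
    return len(rows) // len(sort)


# A tiny container with a single time series whose id is "a".
rows = [{"id": "a", "time": t, "value": t * 2} for t in range(3)]

matching = restrict_to_index(rows, "id", ["a"])    # index contains the id
empty = restrict_to_index(rows, "id", [0, 1, 2])   # index shares nothing with the ids

print(len(matching), len(empty))
print(tile_factor(matching))

try:
    tile_factor(empty)
except ZeroDivisionError as err:
    print("empty container:", err)
```

When the index has no label in common with the id column, the restricted container is empty, so the sort range has length 0 and the integer division fails.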

JacquesDonnelly · 2020-05-15T22:18:43Z

To fix my issue, I just had to make sure the index of X matches the ids in rolled_df.

This was achieved with

X = pd.DataFrame(index=rolled_df["id"].unique())

rather than

X = pd.DataFrame(index=y.index)

nils-braun · 2020-05-16T08:42:57Z

Thanks @JacquesDonnelly ! Now it starts to make sense to me :-)
That is actually a good point: we assume that the X dataframe you pass into the FeatureAugmenter has an index that only contains entries present in the id column of the time series container. Honestly, we did not stress this enough in the documentation, and we should include an assert for that.
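Until such a check exists in the library, the assumption can be verified by hand before fitting. The following is a minimal sketch of one way to do it, not the code tsfresh itself uses: `check_index_matches_ids` is a hypothetical helper, and it accepts any iterables so that plain lists or pandas objects can be passed.

```python
def check_index_matches_ids(x_index, container_ids):
    """Raise ValueError if any label in x_index is missing from container_ids."""
    known_ids = set(container_ids)
    missing = [label for label in x_index if label not in known_ids]
    if missing:
        raise ValueError(
            f"{len(missing)} index entries are not ids in the time series "
            f"container, e.g. {missing[:5]}"
        )


# Ids as they might appear in the id column of a container.
container_ids = ["a", "a", "b", "b", "c"]

# An index built from the ids passes silently.
check_index_matches_ids(["a", "b", "c"], container_ids)

# An index copied from y (here 0, 1, 2) does not match any id.
try:
    check_index_matches_ids(range(3), container_ids)
except ValueError as err:
    print(err)
```

With the objects from the example earlier in this thread, the call would look like check_index_matches_ids(X.index, rolled_df["id"]).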

Does this maybe also solve the issue of the others?

nils-braun · 2020-06-11T10:40:19Z

I assume that the assertion introduced in #690 also helps solve the other problems from this thread? If not, please re-open this issue! :-)

MaxBenChrist added the bug label Mar 31, 2019

nils-braun mentioned this issue May 16, 2020

Prevent wrong indices in the passed time series container #690

Merged

nils-braun closed this as completed Jun 11, 2020
