ZeroDivisionError when trying to use RelevantFeatureAugmenter #524

Closed
nicholasg97 opened this issue Mar 31, 2019 · 11 comments

@nicholasg97

nicholasg97 commented Mar 31, 2019

I'm getting an error when I attempt to use the RelevantFeatureAugmenter. Using it by itself and using it within a pipeline both produce the same error.

I'm having an issue very similar to the one described here.

My original DataFrame "reframed" consists of id and time columns, as well as two additional columns: one with my time series lagged by 1 ( 'var1(t-1)' ) and the other with the target time series ( 'var1(t)' ). Thus my data format is flat. My target and feature columns are both numerical, with negative, zero, and positive values.

My code is as follows:

import pandas as pd
from tsfresh.transformers import RelevantFeatureAugmenter

# "reframed" is the flat DataFrame described above
reframed_features = reframed.drop(columns=['var1(t)'])
y = reframed['var1(t)']

augmenter = RelevantFeatureAugmenter(column_id='id', column_sort='time')

augmenter.set_timeseries_container(reframed_features)
X_train = pd.DataFrame(index=y.index)

augmenter.fit(X_train, y)

The shape of the y Series is (25742,) and of X_train is (25742, 0), while reframed_features has a shape of (25742, 3).

The error I'm getting is

File "...\tsfresh\utilities\dataframe_functions.py", line 138 in _normalize_input_to_internal_representation

timeseries_container[column_sort] = np.repeat(sort, (len(timeseries_container) // len(sort)))

ZeroDivisionError: integer division or modulo by zero

When I run the "pipeline_with_two_datasets.ipynb" notebook, everything works just fine.

If I add column_kind to the RelevantFeatureAugmenter, I get "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." Could this be because I'm attempting to extract values from only one feature? (I just tested this, and adding another feature doesn't seem to correct things.) Another place where things could be going wrong is that the dataframe I pass to "set_timeseries_container" is the same length as the X_test I'm attempting to extract features for. If that is the cause, is there no other way to extract the same features on a new data set? That would be a big issue for me, as I'm going to be using new "online" data all the time. Please forgive me if I'm making a silly error. Thanks.

  1. Your operating system: Windows 10
  2. The version of tsfresh that you are using: 0.11.2
@thbuerg

thbuerg commented Jul 1, 2019

I experienced the same issue in an extract_features() call. Are there any updates on this?

@nils-braun
Collaborator

Sorry for the long pause. I am not 100% sure I have understood your problem yet. Could you maybe post a small portion of your data, or at least a df.describe, so I can see the format?
What do you mean by flat? Because there is a division by 0, it would mean that your "sort" column has size 0...

Also, as things have changed, does it still occur with the newest version of tsfresh?

@sokol11

sokol11 commented Apr 11, 2020

I am encountering this as well with tsfresh==0.15.1. In my case, the extract_relevant_features call works fine, but when I try to use the RelevantFeatureAugmenter, the aforementioned error is thrown.

For now, I am looking for a way to avoid using the RelevantFeatureAugmenter class. Is there a way to store the relevant features I discover with extract_relevant_features on my training data set, so that I can later extract them from the test set? Would I just create a dictionary of the form d = {'feature_name': None} and pass it to the extract_features call, like so: extract_features(X_test, default_fc_parameters = d)?

Thanks!

@nils-braun
Collaborator

Yes there is :-) Have a look at https://tsfresh.readthedocs.io/en/latest/text/feature_extraction_settings.html#a-handy-trick-do-i-really-have-to-create-the-dictionary-by-hand
It explains how to get the settings dictionary from the list of columns.
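
As a rough illustration of what such a settings dictionary looks like, the sketch below rebuilds one from feature column names. It assumes the usual tsfresh column naming, where the kind, the calculator name, and any parameters are joined by double underscores. `settings_from_column_names` is a hypothetical stand-in written for this example, not a tsfresh function, and the column names are made up. In practice the helper described on the linked page (`from_columns` in `tsfresh.feature_extraction.settings`) should be preferred.

```python
import ast
from collections import defaultdict


def settings_from_column_names(columns):
    """Hypothetical helper: map feature column names to a nested settings dict.

    Assumed column format: <kind>__<calculator>[__<param>_<value>...]
    """
    settings = defaultdict(dict)
    for column in columns:
        kind, calculator, *raw_params = column.split("__")
        if not raw_params:
            # Calculators without parameters are stored as None.
            settings[kind][calculator] = None
            continue
        params = {}
        for raw in raw_params:
            # Split on the last underscore, since parameter names may contain "_".
            name, value = raw.rsplit("_", 1)
            params[name] = ast.literal_eval(value)
        if settings[kind].get(calculator) is None:
            settings[kind][calculator] = []
        settings[kind][calculator].append(params)
    return dict(settings)


# Made-up names of features that were found relevant on the training data.
relevant_columns = ["value__mean", "value__large_standard_deviation__r_0.1"]
settings = settings_from_column_names(relevant_columns)
print(settings)
```

A dictionary of this shape, keyed by kind, corresponds to the kind_to_fc_parameters argument of extract_features rather than to default_fc_parameters.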

Back to the issue itself: Could someone share a minimal data example? Because I guess something is non-optimal with the data you feed in

@sokol11

sokol11 commented Apr 11, 2020

@nils-braun Thanks for the fast reply. I am really at a loss as to what could be wrong with my data, especially since the extract_relevant_features call works as expected. My df consists of about 300 time series with over a million observations in each. I guess I can try taking out columns one by one and see if any particular column gives the transformer trouble. Unfortunately, this is not a high priority for me at this time, so I may not get to it anytime soon.

Incidentally, was tsfresh largely built to extract features from financial data? I wonder what additional value the tsfresh feature calculators bring over the traditional technical analysis indicators. Sorry, I know this is off-topic, but I just could not resist asking the question.

@nils-braun
Collaborator

Could any of you (@sokol11 @thbuerg) share some example data so I can debug the issue?

Concerning your second question: I am not really an expert on financial time series. tsfresh was built for an Industry 4.0 application, but today it is also used for financial data (as far as I know).
tsfresh was primarily built for fast exploration and to be used in combination with ML (even though there are other good examples...). If you have good domain knowledge, it is probably easy to beat anything built on top of tsfresh, just because we can only do "the general stuff".
Having said that, if you know of any interesting financial features, we can also add them to tsfresh. They might be handy for other applications as well!

@JacquesDonnelly

I have also run into this issue (on WSL) with:

  • python 3.8.2
  • pandas 1.0.3
  • tsfresh 0.16.0
  • numpy 1.18.4

Code:

import pandas as pd
import numpy as np
from tsfresh.transformers import FeatureAugmenter
from sklearn.pipeline import Pipeline
from tsfresh.utilities.dataframe_functions import roll_time_series

series_df = pd.DataFrame()
series_df["value"] = np.random.randint(0, 100, 10)
series_df["id"] = 1
series_df["time"] = np.arange(0, 10)
y = pd.Series(np.random.randint(0, 100, 10))
X = pd.DataFrame(index=y.index)

rolled_df = roll_time_series(
    series_df, column_id="id", column_sort="time", max_timeshift=4
)

pipeline = Pipeline(
    [
        (
            "augmenter",
            FeatureAugmenter(
                column_id="id",
                column_sort="time",
                timeseries_container=rolled_df,
                default_fc_parameters={"large_standard_deviation": [{"r": 0.1}]},
            ),
        ),
    ]
)

pipeline.fit_transform(X, y)

and Traceback:

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-13-e9dd7746b7ab> in <module>
     30 )
     31 
---> 32 pipeline.fit_transform(X, y)

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    389                 return Xt
    390             if hasattr(last_step, 'fit_transform'):
--> 391                 return last_step.fit_transform(Xt, y, **fit_params)
    392             else:
    393                 return last_step.fit(Xt, y, **fit_params).transform(Xt)

~/.local/lib/python3.8/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    572         else:
    573             # fit method of arity 2 (supervised transformation)
--> 574             return self.fit(X, y, **fit_params).transform(X)
    575 
    576 

~/.local/lib/python3.8/site-packages/tsfresh/transformers/feature_augmenter.py in transform(self, X)
    194         timeseries_container_X = restrict_input_to_index(self.timeseries_container, self.column_id, X.index)
    195 
--> 196         extracted_features = extract_features(timeseries_container_X,
    197                                               default_fc_parameters=self.default_fc_parameters,
    198                                               kind_to_fc_parameters=self.kind_to_fc_parameters,

~/.local/lib/python3.8/site-packages/tsfresh/feature_extraction/extraction.py in extract_features(timeseries_container, default_fc_parameters, kind_to_fc_parameters, column_id, column_sort, column_kind, column_value, chunksize, n_jobs, show_warnings, disable_progressbar, impute_function, profile, profiling_filename, profiling_sorting, distributor)
    129     # See the function normalize_input_to_internal_representation for more information.
    130     df_melt, column_id, column_kind, column_value = \
--> 131         dataframe_functions._normalize_input_to_internal_representation(
    132             timeseries_container=timeseries_container,
    133             column_id=column_id, column_kind=column_kind,

~/.local/lib/python3.8/site-packages/tsfresh/utilities/dataframe_functions.py in _normalize_input_to_internal_representation(timeseries_container, column_id, column_sort, column_kind, column_value)
    340                                        value_name=column_value, var_name=column_kind)
    341         timeseries_container = timeseries_container.set_index(index_name)
--> 342         timeseries_container[column_sort] = np.tile(sort, (len(timeseries_container) // len(sort)))
    343 
    344     # Check kind column

ZeroDivisionError: integer division or modulo by zero

@JacquesDonnelly

JacquesDonnelly commented May 15, 2020

I believe the issue is coming from line 194 of transformers.feature_augmenter. The time series container passed to extract_features is given by

timeseries_container_X = restrict_input_to_index(self.timeseries_container, self.column_id, X.index)

So in my case,
self.timeseries_container is rolled_df
self.column_id is "id"
X.index is RangeIndex(start=0, stop=10, step=1)
and the resulting timeseries_container_X is an empty dataframe with index [].

Then it seems the offending line in utilities.dataframe_functions is line 319 with

sort = range(len(timeseries_container))
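
This failure mode can be reproduced in a minimal sketch that uses only pandas. The id values are made up, and the `isin` filter is an assumed stand-in for what restrict_input_to_index does, not a copy of the tsfresh code.

```python
import pandas as pd

# Toy container whose ids (100 and 200) are made up for this example.
container = pd.DataFrame(
    {
        "id": [100] * 5 + [200] * 5,
        "time": list(range(5)) * 2,
        "value": list(range(10)),
    }
)

# X is indexed 0..9, so none of its index entries occur in the id column.
X = pd.DataFrame(index=pd.RangeIndex(10))

# Assumed stand-in for restrict_input_to_index: keep rows whose id is in X.index.
restricted = container[container["id"].isin(X.index)]

# With no rows left, the generated sort values are empty as well ...
sort = range(len(restricted))

# ... and the integer division seen in the traceback divides by zero.
try:
    repeats = len(restricted) // len(sort)
    raised = False
except ZeroDivisionError:
    raised = True

print(restricted.empty, raised)
```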

@JacquesDonnelly

To fix my issue, I just had to make sure the index of X matches the ids in rolled_df.

This was achieved with

X = pd.DataFrame(index=rolled_df["id"].unique())

rather than

X = pd.DataFrame(index=y.index)
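
The difference between the two choices can be checked without tsfresh. The sketch below uses a made-up container and an `isin` filter as an assumed stand-in for the restriction step. `rows_kept` is a hypothetical helper that counts how many container rows survive for a given X.

```python
import pandas as pd

# Made-up container with two series, identified by ids 100 and 200.
container = pd.DataFrame(
    {
        "id": [100] * 3 + [200] * 3,
        "time": [0, 1, 2] * 2,
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# A target with the default 0..5 index, one entry per row rather than per id.
y = pd.Series([0, 1, 0, 1, 0, 1])


def rows_kept(X):
    """Hypothetical helper: count container rows whose id appears in X.index."""
    return int(container["id"].isin(X.index).sum())


X_from_y = pd.DataFrame(index=y.index)                      # index 0..5
X_from_ids = pd.DataFrame(index=container["id"].unique())   # index 100, 200

print(rows_kept(X_from_y), rows_kept(X_from_ids))
```

With the id-based index every row of the container is kept, whereas the index taken from y leaves nothing to extract features from. Presumably y then also has to be indexed by the same ids, with one target value per series.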

@nils-braun
Collaborator

Thanks @JacquesDonnelly ! Now it starts to make sense to me :-)
That is actually a good point: we assume that the X dataframe you pass into the FeatureAugmenter has an index that contains only entries which are present in the id column of the time series container. Honestly, we did not stress this enough in the documentation and we should include an assert for that.

Does this maybe also solve the issue of the others?
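
Until such a check exists in the library, the assumption can be verified by hand before fitting. The following is a minimal sketch: `check_index_in_ids` is a hypothetical helper written for this example, not a tsfresh function, and the data is made up.

```python
import pandas as pd


def check_index_in_ids(X, timeseries_container, column_id):
    """Hypothetical pre-flight check: every index entry of X must be a known id."""
    ids = timeseries_container[column_id].unique()
    missing = X.index[~X.index.isin(ids)]
    if len(missing) > 0:
        raise ValueError(
            f"{len(missing)} index entries of X are not present in column "
            f"{column_id!r}, for example {list(missing[:3])}"
        )


# Made-up container with ids 100 and 200.
container = pd.DataFrame(
    {"id": [100, 100, 200, 200], "time": [0, 1, 0, 1], "value": [1, 2, 3, 4]}
)

X_ok = pd.DataFrame(index=[100, 200])
X_bad = pd.DataFrame(index=[0, 1, 2, 3])

check_index_in_ids(X_ok, container, "id")  # passes silently

try:
    check_index_in_ids(X_bad, container, "id")
    bad_rejected = False
except ValueError as err:
    bad_rejected = True
    print(err)
```

Raising a descriptive error at this point is easier to diagnose than the ZeroDivisionError that otherwise surfaces deep inside the extraction code.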

@nils-braun
Collaborator

I assume that the assertion introduced in #690 also helps solve the other problems from this thread? If not, please re-open this issue! :-)
