Preserve index when column_value and column_kind not provided #576

Merged (2 commits) on Nov 17, 2019

@jeffzi jeffzi commented Oct 16, 2019

Hello,

This PR fixes #557. Support for index-based features, introduced in tsfresh 0.12.0, currently works only when column_kind and column_value are passed to extract_features().

Here is an example of the issue:

import pandas as pd
from tsfresh.feature_extraction import extract_features
from tsfresh.feature_extraction.settings import TimeBasedFCParameters  

df = pd.DataFrame({"id": ["a", "a", "a", "a", "b", "b", "b", "b"], 
                   "value": [1, 2, 3, 1, 3, 1, 0, 8],
                   "kind": ["temperature", "temperature", "pressure", "pressure",
                            "temperature", "temperature", "pressure", "pressure"]})

# Without pd.DatetimeIndex -> empty (as expected)
extracted = extract_features(df, column_id="id", column_value='value', column_kind='kind',
                             default_fc_parameters=TimeBasedFCParameters())
extracted.info()
#> <class 'pandas.core.frame.DataFrame'>
#> Index: 0 entries
#> Empty DataFrame

# With pd.DatetimeIndex -> works
time_index = pd.DatetimeIndex(
    ['2019-03-01 10:04:00', '2019-03-01 10:50:00', '2019-03-02 00:00:00', '2019-03-02 09:04:59',
     '2019-03-02 23:54:12', '2019-03-03 08:13:04', '2019-03-04 08:00:00', '2019-03-04 08:01:00']
)
df = df.set_index(time_index).sort_index()
extracted = extract_features(df, column_id="id", column_value='value', column_kind='kind',
                             default_fc_parameters=TimeBasedFCParameters())
extracted.info()
#> <class 'pandas.core.frame.DataFrame'>
#> Index: 2 entries, a to b
#> Data columns (total 10 columns):
#> pressure__linear_trend_timewise__attr_"intercept"       2 non-null float64
#> pressure__linear_trend_timewise__attr_"pvalue"          2 non-null float64
#> pressure__linear_trend_timewise__attr_"rvalue"          2 non-null float64
#> pressure__linear_trend_timewise__attr_"slope"           2 non-null float64
#> pressure__linear_trend_timewise__attr_"stderr"          2 non-null float64
#> temperature__linear_trend_timewise__attr_"intercept"    2 non-null float64
#> temperature__linear_trend_timewise__attr_"pvalue"       2 non-null float64
#> temperature__linear_trend_timewise__attr_"rvalue"       2 non-null float64
#> temperature__linear_trend_timewise__attr_"slope"        2 non-null float64
#> temperature__linear_trend_timewise__attr_"stderr"       2 non-null float64
#> dtypes: float64(10)
#> memory usage: 176.0+ bytes

# With pd.DatetimeIndex but without specifying column_value and column_kind
# -> empty without this PR
extracted = (df
             .groupby('kind')
             .apply(lambda g : extract_features(g, column_id="id",
                          default_fc_parameters=TimeBasedFCParameters()))
            )
extracted.info()
#> <class 'pandas.core.frame.DataFrame'>
#> MultiIndex: 0 entries
#> Empty DataFrame

Created on 2019-10-16 by the reprexpy package

The bug comes from the index not being preserved when melting the dataframe in _normalize_input_to_internal_representation, specifically when column_kind and column_value are None.
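
To illustrate the mechanism described above, here is a minimal sketch using plain pandas only. It is not the tsfresh code; the frame and the column names `_variables` and `_values` are chosen for illustration, and it assumes that no existing column is already called `index`.

```python
import pandas as pd

# Wide frame with a DatetimeIndex, one id column and two value columns.
wide = pd.DataFrame(
    {"id": ["a", "a", "b", "b"],
     "temperature": [1, 2, 3, 1],
     "pressure": [3, 1, 0, 8]},
    index=pd.DatetimeIndex(["2019-03-01", "2019-03-02", "2019-03-03", "2019-03-04"]),
)

# By default pd.melt discards the original index and returns a fresh RangeIndex,
# so any time information stored in the index is lost.
melted = pd.melt(wide, id_vars=["id"], var_name="_variables", value_name="_values")
print(type(melted.index).__name__)

# One way to keep it: move the index into a column, carry that column through
# the melt as an id variable, then restore it as the index afterwards.
preserved = pd.melt(
    wide.reset_index(),            # unnamed index becomes a column called "index"
    id_vars=["index", "id"],
    var_name="_variables",
    value_name="_values",
).set_index("index")
print(type(preserved.index).__name__)
```

Newer pandas releases also accept `ignore_index=False` in `pd.melt`, which may be a simpler route where that version can be assumed.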

@coveralls

coveralls commented Oct 16, 2019

Coverage remained the same at ?% when pulling 23b98e0 on jeffzi:fix_index_based_features into c3d35c4 on blue-yonder:master.

@jeffzi jeffzi changed the title from "Preserve index when column_value and _column_kind not provided" to "Preserve index when column_value and column_kind not provided" on Oct 17, 2019

@nils-braun
Collaborator

Thanks for the fix.
Without further checking (sorry, not much time now, will do tomorrow): don't you end up with the additional index column?

@jeffzi
Contributor Author

jeffzi commented Nov 16, 2019

The index doesn't appear in the results of extract_features.

The line below sets the index after the melt, so the index is no longer in the column list.

timeseries_container = timeseries_container.set_index('index')

@nils-braun
Collaborator

Ah, sorry. I forgot that drop=True is the default in set_index (I thought it was drop=False).
Then this is fine!
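
For reference, a small sketch of the `set_index` behaviour in question, on a toy frame that is unrelated to the tsfresh internals:

```python
import pandas as pd

df = pd.DataFrame({"index": [10, 20, 30], "value": [1.0, 2.0, 3.0]})

# drop=True is the default: the column is moved into the index
# and removed from the column list.
dropped = df.set_index("index")

# drop=False keeps the column in addition to using it as the index.
kept = df.set_index("index", drop=False)

print(list(dropped.columns))
print(list(kept.columns))
```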

@nils-braun
Collaborator

Do we also want that for the other melt in the lines above? (If you want, you can also pull the two common lines with the melt and the tile out of the if branches to reduce code duplication.)

@jeffzi
Contributor Author

jeffzi commented Nov 16, 2019

Yes, I think the other melt should be modified as well. I hadn't tested the column_sort argument before, my bad!

I've deduplicated the code as you suggested. Let me know if that's ok :)
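
As a rough illustration of that kind of deduplication, here is a hypothetical helper. The name `melt_keep_index` and its exact shape are assumptions for this sketch, not the code merged into tsfresh. Only the choice of sort values differs between the two branches; the melt, the tile and the index restoration are shared. It assumes a non-empty frame with no column named `index`.

```python
import numpy as np
import pandas as pd


def melt_keep_index(df, column_id, column_sort=None):
    """Hypothetical sketch: melt a wide frame while preserving its index."""
    if column_sort is not None:
        # Use the given sort column and keep it out of the value columns.
        sort = df[column_sort].values
        wide = df.drop(columns=column_sort)
    else:
        # No sort column given: fall back to the row position.
        column_sort = "_sort"
        sort = np.arange(len(df))
        wide = df

    # Shared part: carry the index through the melt as an id variable.
    wide = wide.rename_axis("index").reset_index()
    long = pd.melt(wide, id_vars=["index", column_id],
                   var_name="_variables", value_name="_values")
    # melt stacks one block of rows per value column, each in the original
    # row order, so the sort values repeat once per block.
    long[column_sort] = np.tile(sort, len(long) // len(sort))
    return long.set_index("index")


df = pd.DataFrame(
    {"id": ["a", "a", "b"], "t": [2, 1, 0], "x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]},
    index=pd.DatetimeIndex(["2019-03-01", "2019-03-02", "2019-03-03"]),
)
with_sort = melt_keep_index(df, "id", column_sort="t")
no_sort = melt_keep_index(df.drop(columns="t"), "id")
```

Both results keep the original `DatetimeIndex`, so index-based feature calculators would still see the timestamps in either branch.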

@nils-braun
Collaborator

Looks good as far as I can tell!
