# Example of extracting features from dataframes with Datetime indices

Assuming that time-varying measurements are taken at regular intervals is sufficient in many situations. For many tasks, however, it matters **when** a measurement was made. In healthcare, for example, the interval between measurements of vital signs carries crucial information.

tsfresh now supports feature calculators that use the index of the time series container to calculate features. The only requirement for these functions is that the index of the input dataframe is of type `pd.DatetimeIndex`. These functions are collected in the new class `TimeBasedFCParameters`.

Note that the behaviour of all other functions is unaffected. The settings parameter of `extract_features()` can contain both index-dependent functions and 'regular' functions.

In [44]:
import pandas as pd
from tsfresh.feature_extraction import extract_features
# TimeBasedFCParameters contains all functions that use the Datetime index of the timeseries container
from tsfresh.feature_extraction.settings import TimeBasedFCParameters  

# Build a time series container with Datetime indices

Let's build a dataframe with a datetime index. It must be in long format, with a `value` and a `kind` column, because each measurement has its own timestamp; measurements of different kinds are not assumed to be simultaneous.

In [38]:
df = pd.DataFrame({"id": ["a", "a", "a", "a", "b", "b", "b", "b"], 
                   "value": [1, 2, 3, 1, 3, 1, 0, 8],
                   "kind": ["temperature", "temperature", "pressure", "pressure",
                            "temperature", "temperature", "pressure", "pressure"]},
                   index=pd.DatetimeIndex(
                       ['2019-03-01 10:04:00', '2019-03-01 10:50:00', '2019-03-02 00:00:00', '2019-03-02 09:04:59',
                        '2019-03-02 23:54:12', '2019-03-03 08:13:04', '2019-03-04 08:00:00', '2019-03-04 08:01:00']
                   ))
df = df.sort_index()
df

| | id | value | kind |
|---|---|---|---|
| 2019-03-01 10:04:00 | a | 1 | temperature |
| 2019-03-01 10:50:00 | a | 2 | temperature |
| 2019-03-02 00:00:00 | a | 3 | pressure |
| 2019-03-02 09:04:59 | a | 1 | pressure |
| 2019-03-02 23:54:12 | b | 3 | temperature |
| 2019-03-03 08:13:04 | b | 1 | temperature |
| 2019-03-04 08:00:00 | b | 0 | pressure |
| 2019-03-04 08:01:00 | b | 8 | pressure |
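
To see how irregular the sampling in this container is, the gaps between consecutive measurements can be listed per `(id, kind)` series. This is a minimal sketch that uses only pandas; the name `gaps` is introduced here for illustration and is not part of tsfresh.

```python
import pandas as pd

# Same container as above, rebuilt so the snippet runs on its own.
df = pd.DataFrame(
    {
        "id": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "value": [1, 2, 3, 1, 3, 1, 0, 8],
        "kind": ["temperature", "temperature", "pressure", "pressure",
                 "temperature", "temperature", "pressure", "pressure"],
    },
    index=pd.DatetimeIndex([
        "2019-03-01 10:04:00", "2019-03-01 10:50:00",
        "2019-03-02 00:00:00", "2019-03-02 09:04:59",
        "2019-03-02 23:54:12", "2019-03-03 08:13:04",
        "2019-03-04 08:00:00", "2019-03-04 08:01:00",
    ]),
).sort_index()

# The time-based calculators require a DatetimeIndex.
assert isinstance(df.index, pd.DatetimeIndex)

# Time elapsed between consecutive measurements within each (id, kind) series.
gaps = {
    key: group.index.to_series().diff().dropna().tolist()
    for key, group in df.groupby(["id", "kind"])
}
```

The gaps range from one minute (pressure for id `b`) to more than nine hours (pressure for id `a`), which is exactly the information a calculator that ignores the index would discard.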


Right now, `TimeBasedFCParameters` contains only `linear_trend_timewise`, which calculates a linear trend but performs the linear regression against the elapsed time between measurements, expressed in hours. As always, you can add your own functions in `tsfresh/feature_extraction/feature_calculators.py`.
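
The idea of regressing against elapsed hours can be sketched outside tsfresh. The snippet below is an illustration, not the library's implementation: it fits the line with `numpy.polyfit` (an assumption made to keep the example dependency-light) on the two temperature readings of id `a`.

```python
import numpy as np
import pandas as pd

# Temperature readings for id "a" from the example container.
series = pd.Series(
    [1, 2],
    index=pd.DatetimeIndex(["2019-03-01 10:04:00", "2019-03-01 10:50:00"]),
)

# Elapsed time since the first measurement, in hours, used as the regressor.
hours = np.asarray((series.index - series.index[0]).total_seconds()) / 3600.0

# Least-squares line through (hours, value); polyfit returns [slope, intercept].
slope, intercept = np.polyfit(hours, series.to_numpy(), deg=1)
```

The two readings are 46 minutes apart, so the slope is 1 / (46 / 60), roughly 1.304 units per hour, with an intercept of 1.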

In [39]:
settings_time = TimeBasedFCParameters()
settings_time

{'linear_trend_timewise': [{'attr': 'pvalue'},
  {'attr': 'rvalue'},
  {'attr': 'intercept'},
  {'attr': 'slope'},
  {'attr': 'stderr'}]}
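
Since the settings are a mapping from calculator names to parameter lists, index-dependent and regular calculators can sit side by side. The following is a hypothetical hand-written settings dictionary; the choice of `mean` and `maximum` as the regular calculators is only for illustration.

```python
# Hypothetical mixed settings: one time-based calculator plus two regular ones.
# Calculators that take no parameters are mapped to None.
mixed_settings = {
    "linear_trend_timewise": [{"attr": "slope"}, {"attr": "intercept"}],
    "mean": None,
    "maximum": None,
}
```

Such a dictionary could be passed as `default_fc_parameters=mixed_settings` to `extract_features()` in place of `settings_time`.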

We extract the features as usual, specifying the value, kind, and id columns.

In [41]:
X_tsfresh = extract_features(df, column_id="id", column_value='value', column_kind='kind',
                             default_fc_parameters=settings_time)
X_tsfresh.head()

Feature Extraction: 100%|██████████| 4/4 [00:00<00:00, 591.10it/s]


| id | `pressure__linear_trend_timewise__attr_"intercept"` | `pressure__linear_trend_timewise__attr_"pvalue"` | `pressure__linear_trend_timewise__attr_"rvalue"` | `pressure__linear_trend_timewise__attr_"slope"` | `pressure__linear_trend_timewise__attr_"stderr"` | `temperature__linear_trend_timewise__attr_"intercept"` | `temperature__linear_trend_timewise__attr_"pvalue"` | `temperature__linear_trend_timewise__attr_"rvalue"` | `temperature__linear_trend_timewise__attr_"slope"` | `temperature__linear_trend_timewise__attr_"stderr"` |
|---|---|---|---|---|---|---|---|---|---|---|
| a | 3.0 | 0.0 | -1.0 | -0.22019 | 0.0 | 1.0 | 0.0 | 1.0 | 1.304348 | 0.0 |
| b | 0.0 | 0.0 | 1.0 | 480.0 | 0.0 | 3.0 | 0.0 | -1.0 | -0.240545 | 0.0 |


The output looks just like usual. Each series here contains only two measurements, so both regressions fit the points exactly. If we compare the result with the 'regular' `linear_trend` feature calculator, we can see that the intercept, p-value and R value are the same, as we'd expect; only the slope is now different.

In [42]:
settings_regular = {'linear_trend': [
  {'attr': 'pvalue'},
  {'attr': 'rvalue'},
  {'attr': 'intercept'},
  {'attr': 'slope'},
  {'attr': 'stderr'}
]}

In [43]:
X_tsfresh = extract_features(df, column_id="id", column_value='value', column_kind='kind',
                             default_fc_parameters=settings_regular)
X_tsfresh.head()

Feature Extraction: 100%|██████████| 4/4 [00:00<00:00, 2517.59it/s]


| id | `pressure__linear_trend__attr_"intercept"` | `pressure__linear_trend__attr_"pvalue"` | `pressure__linear_trend__attr_"rvalue"` | `pressure__linear_trend__attr_"slope"` | `pressure__linear_trend__attr_"stderr"` | `temperature__linear_trend__attr_"intercept"` | `temperature__linear_trend__attr_"pvalue"` | `temperature__linear_trend__attr_"rvalue"` | `temperature__linear_trend__attr_"slope"` | `temperature__linear_trend__attr_"stderr"` |
|---|---|---|---|---|---|---|---|---|---|---|
| a | 3.0 | 0.0 | -1.0 | -2.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| b | 0.0 | 0.0 | 1.0 | 8.0 | 0.0 | 3.0 | 0.0 | -1.0 | -2.0 | 0.0 |
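
The difference between the two slopes comes down to the regressor: measurement position versus elapsed hours. The sketch below reproduces both for the pressure readings of id `b`, using `numpy.polyfit` as a stand-in for the regression (an assumption, not tsfresh's own code).

```python
import numpy as np
import pandas as pd

# Pressure readings for id "b": two measurements one minute apart.
series = pd.Series(
    [0, 8],
    index=pd.DatetimeIndex(["2019-03-04 08:00:00", "2019-03-04 08:01:00"]),
)
values = series.to_numpy()

# Regular trend: regress against the positions 0, 1, 2, ...
positions = np.arange(len(series))
slope_per_step = np.polyfit(positions, values, deg=1)[0]

# Time-wise trend: regress against elapsed hours since the first measurement.
hours = np.asarray((series.index - series.index[0]).total_seconds()) / 3600.0
slope_per_hour = np.polyfit(hours, values, deg=1)[0]
```

A rise of 8 units per step becomes 480 units per hour because the step lasted one minute, which matches the `slope` columns of the two tables above.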


# Writing your own time-based feature calculators

Writing your own time-based feature calculators is no different from usual. Only two new properties must be set using the `@set_property` decorator:

1) `@set_property("input", "pd.Series")` declares that the function's input is a `pd.Series` rather than a numpy array. This allows the index to be used.
2) `@set_property("index_type", pd.DatetimeIndex)` declares that the index of the input is a `pd.DatetimeIndex`, allowing the function to perform calculations based on time data types.

For example, if we want to write a function that calculates the time between the first and last measurement, it could look something like this:

```python
import pandas as pd
from tsfresh.feature_extraction.feature_calculators import set_property


@set_property("input", "pd.Series")
@set_property("index_type", pd.DatetimeIndex)
def timespan(x, param):
    ix = x.index

    # Difference between the last and the first timestamp in seconds, converted to hours.
    times_seconds = (ix[-1] - ix[0]).total_seconds()
    return times_seconds / float(3600)
```
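
To try the calculator on a plain series without running a full extraction, it can be called directly. In the sketch below, `set_property` is a simplified stand-in written for this example (it only attaches an attribute to the function); in real use the decorator would be imported from tsfresh. The name `pressure_a` is likewise introduced only for illustration.

```python
import pandas as pd


def set_property(key, value):
    """Simplified stand-in for the tsfresh decorator: attach an attribute to the function."""
    def decorate(func):
        setattr(func, key, value)
        return func
    return decorate


@set_property("input", "pd.Series")
@set_property("index_type", pd.DatetimeIndex)
def timespan(x, param):
    ix = x.index
    # Hours between the first and the last timestamp; `param` is unused here.
    return (ix[-1] - ix[0]).total_seconds() / 3600.0


# Pressure readings for id "a" from the example container.
pressure_a = pd.Series(
    [3, 1],
    index=pd.DatetimeIndex(["2019-03-02 00:00:00", "2019-03-02 09:04:59"]),
)
span_hours = timespan(pressure_a, None)
```

The two readings are 9 hours, 4 minutes and 59 seconds apart, so `span_hours` is 32699 / 3600, a little over 9.08 hours.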