# Using Pipeline with separate datasets for train and test data

This notebook shows how to use the `RelevantFeatureAugmenter` in a pipeline that is trained on samples from one dataset, `df_train`, and then tested on samples from another, `df_test`.

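The trick is to call `ppl.set_params(fresh__timeseries_container=df)` with the matching dataset before each step: `df_train` before `fit` and `df_test` before `predict`. The `fresh__` prefix tells scikit-learn to set the parameter on the pipeline step named `fresh`.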
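To make the mechanism concrete, here is a minimal sketch of the same pattern with a toy transformer. `MeanAugmenter`, the `toy_*`/`ts_*` names and the data values are all hypothetical stand-ins invented for illustration; this is not how `RelevantFeatureAugmenter` is implemented, it only mimics the idea of a step that reads its features from an external container.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class MeanAugmenter(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the augmenter: adds the per-id mean of
    `value`, read from an externally supplied time series container."""

    def __init__(self, timeseries_container=None):
        self.timeseries_container = timeseries_container

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Compute one feature per id and align it with the index of X.
        means = self.timeseries_container.groupby("id")["value"].mean()
        out = X.copy()
        out["value__mean"] = means.reindex(X.index).to_numpy()
        return out


# Made-up long-format time series for two training ids and two test ids.
ts_train = pd.DataFrame({"id": [1, 1, 2, 2], "value": [0.0, 1.0, 10.0, 11.0]})
ts_test = pd.DataFrame({"id": [3, 3, 4, 4], "value": [0.2, 0.4, 9.0, 12.0]})
toy_y_train = pd.Series([0, 1], index=[1, 2])

# The feature matrices carry no columns, only the sample ids as index.
toy_X_train = pd.DataFrame(index=toy_y_train.index)
toy_X_test = pd.DataFrame(index=[3, 4])

toy_ppl = Pipeline([("fresh", MeanAugmenter()),
                    ("clf", DecisionTreeClassifier(random_state=0))])

# Point the "fresh" step at the training container, then fit.
toy_ppl.set_params(fresh__timeseries_container=ts_train)
toy_ppl.fit(toy_X_train, toy_y_train)

# Swap in the test container before predicting.
toy_ppl.set_params(fresh__timeseries_container=ts_test)
toy_pred = toy_ppl.predict(toy_X_test)
```

The pipeline object stays the same throughout; only the container that the `fresh` step reads from changes between `fit` and `predict`.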

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures

In [None]:
from tsfresh.examples import load_robot_execution_failures
from tsfresh.transformers import RelevantFeatureAugmenter

We are going to load the same dataset twice, but let's pretend that we are loading two separate datasets, `df_train` and `df_test`:

In [None]:
# Download the dataset (the function has to be called, not just referenced)
download_robot_execution_failures()
df_train, y_train = load_robot_execution_failures()
df_test, y_test = load_robot_execution_failures()
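If you would rather hold out a genuinely disjoint test set than load the same data twice, one option is to split the sample ids and filter the long-format container accordingly. The following is a minimal sketch on made-up toy data; the `toy_*` names and values are illustrative assumptions and not part of the robot execution failures dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up long-format time series: two time steps for each of four ids.
toy_ts = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4],
    "time": [0, 1, 0, 1, 0, 1, 0, 1],
    "value": [0.1, 0.2, 5.0, 5.1, 0.3, 0.1, 4.8, 5.2],
})
# One target per id, indexed by id.
toy_y = pd.Series([False, True, False, True], index=[1, 2, 3, 4])

# Split the targets; their indices tell us which ids land in each part.
toy_y_train, toy_y_test = train_test_split(toy_y, test_size=0.5, random_state=0)

# Keep only the time series rows that belong to each part.
toy_ts_train = toy_ts[toy_ts["id"].isin(toy_y_train.index)]
toy_ts_test = toy_ts[toy_ts["id"].isin(toy_y_test.index)]
```

Splitting on ids rather than on rows keeps every time series whole, so no sample contributes rows to both parts.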

In [None]:
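# Start from empty feature matrices that only carry the sample ids as index;
# the augmenter adds the extracted features as columns.
X_train = pd.DataFrame(index=y_train.index)
X_test = pd.DataFrame(index=y_test.index)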

In [None]:
ppl = Pipeline([('fresh', RelevantFeatureAugmenter(column_id='id', column_sort='time')),
                ('clf', RandomForestClassifier())])

In [None]:
ppl.set_params(fresh__timeseries_container=df_train)

In [None]:
ppl.fit(X_train, y_train)

In [None]:
ppl.set_params(fresh__timeseries_container=df_test)

In [None]:
y_pred = ppl.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))