# How to split data for pre-processing?

Assume that we have a time-series dataset that we want to pre-process, including the extraction of window and lag features. The latter use previous time-series data points to compute the features.
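To make the terms concrete, here is a minimal NumPy sketch of a lag feature and a window feature. The helpers `lag_feature` and `rolling_mean` and the toy series `y_demo` are hypothetical names used only for illustration; they are not part of the pipeline below.

```python
import numpy as np

def lag_feature(y, lag):
    """Return y shifted back by `lag` steps; the first `lag` entries are NaN."""
    out = np.full(len(y), np.nan)
    out[lag:] = y[:len(y) - lag]
    return out

def rolling_mean(y, window):
    """Mean of the `window` values strictly before each point (NaN if history is too short)."""
    out = np.full(len(y), np.nan)
    for t in range(window, len(y)):
        out[t] = y[t - window:t].mean()
    return out

# Hypothetical toy series
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

lag_1 = lag_feature(y_demo, lag=1)      # [nan, 1.0, 2.0, 3.0, 4.0]
win_2 = rolling_mean(y_demo, window=2)  # [nan, nan, 1.5, 2.5, 3.5]
```

Both features are undefined for the first few points because no history exists yet, which is why a NaN-removal step usually follows.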

In [None]:
# Sklearn imports
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import TimeSeriesSplit

# Feature Engine imports 
from feature_engine.imputation import DropMissingData

# Sktime imports
from sktime.transformations.series.lag import Lag

# Time series data
y = ... 

# Set up the pipeline
feature_pipeline = Pipeline([
    # 1. Standard scaler learns parameters from the data
    ('scaler', StandardScaler()),
    # 2. Simple imputer (with constant value) does not learn parameters from the data
    ('imputer', SimpleImputer(strategy="constant", fill_value=666)),
    # 3. Lag feature generator looks at historical values to generate features 
    ('lagger', Lag()),
    # 4. NaN-remover (could be also handled somewhere else)
    ("nan-remover", DropMissingData())
])

# Split the data into training and test
train, test = next(TimeSeriesSplit().split(y))

# -- The key part --
# Train the pipeline using the training data
feature_pipeline.fit(y[train])

# Apply the transformation to the training data to later train the machine learning model
fy_train = feature_pipeline.transform(y[train])

# Apply the transformation to generate the test data
# - I apply the transformation to all data points (train + test).
# - That allows me to generate, e.g., lag-features (using historical values), for ALL test points.
# - There are no missing values in the test set features (most likely)
fy_test = feature_pipeline.transform(y)[test]  # <-- !!!

The code above outlines my idea of how the transformer pipeline can be used to generate the training and test features.

#### Performance

We might want to restrict the test-set transformation so that only the required part of the training set is processed. This would reduce the computation time.
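One possible way to do this, sketched under the assumption that the features are pure lag features with a known maximum lag and that the test indices form one contiguous block, is to pass only the last `max_lag` training points as context. The helper `lag_matrix`, the toy series, and the index arrays are hypothetical and stand in for the real pipeline.

```python
import numpy as np

def lag_matrix(y, lags):
    """Stack lagged copies of y column-wise; rows lacking history contain NaN."""
    cols = []
    for lag in lags:
        col = np.full(len(y), np.nan)
        col[lag:] = y[:len(y) - lag]
        cols.append(col)
    return np.column_stack(cols)

# Hypothetical toy data and split
y_demo = np.arange(20, dtype=float)
train_idx = np.arange(0, 15)
test_idx = np.arange(15, 20)
lags = [1, 3]
max_lag = max(lags)

# Full approach: transform everything, then select the test rows
full = lag_matrix(y_demo, lags)[test_idx]

# Restricted approach: keep only the last `max_lag` training points as context
start = test_idx[0] - max_lag
context = y_demo[start:test_idx[-1] + 1]
restricted = lag_matrix(context, lags)[max_lag:]

# Both approaches yield identical test features
same_features = np.array_equal(full, restricted)
```

For window features, the same idea would apply with the window length in place of the maximum lag. Transformers that keep other state across rows would need a separate check.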

#### Data leakage

There is a potential data leak between training and test here. It arises from the overlap between the training and test sets when the test-set lag features are computed. Some of the training data points will be transformed by a pipeline that was fitted on those same points.

Whether that is a real problem probably depends on the actual feature extraction pipeline.
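A rough way to inspect the overlap for a given pipeline is to check two things: which test rows draw on training values, and whether the training features change when the test values are altered. The sketch below does this for pure lag features only. `lag_matrix`, the toy series, and the split are hypothetical stand-ins, and a real pipeline with fitted steps may behave differently.

```python
import numpy as np

def lag_matrix(y, lags):
    """Stack lagged copies of y column-wise; rows lacking history contain NaN."""
    cols = []
    for lag in lags:
        col = np.full(len(y), np.nan)
        col[lag:] = y[:len(y) - lag]
        cols.append(col)
    return np.column_stack(cols)

# Hypothetical toy data and split
y_demo = np.arange(20, dtype=float)
train_idx = np.arange(0, 15)
test_idx = np.arange(15, 20)
lags = [1, 3]

# Training features computed from the training data only
train_only = lag_matrix(y_demo[train_idx], lags)

# The same rows after the test values have been changed arbitrarily
y_perturbed = y_demo.copy()
y_perturbed[test_idx] += 100.0
train_from_all = lag_matrix(y_perturbed, lags)[train_idx]

# Lag features only look backwards, so the training rows are unaffected
train_unaffected = np.array_equal(train_only, train_from_all, equal_nan=True)

# Test rows with at least one lag that reaches back into the training set
overlap_rows = [int(t) for t in test_idx
                if any(t - lag < test_idx[0] for lag in lags)]
```

In this toy setup only the first `max(lags)` test rows use training values, and no test value enters a training feature.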

#### Alternative approach 

We can make a small modification to the code to clearly separate training and test. However, because we use lag features, we will then "lose" some data points for training AND testing.

In [None]:
# Apply the transformation to generate the test data
fy_test = feature_pipeline.transform(y[test])
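To get a feeling for how much data is lost, the following sketch counts the rows that survive NaN removal when each split is transformed separately. It assumes pure lag features, and `lag_matrix`, `drop_incomplete`, and the toy split are hypothetical names for illustration.

```python
import numpy as np

def lag_matrix(y, lags):
    """Stack lagged copies of y column-wise; rows lacking history contain NaN."""
    cols = []
    for lag in lags:
        col = np.full(len(y), np.nan)
        col[lag:] = y[:len(y) - lag]
        cols.append(col)
    return np.column_stack(cols)

def drop_incomplete(features):
    """Remove every row that contains at least one NaN."""
    return features[~np.isnan(features).any(axis=1)]

# Hypothetical toy data and split
y_demo = np.arange(20, dtype=float)
train_idx = np.arange(0, 15)
test_idx = np.arange(15, 20)
lags = [1, 3]

# Each split is transformed on its own, so each loses its first max(lags) rows
fy_train_demo = drop_incomplete(lag_matrix(y_demo[train_idx], lags))
fy_test_demo = drop_incomplete(lag_matrix(y_demo[test_idx], lags))

n_train_rows = fy_train_demo.shape[0]  # 15 - 3 = 12
n_test_rows = fy_test_demo.shape[0]    # 5 - 3 = 2
```

With a short test set and a large maximum lag, the number of usable test rows can shrink to almost nothing, which is the trade-off against the overlapping approach.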