## Introduction

This blog post describes how the data was processed for the DengAI data science challenge on DrivenData.org.

## Motivation for Pipelines
When it comes to feature engineering, or any kind of data processing, we quickly run into the problem of carrying the data from one processing step to the next. This is not only tedious but also prone to errors. Consider the following example, in which we first impute the missing observations with KNN imputation and then scale the data. In the traditional approach, we have to assign the result of the first step (imputation) to a variable and then feed that variable into the second step (scaling), as shown in the following snippet:

In [16]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

data = np.array([[1, np.nan, 3], [4, 5, 6], [7, 8, np.nan]])

imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(data)

scaled_data = MinMaxScaler().fit_transform(imputed_data)
print(scaled_data)

[[0.  0.5 0. ]
 [0.5 0.  1. ]
 [1.  1.  0.5]]


The example above makes it apparent why this method is prone to errors: every intermediate result has to be assigned and passed on by hand, and it is easy to pass the wrong variable. Always assigning the output of one step as the input of the next is also unnecessarily tedious. One way around this is to use pipelines. A *pipeline* is a concept from scikit-learn in which all steps are chained and executed one after another. For the example above, this looks as follows:

In [12]:
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    KNNImputer(n_neighbors=2),
    MinMaxScaler()
)

pipeline.fit_transform(data)

array([[0. , 0.5, 0. ],
       [0.5, 0. , 1. ],
       [1. , 1. , 0.5]])

Pipelines not only have a much nicer syntax, they are also especially useful when we transform not just one series but several. Furthermore, they are particularly useful in prediction tasks, because training and test data have to be kept strictly separate there.

A popular beginner mistake in a forecasting challenge is *data leakage*. Data leakage describes the situation in which information from the test data finds its way into the training data (or the other way around), so that the two are no longer independent. When, for example, we mean-encode a variable using the entire dataset and conduct the train-test split afterwards, the mean-encoded column in the test data contains information from the training data, and vice versa. Data leakage is problematic because it biases the prediction result, typically making the model look better on the test data than it would on truly unseen data.
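To make the mean-encoding example concrete, here is a minimal sketch on a made-up DataFrame. The column names `city` and `cases` and all values are hypothetical and are not taken from the DengAI data. The sketch computes the per-category mean once on the full dataset and once on the training rows only, and then encodes the held-out rows with both mappings.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "b"],
    "cases": [10, 20, 60, 1, 2, 9],
})

# Simple deterministic split: the last row of each city is held out as test data.
test_mask = df.groupby("city").cumcount(ascending=False) == 0
train, test = df[~test_mask], df[test_mask]

# Leaky variant: the category means are computed on ALL rows, test rows included.
leaky_means = df.groupby("city")["cases"].mean()

# Leak-free variant: the category means are computed on the training rows only.
train_means = train.groupby("city")["cases"].mean()

# Encode the test rows with both mappings and compare.
comparison = pd.DataFrame({
    "city": test["city"],
    "leaky_encoding": test["city"].map(leaky_means),
    "train_only_encoding": test["city"].map(train_means),
})
print(comparison)
```

In the leaky variant the encoded value of each test row already contains that row's own target, which is exactly the information a model is not supposed to see before prediction time.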

Trying to prevent data leakage without using pipelines is quite tedious, as the following code snippet shows:

In [18]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, test_size=0.33)

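# Fit every step on the training data only
imputer.fit(train_data)
imputed_train = imputer.transform(train_data)

scaler = MinMaxScaler().fit(imputed_train)
scaled_train = scaler.transform(imputed_train)

# Transform the test data with the steps fitted on the training data
imputed_test = imputer.transform(test_data)
scaled_test = scaler.transform(imputed_test)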
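
For comparison, the same train-test workflow could be written with a pipeline roughly as follows. This is a sketch under a few assumptions: it reuses the toy array from the earlier examples, and `random_state=0` is added here only to make the split reproducible. The pipeline is fitted on the training rows and then applied, unchanged, to the test rows.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Same toy array as in the earlier examples.
data = np.array([[1, np.nan, 3], [4, 5, 6], [7, 8, np.nan]])

# random_state is an assumption added here to make the split reproducible.
train_data, test_data = train_test_split(data, test_size=0.33, random_state=0)

pipeline = make_pipeline(KNNImputer(n_neighbors=2), MinMaxScaler())

# fit_transform learns the imputation donors and the min/max on the training rows only.
scaled_train = pipeline.fit_transform(train_data)

# transform reuses the fitted steps, so nothing is learned from the test rows.
scaled_test = pipeline.transform(test_data)

print(scaled_train.shape, scaled_test.shape)
```

Because the fitted state lives inside the pipeline object, there is no intermediate variable that could accidentally be fitted on, or mixed up with, the test data. Note that scaled test values can fall outside the range from 0 to 1, since the minimum and maximum come from the training rows.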
