---
title: "Pipelines"
author: Daniel Redel
date: today
format:
  html:
    toc: true
    code-fold: false
    html-math-method: katex
jupyter: python3
---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

So far we have seen that several steps are often required to arrive at a final predictive model. Misusing some of these steps in connection with others has been so common in various fields of study that it has occasionally triggered warnings and criticism from the machine learning community.

In this chapter we focus primarily on **the proper implementation of such composite processes in connection with resampling evaluation rules**.

# Cross-Validation

**Idea**: in each iteration of cross-validation, the held-out fold serves as a test set for evaluating the surrogate model, so it must be treated as truly unseen data. In other words, the held-out fold should not be used in steps such as normalization or feature selection/extraction: anything these steps estimate from data (means, scales, feature rankings) must be computed from the training folds only and then applied to the held-out fold. After all, genuinely unseen data could not have been used in these steps.
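To make the idea concrete, here is a minimal sketch of a hand-written cross-validation loop in which normalization and feature selection are fitted on the training folds only. It assumes scikit-learn is installed. The synthetic data set, the choice of `SelectKBest` with `k=10`, and the logistic regression classifier are illustrative assumptions, not choices made elsewhere in this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic data, used purely for illustration (an assumption of this sketch)
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Fit every data-dependent step on the training folds only
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)

    selector = SelectKBest(score_func=f_classif, k=10).fit(X_train_scaled, y_train)
    X_train_selected = selector.transform(X_train_scaled)

    clf = LogisticRegression(max_iter=1000).fit(X_train_selected, y_train)

    # The held-out fold is only transformed, never used for fitting
    X_test_selected = selector.transform(scaler.transform(X_test))
    fold_scores.append(clf.score(X_test_selected, y_test))

print(f"Mean CV accuracy: {sum(fold_scores) / len(fold_scores):.3f}")
```

Each fold gets its own scaler, selector, and classifier, so the selected features may differ from fold to fold. That is expected: cross-validation evaluates the whole procedure, not one fixed feature set.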

**The Mistake**: Where feature selection and cross-validation are concerned, the most common mistake is to apply feature selection to the full data set, construct a classifier using the selected features, and then use cross-validation to evaluate that classifier. This practice leads to selection bias: the classifier is evaluated on samples that were already used to select its features, so the cross-validation estimate is optimistically biased.
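The effect of this mistake can be illustrated with a small experiment on pure noise, where the labels are independent of the features and no classifier can truly do better than chance. The sketch below assumes scikit-learn is installed. The sample size, the number of features, `k=20`, and the classifier are arbitrary illustrative choices. It compares selecting features on the full data before cross-validation with performing the selection inside a `Pipeline`, so that it is refitted within each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: 50 samples, 5000 features, labels unrelated to the features
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5000))
y = np.repeat([0, 1], 25)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Wrong: feature selection sees all samples, including every future held-out fold
X_selected = SelectKBest(score_func=f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=cv
).mean()

# Right: selection is part of the pipeline and is refitted on each training split
pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"Selection outside CV (biased): {biased:.2f}")
print(f"Selection inside CV (honest):  {honest:.2f}")
```

With settings like these, the first estimate typically comes out well above chance even though there is nothing to learn, while the second stays near 0.5. The exact numbers depend on the random seed and the chosen classifier.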