# Target Leakage in Machine Failure Prediction

In real-world machine learning projects, it is common to find features that appear to improve model performance, but only because they contain information that would not be available at prediction time. This problem is known as **target leakage**.

In this notebook, we explore target leakage using a simulated machine failure dataset. The goal is to predict whether a machine will fail based on available sensor and usage data.

We will:

- Train a model using **all features**, including those that leak future information
- Observe artificially high accuracy caused by leakage
- Retrain the model using **only valid features**
- Compare the results and discuss why leakage is harmful in practice

By the end of this notebook, you will understand how target leakage can mislead model evaluation and how to build models that generalize better to real-world scenarios.

## Import Data

In [1]:
import pandas as pd
df = pd.read_csv("Machine_Failure_Dataset.csv")
display(df)

| Unnamed: 0 | machine_id | operating_hours | temperature | vibration_level | pressure | repair_cost | downtime_minutes | machine_failed |
|---|---|---|---|---|---|---|---|---|
| 0 | M_000 | 7370 | 78.3 | 0.38 | 106.7 | 0.00 | 0.0 | 0 |
| 1 | M_001 | 960 | 84.8 | 0.57 | 111.6 | 0.00 | 0.0 | 0 |
| 2 | M_002 | 5490 | 70.2 | 0.40 | 86.1 | 0.00 | 0.0 | 0 |
| 3 | M_003 | 5291 | 73.1 | 0.58 | 99.1 | 0.00 | 0.0 | 0 |
| 4 | M_004 | 5834 | 63.9 | 0.62 | 51.4 | 1545.49 | 387.1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | M_095 | 491 | 77.3 | 0.60 | 97.6 | 0.00 | 0.0 | 0 |
| 96 | M_096 | 5992 | 77.9 | 0.35 | 99.7 | 2726.64 | 135.2 | 1 |
| 97 | M_097 | 3661 | 67.9 | 0.45 | 85.0 | 0.00 | 0.0 | 0 |
| 98 | M_098 | 6284 | 93.7 | 0.63 | 99.7 | 0.00 | 0.0 | 0 |


## First Trial (train a model)

In this step, we train a logistic regression model using all available features in the dataset, including the leaky ones. Only the `machine_id` identifier is dropped.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

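# machine_id is an identifier, not a predictive feature, so drop it.
df = df.drop(columns=["machine_id"])

# Use every remaining column as a feature (this includes the leaky ones).
X = df.drop(columns=["machine_failed"])
y = df["machine_failed"]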

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", round(accuracy, 3))

Accuracy: 1.0


## 100% Accuracy? How?

The model achieved perfect accuracy (1.0) on the test data. A score this high should raise suspicion rather than confidence.

However, some of the features are only known **after** a machine failure occurs. For example, `repair_cost` and `downtime_minutes` are recorded once the failure has already happened, and in the rows shown above they are nonzero only for machines that failed. Including them during training introduces **target leakage**.

As a result, the model's performance appears better than it actually is. In a real-world deployment, these features would not be available when trying to predict failure, making this model unreliable in practice.
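To make the leak concrete, the sketch below rebuilds only the nine rows visible in the preview (not the full dataset) and applies a hypothetical one-line rule: predict a failure whenever `repair_cost` is greater than zero. The names `sample`, `rule_pred`, and `rule_accuracy` are illustrative and not part of the original notebook.

```python
import pandas as pd

# The nine rows visible in the DataFrame preview (not the full dataset).
sample = pd.DataFrame({
    "repair_cost": [0.00, 0.00, 0.00, 0.00, 1545.49, 0.00, 2726.64, 0.00, 0.00],
    "downtime_minutes": [0.0, 0.0, 0.0, 0.0, 387.1, 0.0, 135.2, 0.0, 0.0],
    "machine_failed": [0, 0, 0, 0, 1, 0, 1, 0, 0],
})

# Hypothetical rule: predict failure whenever a repair cost was recorded.
rule_pred = (sample["repair_cost"] > 0).astype(int)

# Fraction of preview rows where the rule agrees with the label.
rule_accuracy = (rule_pred == sample["machine_failed"]).mean()

print("Rule accuracy on preview rows:", rule_accuracy)
```

On these preview rows the rule agrees with the label every time, which suggests that the model does not need to learn anything about the sensors: the leaky column alone can stand in for the target. Whether this holds for every row of the full dataset would need to be checked on the real file.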

## What Happens Without Leakage?

To understand the real performance of the model, we will now train the same logistic regression model using **only the features that are available before a failure happens**.

This includes:

- `operating_hours`
- `temperature`
- `vibration_level`
- `pressure`

By removing the leaky features, we aim to simulate a realistic scenario where the model predicts failures based only on real-time sensor and usage data.
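One way to make this selection explicit is an allow-list of columns instead of a list of columns to drop, so that a leaky column added to the data later is not picked up by accident. The following is a minimal sketch under that assumption; `VALID_FEATURES`, `TARGET`, and `select_valid_features` are hypothetical names, and the two demo rows are copied from the preview.

```python
import pandas as pd

# Features known before a failure occurs (taken from the list above).
VALID_FEATURES = ["operating_hours", "temperature", "vibration_level", "pressure"]
TARGET = "machine_failed"


def select_valid_features(frame: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Return (X, y) using only the allow-listed feature columns."""
    missing = [c for c in VALID_FEATURES + [TARGET] if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return frame[VALID_FEATURES].copy(), frame[TARGET].copy()


# Two rows copied from the preview, including the leaky columns.
demo = pd.DataFrame({
    "operating_hours": [7370, 5834],
    "temperature": [78.3, 63.9],
    "vibration_level": [0.38, 0.62],
    "pressure": [106.7, 51.4],
    "repair_cost": [0.00, 1545.49],
    "downtime_minutes": [0.0, 387.1],
    "machine_failed": [0, 1],
})

X_valid, y_valid = select_valid_features(demo)
print(list(X_valid.columns))
```

The resulting `X_valid` contains only the four pre-failure columns, so `repair_cost` and `downtime_minutes` cannot reach the model even if they remain in the source frame.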

## Second Trial

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = df.drop(columns=["repair_cost", "downtime_minutes", "machine_failed"])
y = df["machine_failed"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", round(accuracy, 3))

Accuracy: 0.55


## Realistic Accuracy

After removing the leaky features, the model’s accuracy dropped to 0.55.

This is expected. Without access to future information, the model has to rely only on current sensor readings and operational data. This gives us a more realistic estimate of how the model would perform in a real-world setting.

Even though the accuracy is lower, this version of the model is much more trustworthy. It reflects the actual performance we can expect when using the model in production to make predictions before a failure occurs.
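A single accuracy figure from one small test split is hard to interpret without a reference point. One common check, sketched below, is to compare the model against a majority-class baseline using cross-validation. This is not part of the original notebook: the data here is synthetic (generated with NumPy as a stand-in, since the CSV is not reproduced in this text), and `X_demo`, `y_demo`, `baseline`, and `clf` are illustrative names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: four numeric features and a binary label that
# depends only loosely on the first feature. This is NOT the notebook's data.
rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=n_samples) > 0.5).astype(int)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validated accuracy for both estimators.
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")
clf_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

print("Baseline mean accuracy:", round(baseline_scores.mean(), 3))
print("Model mean accuracy:   ", round(clf_scores.mean(), 3))
```

Applied to the real dataset, the same comparison would show whether the leak-free model's 0.55 is meaningfully better than always predicting the majority class, or whether the valid features carry little signal.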