# Data Cleaning and Validation – AI4I 2020 Predictive Maintenance

## Objective
Validate data quality and prepare a clean dataset for modeling by checking target integrity, missing or inconsistent values, duplicate records, and sensor value ranges, while preserving the raw dataset unchanged.


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/cleaned/ai4i2020_cleaned.csv")
df.shape


(10000, 14)

In [2]:
df["Machine failure"].unique()


array([0, 1])

In [3]:
df["Machine failure"].value_counts()


Machine failure
0    9661
1     339
Name: count, dtype: int64

The target variable is binary and contains no unexpected values: 9,661 non-failure observations and 339 failure events, a failure rate of 3.39%. This is a highly imbalanced classification problem, which is typical of predictive maintenance data. The target is valid for supervised classification and requires no cleaning.
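The target check above can be wrapped into a small reusable helper. The following is a minimal sketch, not part of the original notebook: the function name `summarize_binary_target` is hypothetical, and the demo series is synthetic data built only to match the class counts reported above.

```python
import pandas as pd


def summarize_binary_target(target: pd.Series) -> dict:
    """Validate a 0/1 target column and report its class balance."""
    # Any value other than 0 or 1 is treated as an integrity problem.
    unexpected = set(target.dropna().unique()) - {0, 1}
    if unexpected:
        raise ValueError(f"Unexpected target values: {sorted(unexpected)}")

    counts = target.value_counts()
    n_pos = int(counts.get(1, 0))
    n_neg = int(counts.get(0, 0))
    return {
        "negatives": n_neg,
        "positives": n_pos,
        "positive_rate": n_pos / len(target),
    }


# Synthetic stand-in with the same class counts as the real target column.
demo_target = pd.Series([0] * 9661 + [1] * 339, name="Machine failure")
summary = summarize_binary_target(demo_target)
print(summary)
```

On the real data, the call would be `summarize_binary_target(df["Machine failure"])`. Raising on unexpected values, rather than only printing them, makes the check fail loudly if the notebook is rerun on a changed file.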


In [4]:
df.isna().sum().sort_values(ascending=False)


UDI                        0
Product ID                 0
Type                       0
Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
Machine failure            0
TWF                        0
HDF                        0
PWF                        0
OSF                        0
RNF                        0
dtype: int64

Missing values were counted for every column to assess the need for imputation or row removal. None were found in any of the 14 columns, so no imputation or row removal was required at this stage.
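Note that `isna()` only counts values that pandas already parsed as missing. A complementary check, sketched below, looks for placeholder strings that can hide in text columns. This is an illustrative addition rather than a step from the original notebook: the helper name `count_disguised_missing` and the placeholder list are assumptions, and the demo frame is synthetic.

```python
import pandas as pd

# Assumed placeholder tokens (compared after stripping and lower-casing).
# Adjust to whatever conventions the data source is known to use.
PLACEHOLDERS = {"", "?", "-", "unknown", "missing"}


def count_disguised_missing(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder strings in each non-numeric column."""
    counts = {}
    for name in frame.columns:
        col = frame[name]
        if pd.api.types.is_numeric_dtype(col):
            continue  # numeric columns cannot hold these string tokens
        cleaned = col.astype(str).str.strip().str.lower()
        counts[name] = int(cleaned.isin(PLACEHOLDERS).sum())
    return pd.Series(counts, dtype="int64")


# Synthetic example: one whitespace-only value and one "?" in a text column.
demo = pd.DataFrame(
    {
        "Type": ["L", "M", "  ", "?", "H"],
        "Torque [Nm]": [40.0, 41.2, 39.5, 38.0, 42.1],
    }
)
disguised = count_disguised_missing(demo)
print(disguised)
```

On the real data, `count_disguised_missing(df)` would report on the text columns only (here, presumably `Product ID` and `Type`).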


In [5]:
df.duplicated().sum()


np.int64(0)

No fully duplicated rows were found, so no records were removed. Because the comparison covers every column, including the UDI identifier, it flags only rows that are identical in all 14 fields.
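A full-row comparison cannot flag repeated measurements when each row carries its own identifier. A stricter variant, sketched below, ignores identifier columns before comparing. It is an illustration only: treating `UDI` and `Product ID` as identifiers is an assumption, the helper name is hypothetical, and the demo rows are synthetic.

```python
import pandas as pd

# Assumption: these columns identify a row or product rather than describe
# the operating conditions, so they are excluded from the comparison.
ID_COLUMNS = ["UDI", "Product ID"]


def count_duplicates_ignoring_ids(frame: pd.DataFrame, id_columns=ID_COLUMNS) -> int:
    """Count rows whose non-identifier columns repeat an earlier row."""
    feature_cols = [c for c in frame.columns if c not in id_columns]
    return int(frame.duplicated(subset=feature_cols).sum())


# Synthetic example: rows 2 and 3 share all feature values but differ in IDs.
demo = pd.DataFrame(
    {
        "UDI": [1, 2, 3],
        "Product ID": ["M14860", "L47181", "L47182"],
        "Type": ["M", "L", "L"],
        "Torque [Nm]": [42.8, 46.3, 46.3],
    }
)
full_row_dupes = int(demo.duplicated().sum())
feature_dupes = count_duplicates_ignoring_ids(demo)
print(full_row_dupes, feature_dupes)
```

Any rows flagged this way are not necessarily errors: two distinct operations can produce identical readings, so they would need review before removal.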


In [6]:
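# Count readings that violate basic physical lower bounds.
# Temperatures (in kelvin) and rotational speed must be strictly positive;
# torque and tool wear may be zero but not negative.
sensor_checks = {
    "Air temperature <= 0": (df["Air temperature [K]"] <= 0).sum(),
    "Process temperature <= 0": (df["Process temperature [K]"] <= 0).sum(),
    "Rotational speed <= 0": (df["Rotational speed [rpm]"] <= 0).sum(),
    "Torque < 0": (df["Torque [Nm]"] < 0).sum(),
    "Tool wear < 0": (df["Tool wear [min]"] < 0).sum(),
}

pd.Series(sensor_checks)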


Air temperature <= 0        0
Process temperature <= 0    0
Rotational speed <= 0       0
Torque < 0                  0
Tool wear < 0               0
dtype: int64

Each sensor variable was checked against a physical lower bound: temperatures and rotational speed must be positive, and torque and tool wear must be non-negative. No violations were found, so no corrective filtering was applied. These checks cover lower bounds only; implausibly high readings are not tested here.
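The same lower-bound checks can be expressed as a rule table, which keeps the thresholds in one place and makes the check reusable. This is a minimal sketch: the `RULES` layout and the `count_violations` name are hypothetical, and the demo frame is synthetic.

```python
import pandas as pd

# One rule per sensor column. "strict" means the value must be strictly
# greater than the minimum; otherwise equality with the minimum is allowed.
RULES = {
    "Air temperature [K]": {"min": 0, "strict": True},
    "Process temperature [K]": {"min": 0, "strict": True},
    "Rotational speed [rpm]": {"min": 0, "strict": True},
    "Torque [Nm]": {"min": 0, "strict": False},
    "Tool wear [min]": {"min": 0, "strict": False},
}


def count_violations(frame: pd.DataFrame, rules=RULES) -> pd.Series:
    """Count, per column, the readings that break its lower-bound rule."""
    counts = {}
    for column, rule in rules.items():
        if column not in frame.columns:
            continue  # skip rules for columns the frame does not contain
        values = frame[column]
        bad = values <= rule["min"] if rule["strict"] else values < rule["min"]
        counts[column] = int(bad.sum())
    return pd.Series(counts, dtype="int64")


# Synthetic example: one zero rpm reading and one negative torque reading.
demo = pd.DataFrame(
    {
        "Rotational speed [rpm]": [1500, 0, 1420],
        "Torque [Nm]": [40.1, -2.0, 0.0],
    }
)
violations = count_violations(demo)
print(violations)
```

Upper bounds could be added to the same table if a plausible maximum is known for a sensor; none are assumed here.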


In [7]:
df_cleaned = df.copy()
df_cleaned.shape


(10000, 14)

In [8]:
df_cleaned.to_csv("../data/cleaned/ai4i2020_cleaned.csv", index=False)


The validation checks found no missing values, duplicates, or invalid sensor readings, so the cleaned dataset is an unmodified copy of the input: all 10,000 observations and 14 columns are retained. It was saved for downstream feature engineering and modeling.
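In the cells above, the read path and the write path are the same file. One way to guard the raw file against accidental overwrites is sketched below. The helper name `save_cleaned` and the demo file names are hypothetical, and the example runs in a temporary directory rather than the project's `data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(frame: pd.DataFrame, raw_path, cleaned_path) -> Path:
    """Write the cleaned frame to its own file and verify the round trip."""
    raw_path, cleaned_path = Path(raw_path), Path(cleaned_path)
    # Refuse to write on top of the raw input.
    if raw_path.resolve() == cleaned_path.resolve():
        raise ValueError("cleaned_path must differ from raw_path")

    cleaned_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(cleaned_path, index=False)

    # Reload to confirm that no rows or columns were lost on disk.
    reloaded = pd.read_csv(cleaned_path)
    if reloaded.shape != frame.shape:
        raise RuntimeError("Saved file does not match the in-memory frame")
    return cleaned_path


# Demonstration with synthetic data in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "demo_raw.csv"
    cleaned = Path(tmp) / "cleaned" / "demo_cleaned.csv"
    raw.parent.mkdir(parents=True)

    demo = pd.DataFrame({"UDI": [1, 2], "Torque [Nm]": [42.8, 46.3]})
    demo.to_csv(raw, index=False)
    raw_before = raw.read_bytes()

    out = save_cleaned(demo.copy(), raw, cleaned)
    raw_unchanged = raw.read_bytes() == raw_before
    round_trip_shape = pd.read_csv(out).shape

print(raw_unchanged, round_trip_shape)
```

Comparing the raw file's bytes before and after the save is a simple way to confirm that the cleaning step left the raw data untouched.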


## Cleaning Decisions Summary

- The raw dataset was preserved unchanged to maintain data integrity.
- No missing values or duplicate records were identified.
- No sensor readings violated the physical lower bounds checked (positive temperatures and rotational speed, non-negative torque and tool wear).
- Extreme values observed during EDA were retained, as they represent genuine operating conditions rather than data errors.
- A cleaned dataset was generated and saved for subsequent modeling stages.
