# Exploratory Data Analysis - Predictive Maintenance (AI4I 2020)

**Goal:** Predict if a machine will fail within the next 24 hours.

This notebook explores the dataset structure, distributions, and failure patterns.

**Setup:**

1. **Close this notebook** (or stop the kernel).
2. In a terminal, run `.\venv\Scripts\Activate.ps1`, then `pip install -r requirements.txt`.
3. Reopen the notebook and select the kernel **"Python 3.11 (Predictive Maintenance)"**.

In [11]:
# Run this cell to register this Python env as the notebook kernel (connects venv + requirements.txt)
import subprocess
import sys
result = subprocess.run(
    [sys.executable, "-m", "ipykernel", "install", "--user",
     "--name=pm-env", "--display-name=Python 3.11 (Predictive Maintenance)"],
    capture_output=True, text=True
)
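if result.returncode == 0:
    print("Kernel registered. Select 'Python 3.11 (Predictive Maintenance)' via Select Kernel, then reload.")
else:
    # Show the actual error; a missing ipykernel package is the most likely cause
    print(result.stderr.strip())
    print("If ipykernel is missing, run in terminal: pip install ipykernel")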


Kernel registered. Select 'Python 3.11 (Predictive Maintenance)' via Select Kernel, then reload.


In [14]:
# INSTALL PACKAGES (run in terminal - pip fails from notebook due to file lock):
# .\venv\Scripts\Activate.ps1
# pip install -r requirements.txt
# Then restart kernel and run cells below.

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not install packages due to an OSError: [WinError 32] The process cannot access the file because it is being used by another process: 'c:\\Users\\My-Pc\\Desktop\\Infotact_project\\venv\\Lib\\site-packages\\scipy\\odr\\_odrpack.py'
Check the permissions.



In [17]:
# Verify Python 3.11 & environment (requirements.txt packages must be installed in this kernel)
import sys
print(f"Python {sys.version}")
print(f"Executable: {sys.executable}")

Python 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Executable: c:\Users\My-Pc\Desktop\Infotact_project\venv\Scripts\python.exe
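
Checking the interpreter path does not confirm that the packages are installed in it. A minimal sketch of an import check follows, assuming the four packages imported in the next cell are the ones that matter (the contents of `requirements.txt` are not shown here, so it may list more). The helper name `missing_packages` is hypothetical.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages imported later in this notebook (assumed list; requirements.txt may contain more)
required = ["pandas", "numpy", "matplotlib", "seaborn"]
missing = missing_packages(required)
if missing:
    print("Missing in this kernel:", ", ".join(missing))
    print("Install from a terminal: pip install -r requirements.txt")
else:
    print("All required packages are importable.")
```

If anything is reported missing, the kernel is probably not the venv you installed into; compare `sys.executable` above with the venv path.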


In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.style.use('ggplot')
sns.set_palette('husl')

ModuleNotFoundError: No module named 'matplotlib'

In [21]:
# Load data
df = pd.read_csv('archive/ai4i2020.csv')
df.head(10)

NameError: name 'pd' is not defined

In [None]:
# Basic info
print("Shape:", df.shape)
print("\nData types:")
df.dtypes

In [None]:
# Missing values
df.isnull().sum()

In [None]:
# Statistical summary
df.describe()

In [None]:
# Categorical columns
print("Type distribution:")
print(df['Type'].value_counts())
print("\nProduct ID prefix counts (first char = quality variant):")
print(df['Product ID'].str[0].value_counts())

## Failure Analysis

In [None]:
# Machine failure distribution
failure_counts = df['Machine failure'].value_counts()
print("Machine failure:")
print(failure_counts)
print(f"\nFailure rate: {df['Machine failure'].mean()*100:.2f}%")

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
failure_counts.plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'])
axes[0].set_title('Machine Failure Distribution')
axes[0].set_xticklabels(['No Failure', 'Failure'], rotation=0)

# Failure modes
failure_modes = ['TWF', 'HDF', 'PWF', 'OSF', 'RNF']
mode_counts = df[failure_modes].sum()
mode_counts.plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Failure Mode Counts')
plt.tight_layout()
plt.show()

In [None]:
# Failure distribution over UDI. The dataset has no timestamps, so reading
# UDI order as time is an assumption.
fig, ax = plt.subplots(figsize=(14, 4))
ax.scatter(df['UDI'], df['Machine failure'], alpha=0.3, s=5, c=df['Machine failure'], cmap='RdYlGn_r')
ax.set_xlabel('UDI (sequential index)')
ax.set_ylabel('Machine Failure')
ax.set_title('Failures by UDI (proxy for time)')
plt.tight_layout()
plt.show()

## Feature Distributions

In [None]:
# Numeric features
numeric_cols = ['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 
                'Torque [Nm]', 'Tool wear [min]']
df[numeric_cols].hist(bins=50, figsize=(14, 10), layout=(2, 3))
plt.suptitle('Numeric Feature Distributions', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Box plots: features by failure status
fig, axes = plt.subplots(2, 3, figsize=(14, 10))
axes = axes.flatten()
for i, col in enumerate(numeric_cols):
    df.boxplot(column=col, by='Machine failure', ax=axes[i])
    axes[i].set_title(col)
    axes[i].set_xlabel('Failure')
axes[-1].axis('off')
plt.suptitle('Features by Machine Failure Status', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix
corr = df[numeric_cols + ['Machine failure']].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0, square=True)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

## Derived Features (Domain Knowledge)

From the dataset paper:
- **Power** = Torque × Rotational speed (converted from rpm to rad/s) → related to PWF
- **Temp diff** = Process temp - Air temp → related to HDF (fails if < 8.6 K and rotational speed is below 1380 rpm)
- **Overstrain** = Tool wear × Torque → related to OSF
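
To make the units concrete, here is a small worked sketch of the three formulas on a single reading. The values are made up for illustration and are not taken from the dataset, and the function names are hypothetical.

```python
import math

def power_watts(torque_nm, speed_rpm):
    """Mechanical power in watts: torque times angular speed in rad/s."""
    omega_rad_s = speed_rpm * 2 * math.pi / 60  # rpm -> rad/s
    return torque_nm * omega_rad_s

def temp_diff_k(process_temp_k, air_temp_k):
    """Process temperature minus air temperature, in kelvin."""
    return process_temp_k - air_temp_k

def overstrain_proxy(tool_wear_min, torque_nm):
    """Tool wear times torque, in min*Nm."""
    return tool_wear_min * torque_nm

# Illustrative reading (assumed values, not a dataset row)
print(f"Power: {power_watts(40.0, 1500.0):.1f} W")            # Power: 6283.2 W
print(f"Temp diff: {temp_diff_k(308.5, 298.0):.1f} K")        # Temp diff: 10.5 K
print(f"Overstrain: {overstrain_proxy(100.0, 40.0):.0f} min*Nm")  # Overstrain: 4000 min*Nm
```

The factor `2 * pi / 60` is the same conversion used for `Power_W` in the cell below.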

In [None]:
# Create derived features
df_derived = df.copy()
df_derived['Power_W'] = df_derived['Torque [Nm]'] * df_derived['Rotational speed [rpm]'] * (2 * np.pi / 60)
df_derived['Temp_diff_K'] = df_derived['Process temperature [K]'] - df_derived['Air temperature [K]']
df_derived['Overstrain_proxy'] = df_derived['Tool wear [min]'] * df_derived['Torque [Nm]']

derived_cols = ['Power_W', 'Temp_diff_K', 'Overstrain_proxy']
df_derived[derived_cols].describe()

In [None]:
# Derived features by failure
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for i, col in enumerate(derived_cols):
    df_derived.boxplot(column=col, by='Machine failure', ax=axes[i])
    axes[i].set_title(col)
plt.suptitle('Derived Features by Failure Status', y=1.02)
plt.tight_layout()
plt.show()

## Key Insights

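1. **Class imbalance:** ~3.4% failure rate, so training needs SMOTE or class weights
2. **Sequential data:** UDI is a sequential row index with no timestamps; if it is treated as chronological (an assumption), use a time-based train/test split
3. **Target:** Create `failure_in_24h` = 1 if any failure occurs in the next 24 rows; this matches the 24-hour goal only if each row represents one hour (an assumption)
4. **Features:** `Power_W`, `Temp_diff_K`, and `Overstrain_proxy` are domain-relevant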
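
Insights 1 to 3 can be sketched together as follows. This is a rough outline under the assumptions stated above (row order stands in for time, one row per hour), not a final pipeline. The helper names are hypothetical, and the demo frame is synthetic, with a horizon of 3 rows so the result is easy to check by hand.

```python
import pandas as pd

def add_future_failure_label(frame, target="Machine failure", horizon=24, out="failure_in_24h"):
    """Label a row 1 if any failure occurs in the next `horizon` rows (current row excluded)."""
    labeled = frame.copy()
    future = labeled[target].shift(-1).fillna(0)  # value of the following row
    # Reverse, take a trailing rolling max, reverse back: this yields a forward-looking window.
    forward_max = future[::-1].rolling(window=horizon, min_periods=1).max()[::-1]
    labeled[out] = forward_max.astype(int)
    return labeled

def chronological_split(frame, test_fraction=0.2):
    """Keep row order: the first rows are train, the last rows are test (no shuffling)."""
    n_test = int(round(len(frame) * test_fraction))
    cut = len(frame) - n_test
    return frame.iloc[:cut], frame.iloc[cut:]

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = labels.value_counts()
    return {cls: len(labels) / (len(counts) * cnt) for cls, cnt in counts.items()}

# Synthetic demo data (not from ai4i2020.csv)
demo = pd.DataFrame({"UDI": range(1, 11),
                     "Machine failure": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]})
labeled = add_future_failure_label(demo, horizon=3)
print(labeled["failure_in_24h"].tolist())  # [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

train, test = chronological_split(labeled, test_fraction=0.2)
weights = balanced_class_weights(train["failure_in_24h"])
print(len(train), len(test), weights)
```

Two caveats: the last `horizon` rows only see a truncated look-ahead window, and training rows just before the cut look ahead into test rows. In a real run it may be safer to drop both groups.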