# Steel Plates Fault Detection – Data Mining Academic Report

This academic report covers Data Mining work from **two full projects** completed in this course:
- `ml-project` – Cardiovascular Disease Prediction (Data Mining + Machine Learning)
- `Steel_Fault_Project` – Steel Plates Fault Detection (Data Mining + Machine Learning + Optimization)

**Institution:** Istanbul Nişantaşı University  
**Course:** Data Mining (Veri Madenciliği)  
**Instructor:** Dr. Öğr. Gülsüm Şanal  
**Date:** December 2025

---

## Project Team

Contributors (alphabetical order):

- **Aigul Salimgareeva** (20251555001)
- **Nima Taghipour Chokami** (20241555012)
- **Rayan Aksu** (20251556003)

*All team members contributed equally to this project.*

---

## 1. Projects Overview

During the semester, we worked on **two complete projects**:

1. **Steel Plates Fault Detection – Data Mining Project** (this report focuses on the Data Mining part of `Steel_Fault_Project_v2`)
2. **Cardiovascular Disease Prediction – Data Mining Part** of our original **ml-project** (cardiovascular disease dataset from Kaggle)

Both projects follow the same data mining methodology learned in the course: start from raw data, perform systematic **Exploratory Data Analysis (EDA)**, **preprocessing**, and **feature engineering**, and produce a clean, information-rich dataset for machine learning.

In the following sections, we first describe the **Steel Fault Detection** Data Mining work (Section 2), and then we summarize the **Cardiovascular Disease** Data Mining work from the ml-project (Section 3).

---

## 2. Steel Fault Detection – Data Mining (Current Project)

### 2.1 Goal and Dataset

**Goal:** Build a high-quality, engineered feature set that describes the **geometry**, **intensity**, and **texture** of steel plates for fault classification.

**Context:**

- Project: `Project_1_DataMining` in `Steel_Fault_Project_v2`
- Raw file: `data/raw/steel_plates_fault.csv`
- Task: Multi-class classification of **fault types** on steel plates
- Tools: Python (pandas, numpy, matplotlib, seaborn), scikit-learn, Jupyter Notebook

The raw dataset contains geometric and pixel-based measurements extracted from steel plate images, together with a target column indicating different fault types (e.g., scratches, patches, inclusions).

### 2.2 Exploratory Data Analysis (01_data_exploration)

In `01_data_exploration_EN.ipynb` and `01_data_exploration_TR.ipynb` we:

- Inspected the **structure** and **data types** (continuous vs categorical features, target classes)
- Examined **distributions** of geometric and pixel-based features
- Visualized **correlations** between features and fault classes
- Checked **data quality**:
  - Missing values
  - Duplicates
  - Obviously impossible values (e.g., negative areas, zero widths or heights)
- Analyzed **class distribution** to see which fault types are rare or dominant
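
The checks listed above can be gathered into one small helper. This is a minimal sketch rather than the notebook code: the function name `quality_report`, the toy dataframe, and its column names (`width`, `height`, `fault`) are assumptions made for illustration only.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, target: str, positive_cols: list[str]) -> dict:
    """Summarize missing values, duplicates, non-positive sizes and class shares."""
    return {
        "n_rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicates": int(df.duplicated().sum()),
        # Size-like columns should be strictly positive; count the violations
        "n_non_positive": {c: int((df[c] <= 0).sum()) for c in positive_cols},
        # Relative frequency of each fault class
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Made-up rows, only to show the call (not project data)
toy = pd.DataFrame(
    {
        "width": [10, 10, 0, 7],
        "height": [4, 4, 3, None],
        "fault": ["Scratch", "Scratch", "Patch", "Scratch"],
    }
)
report = quality_report(toy, target="fault", positive_cols=["width", "height"])
```

Returning a plain dictionary keeps the result easy to print in a notebook cell or to compare before and after cleaning.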

**Key EDA findings:**

- The dataset has many geometric features (width, height, area, perimeter, bounding box measures)
- Some classes are less frequent, which is important for later model evaluation
- The raw features are on very different scales (e.g., areas vs ratios), which motivates scaling before ML

### 2.3 Preprocessing (02_data_preprocessing)

In `02_data_preprocessing_EN/TR` we transformed the raw data into a clean, analysis-ready dataset:

- Detected and removed **duplicates** if present
- Handled **missing values**: the dataset had none or very few, and any that appeared were removed or imputed
- Validated **value ranges**:
  - Enforced positive values for sizes and areas
  - Removed or corrected clearly invalid records (e.g., zero width with positive area)
- Applied **scaling** to continuous features (e.g., `StandardScaler`) for downstream models that are sensitive to scale
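
A minimal sketch of the duplicate, missing-value, and range checks above, assuming that rows with missing or non-positive sizes are dropped rather than imputed. The helper name `clean_geometry` and the toy columns `width` and `area` are hypothetical.

```python
import pandas as pd


def clean_geometry(df: pd.DataFrame, positive_cols: list[str]) -> pd.DataFrame:
    """Drop duplicates, rows with missing sizes, and rows with non-positive sizes."""
    cleaned = df.drop_duplicates()
    # Assumption: missing sizes are dropped here; imputation is the other option
    cleaned = cleaned.dropna(subset=positive_cols)
    # Keep only rows where every size-like column is strictly positive
    valid = (cleaned[positive_cols] > 0).all(axis=1)
    return cleaned.loc[valid].reset_index(drop=True)


# Made-up rows, only to show the call (not project data)
toy = pd.DataFrame(
    {
        "width": [10, 10, 0, 7, 5],
        "area": [40, 40, 12, None, 15],
    }
)
cleaned_toy = clean_geometry(toy, positive_cols=["width", "area"])
```

On the toy frame this removes one duplicate, one row with a missing area, and one row with zero width, leaving two rows.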

At the end of preprocessing, we saved a clean dataset:

- `data/processed/steel_plates_preprocessed.csv`
- Additionally, train/test splits (`X_train.npy`, `X_test.npy`, `y_train.npy`, `y_test.npy`) for reproducible ML experiments
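
The split and scaling steps could look roughly like the sketch below. The report does not state the split ratio, the random seed, or whether the scaler was fitted before or after splitting, so `test_size=0.2`, `seed=42`, and fitting `StandardScaler` on the training part only (a common way to avoid leaking test statistics) are assumptions. The function name `split_scale_save` is hypothetical, and labels are assumed to be integer-encoded so the `.npy` files load without pickling.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_scale_save(df, target, out_dir, test_size=0.2, seed=42):
    """Stratified split, scaler fitted on the training part only, arrays saved as .npy."""
    X = df.drop(columns=[target]).to_numpy(dtype=float)
    y = df[target].to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Fit on training rows only, then apply the same transform to both parts
    scaler = StandardScaler().fit(X_train)
    arrays = {
        "X_train": scaler.transform(X_train),
        "X_test": scaler.transform(X_test),
        "y_train": y_train,
        "y_test": y_test,
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, arr in arrays.items():
        np.save(out / f"{name}.npy", arr)
    return arrays


# Made-up data written to a temporary folder, only to show the call
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "area": rng.uniform(10, 1000, 20),
        "ratio": rng.uniform(0, 1, 20),
        "fault": [0, 1] * 10,
    }
)
out_dir = tempfile.mkdtemp()
arrays = split_scale_save(toy, "fault", out_dir)
```

Stratifying on the target keeps rare fault classes represented in both parts, which matters given the class imbalance noted in the EDA.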

**Preprocessing summary:**

- Removed a small number of invalid or inconsistent records
- Ensured all geometric features had physically meaningful ranges
- Prepared a stable base for feature engineering and model training

### 2.4 Feature Engineering (03_feature_engineering)

The core of this Data Mining project is **feature engineering**. Using the clean steel plates dataset, we created 23 new, higher-level features that describe the **shape** and **intensity** of defects more effectively than the raw measurements.

Examples of engineered features:

- **Geometric / Shape Features**
  - Aspect ratio (`width / height`)
  - Bounding box area and area-perimeter ratio
  - Center coordinates (`center_x`, `center_y`)
  - Compactness and extent
  - Shape complexity measures based on combinations of area, perimeter, and bounding box

- **Intensity / Texture Features**
  - Luminosity range and mean luminosity in defect regions
  - Pixel density features that capture how "dense" a defect region is
  - Combined size–luminosity features (e.g., large & bright vs small & dark defects)
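
To make the feature families above concrete, here is a hedged sketch of how a few of them could be computed. The input column names (`X_Minimum`, `X_Maximum`, `Y_Minimum`, `Y_Maximum`, `Pixels_Areas`, `X_Perimeter`, `Y_Perimeter`, `Sum_of_Luminosity`, `Minimum_of_Luminosity`, `Maximum_of_Luminosity`) follow the naming of the public UCI Steel Plates Faults data and are an assumption about the raw file. The formulas, including the perimeter proxy and the output feature names, are illustrative and not necessarily the ones used in `03_feature_engineering`.

```python
import numpy as np
import pandas as pd


def add_shape_intensity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append illustrative shape and intensity features to a copy of df."""
    out = df.copy()
    width = out["X_Maximum"] - out["X_Minimum"]
    height = out["Y_Maximum"] - out["Y_Minimum"]
    area = out["Pixels_Areas"]
    perimeter = out["X_Perimeter"] + out["Y_Perimeter"]  # assumed perimeter proxy
    bbox_area = width * height

    # .where(cond) turns non-positive denominators into NaN instead of dividing by zero
    out["aspect_ratio"] = width / height.where(height > 0)
    out["bbox_area"] = bbox_area
    out["center_x"] = (out["X_Minimum"] + out["X_Maximum"]) / 2
    out["center_y"] = (out["Y_Minimum"] + out["Y_Maximum"]) / 2
    out["extent"] = area / bbox_area.where(bbox_area > 0)
    out["area_perimeter_ratio"] = area / perimeter.where(perimeter > 0)
    out["compactness"] = 4 * np.pi * area / perimeter.where(perimeter > 0) ** 2
    out["luminosity_range"] = out["Maximum_of_Luminosity"] - out["Minimum_of_Luminosity"]
    out["mean_luminosity"] = out["Sum_of_Luminosity"] / area.where(area > 0)
    # One possible size-luminosity interaction term
    out["size_x_luminosity"] = np.log1p(area) * out["mean_luminosity"]
    return out


# Made-up rows; the second one has zero height to show the NaN handling
toy = pd.DataFrame(
    {
        "X_Minimum": [10, 0],
        "X_Maximum": [30, 5],
        "Y_Minimum": [100, 7],
        "Y_Maximum": [110, 7],
        "Pixels_Areas": [150, 5],
        "X_Perimeter": [20, 5],
        "Y_Perimeter": [10, 1],
        "Sum_of_Luminosity": [15000, 500],
        "Minimum_of_Luminosity": [80, 90],
        "Maximum_of_Luminosity": [120, 110],
    }
)
features = add_shape_intensity_features(toy)
```

Ratios are left as `NaN` when the denominator is zero so that degenerate bounding boxes stay visible instead of becoming infinite values.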

We then merged the original and engineered features into a single dataframe and saved:

- `data/processed/steel_plates_engineered.csv` – the main engineered dataset
- `feature_names.txt` – all feature names
- `engineered_feature_names.txt` – only engineered features
- `new_feature_names.txt` – newly created features
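
Writing the dataset and its name lists might be sketched as follows. The helper `save_engineered` is hypothetical. It writes only two of the three name files, because the report does not spell out how `engineered_feature_names.txt` and `new_feature_names.txt` differ, and it assumes the name lists hold one column name per line.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_engineered(df: pd.DataFrame, original_cols: list[str], out_dir) -> list[str]:
    """Save the merged dataframe plus text files listing all and engineered columns."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Anything that was not in the original column list counts as engineered
    engineered_cols = [c for c in df.columns if c not in original_cols]
    df.to_csv(out / "steel_plates_engineered.csv", index=False)
    (out / "feature_names.txt").write_text("\n".join(df.columns) + "\n", encoding="utf-8")
    (out / "engineered_feature_names.txt").write_text(
        "\n".join(engineered_cols) + "\n", encoding="utf-8"
    )
    return engineered_cols


# Made-up frame written to a temporary folder, only to show the call
toy = pd.DataFrame({"width": [20], "height": [10], "aspect_ratio": [2.0]})
out_dir = tempfile.mkdtemp()
engineered = save_engineered(toy, original_cols=["width", "height"], out_dir=out_dir)
```

Keeping the name lists next to the CSV lets later notebooks select original or engineered columns without hard-coding them.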

This engineered dataset (≈50 features: 27 original + 23 engineered) is the **input** for:

- `Project_2_MachineLearning` (model training and evaluation)
- `Project_3_Optimization` (hyperparameter optimization)

**Impact:**

- Encodes geometry and intensity of faults in a more informative way
- Improves the expressive power of models compared to using raw measurements alone
- Creates a clear, documented data pipeline from raw steel data to ML-ready features

---

## 3. Cardiovascular Disease – Data Mining (ml-project)

In the original **ml-project**, the Data Mining part focused on a **medical tabular dataset** (70,000 patient records) for cardiovascular disease prediction. Although the domain is different, the data mining logic is similar.

### 3.1 Exploratory Data Analysis

We:

- Checked dataset shape, data types, missing values, and duplicates
- Analyzed distributions for age (stored in days), height, weight, blood pressure, cholesterol, glucose, smoking, alcohol, and physical activity
- Identified key **data quality issues**:
  - Age stored in **days** instead of years
  - **Outliers** in height, weight, and blood pressure
  - **Impossible values** where diastolic blood pressure ≥ systolic
- Verified that the target variable (cardio vs no cardio) was approximately **balanced** (~50–50)
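
The three data quality checks above can be expressed in a few lines of pandas. This is a sketch under assumptions: the column names `age` (in days), `ap_hi` (systolic), `ap_lo` (diastolic), and `cardio` (target) follow the usual naming of the Kaggle file and are not confirmed by this report, and the function name is hypothetical.

```python
import pandas as pd


def cardio_quality_summary(df: pd.DataFrame) -> dict:
    """Report age in years, impossible blood pressure rows, and target balance."""
    return {
        # Age is stored in days, so convert before summarizing
        "median_age_years": float((df["age"] / 365.25).median()),
        # Diastolic pressure should always be below systolic pressure
        "n_bp_impossible": int((df["ap_lo"] >= df["ap_hi"]).sum()),
        "target_share": df["cardio"].value_counts(normalize=True).to_dict(),
    }


# Made-up rows, only to show the call (not project data)
toy = pd.DataFrame(
    {
        "age": [14610.0, 18262.5, 21915.0, 25567.5],  # 40, 50, 60, 70 years
        "ap_hi": [120, 80, 140, 130],
        "ap_lo": [80, 120, 90, 130],
        "cardio": [0, 1, 1, 0],
    }
)
summary = cardio_quality_summary(toy)
```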

### 3.2 Preprocessing

Main steps:

- Converted **age from days to years** for interpretability
- Removed **outliers** based on **medical knowledge**:
  - Realistic ranges for height, weight, systolic and diastolic blood pressure
- Removed **physiologically impossible** records (e.g., diastolic ≥ systolic)
- Ensured no missing values or duplicates remained
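
A minimal sketch of these cleaning steps is shown below. The numeric ranges in `ASSUMED_RANGES` are placeholders chosen for illustration, not the thresholds used in the ml-project, and the column names (`age`, `height`, `weight`, `ap_hi`, `ap_lo`) and the new `age_years` column are assumptions.

```python
import pandas as pd

# Placeholder limits for illustration only; the project's real thresholds may differ
ASSUMED_RANGES = {
    "height": (120, 220),  # cm
    "weight": (30, 200),   # kg
    "ap_hi": (70, 250),    # systolic, mmHg
    "ap_lo": (40, 150),    # diastolic, mmHg
}


def clean_cardio(df: pd.DataFrame, ranges: dict = ASSUMED_RANGES) -> pd.DataFrame:
    """Convert age to years and drop duplicates, gaps, outliers and impossible BP rows."""
    out = df.drop_duplicates().dropna().copy()
    out["age_years"] = out["age"] / 365.25
    mask = pd.Series(True, index=out.index)
    for col, (low, high) in ranges.items():
        mask &= out[col].between(low, high)
    # Diastolic pressure must be strictly below systolic pressure
    mask &= out["ap_lo"] < out["ap_hi"]
    return out.loc[mask].reset_index(drop=True)


# Made-up rows: only the first one passes every check
toy = pd.DataFrame(
    {
        "age": [18262.5, 18262.5, 21915.0, 14610.0],
        "height": [170, 250, 165, 160],
        "weight": [70, 70, 80, 10],
        "ap_hi": [120, 120, 80, 110],
        "ap_lo": [80, 80, 120, 70],
    }
)
cleaned = clean_cardio(toy)
removed_pct = 100 * (1 - len(cleaned) / len(toy))
```

The same percentage formula applied to the project's reported sizes, `100 * (70_000 - 68_454) / 70_000`, gives about 2.21.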

Final dataset size: **68,454 records**, meaning that 1,546 of the 70,000 records (2.21%) were removed as outliers or impossible values. Class balance was preserved.

### 3.3 Feature Engineering

We engineered **7 medically informed features**:

- BMI and BMI category
- Pulse pressure
- Mean arterial pressure (MAP)
- Blood pressure category (AHA stages)
- Lifestyle risk score (smoke, alcohol, activity)
- Age groups

These features:

- Incorporated domain knowledge about cardiovascular risk
- Had strong correlations with the target
- Significantly improved model performance in the ml-project
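
The seven features in this section could be derived roughly as follows. BMI (weight in kg divided by height in metres squared), pulse pressure (systolic minus diastolic), and MAP (systolic plus twice diastolic, divided by three) are standard formulas. The blood pressure bins follow the 2017 ACC/AHA categories. The lifestyle score weighting, the age bins, the category labels, and the input column names (`height`, `weight`, `ap_hi`, `ap_lo`, `smoke`, `alco`, `active`, `age_years`) are assumptions, not the project's confirmed definitions.

```python
import numpy as np
import pandas as pd


def add_cardio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Append seven illustrative, medically motivated features to a copy of df."""
    out = df.copy()
    sbp, dbp = out["ap_hi"], out["ap_lo"]

    out["bmi"] = out["weight"] / (out["height"] / 100) ** 2
    out["bmi_category"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, np.inf],
        right=False,
        labels=["underweight", "normal", "overweight", "obese"],
    )
    out["pulse_pressure"] = sbp - dbp
    out["map"] = (sbp + 2 * dbp) / 3
    # np.select picks the first matching condition, so order goes from most severe down
    out["bp_category"] = np.select(
        [(sbp >= 140) | (dbp >= 90), (sbp >= 130) | (dbp >= 80), sbp >= 120],
        ["stage_2", "stage_1", "elevated"],
        default="normal",
    )
    # Assumed scoring: one point each for smoking, alcohol, and physical inactivity
    out["lifestyle_risk"] = out["smoke"] + out["alco"] + (1 - out["active"])
    out["age_group"] = pd.cut(
        out["age_years"],
        bins=[0, 40, 50, 60, np.inf],
        right=False,
        labels=["<40", "40-49", "50-59", "60+"],
    )
    return out


# Made-up patients, only to show the call (not project data)
toy = pd.DataFrame(
    {
        "height": [170, 180],
        "weight": [80, 70],
        "ap_hi": [135, 118],
        "ap_lo": [85, 76],
        "smoke": [1, 0],
        "alco": [0, 0],
        "active": [0, 1],
        "age_years": [52, 38],
    }
)
feats = add_cardio_features(toy)
```

In the toy data the first patient lands in `stage_1` with a lifestyle score of 2, while the second is `normal` with a score of 0.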

---

## 4. Comparison and Key Takeaways (Data Mining)

Although the two projects use different datasets and domains, they apply the **same data mining principles**:

- **Domain:**
  - Steel project: industrial **image/geometry** data → engineered **shape and intensity** features
  - Cardio project: **clinical & lifestyle** data → engineered **medical risk** features

- **Preprocessing focus:**
  - Steel: structural validity of geometric and intensity features
  - Cardio: physiologically valid ranges and removal of impossible vital signs

- **Feature Engineering:**
  - Steel: geometry/texture-driven (aspect ratio, compactness, luminosity, etc.)
  - Cardio: medically driven (BMI, MAP, BP categories, lifestyle risk)

In both cases, **Data Mining is the foundation** of the project:

- Without clean and well-engineered data, the later Machine Learning and Optimization steps would likely perform noticeably worse
- The data mining phase applies what we learned in the **Data Mining (Veri Madenciliği)** course in the form of a robust data pipeline

---

## 5. Course Learning Outcomes (Data Mining)

Through these projects, we achieved the following learning outcomes for the **Data Mining** course:

- Systematic **EDA** on real-world datasets (industrial and medical)
- Practical **data cleaning** (outlier removal, impossible values, duplicates)
- **Data transformation** (e.g., age days → years)
- **Feature engineering** based on domain knowledge (geometry and medicine)
- Building reproducible **data pipelines** that feed into ML projects

These outcomes demonstrate how theory from the course can be applied to two very different but realistic problems: industrial fault detection and cardiovascular disease prediction.


---

## Executive Summary

This academic report summarizes the **Data Mining** work from **two full projects**:

1. **Steel Plates Fault Detection – Data Mining Project** (`Steel_Fault_Project` / `Project_1_DataMining`)
2. **Cardiovascular Disease Prediction – Data Mining Part** of the original `ml-project`

In the **Steel Fault** project, we started from the raw `steel_plates_fault.csv` dataset, performed detailed **Exploratory Data Analysis (EDA)**, cleaned the data (removing invalid and inconsistent records), and built a dataset of around **50 features** (27 original + 23 engineered) that capture the **geometry** and **intensity/texture** of surface defects. The final engineered dataset (`steel_plates_engineered.csv`) is the foundation for the Machine Learning and Optimization projects.

In the **Cardiovascular** project, we worked with a 70,000‑record medical dataset from Kaggle, converted age from days to years, removed **outliers** and **impossible values** (e.g., diastolic ≥ systolic), and engineered **7 medically informed features** (BMI, BMI category, MAP, Pulse Pressure, BP categories, Lifestyle Risk Score, Age Groups). The final cleaned dataset (68,454 records) and engineered features significantly improved model performance in the ml-project.

Together, these two projects demonstrate how the concepts from the **Data Mining** course (EDA, preprocessing, and feature engineering) are applied in both an **industrial** (steel faults) and a **medical** (cardiovascular disease) context.

---
