# Steel Plates Fault Detection – Machine Learning Academic Report

This academic report covers Machine Learning work from **two full projects** completed in this course:
- `ml-project` – Cardiovascular Disease Prediction (Data Mining + Machine Learning)
- `Steel_Fault_Project` – Steel Plates Fault Detection (Data Mining + Machine Learning + Optimization)

**Institution:** Istanbul Nişantaşı University  
**Course:** Machine Learning and Pattern Recognition (Makine Öğrenme ve Örüntü Tanıma)  
**Instructor:** Dr. Öğr. Gülsüm Şanal  
**Date:** December 2025

---

## Project Team

Contributors (alphabetical order):

- **Aigul Salimgareeva** (20251555001)
- **Nima Taghipour Chokami** (20241555012)
- **Rayan Aksu** (20251556003)

*All team members contributed equally to this project.*

---

## 1. Projects Overview

During the semester, we worked on **two complete projects** that share a similar Machine Learning pipeline:

1. **Steel Plates Fault Detection – Machine Learning Project** (`Project_2_MachineLearning` in `Steel_Fault_Project_v2`)
2. **Cardiovascular Disease Prediction – Machine Learning Part** of the original **ml-project**

In both projects, we start from a **processed and feature-engineered dataset** (output of the Data Mining phase) and then:

- Split data into train/test sets
- Scale features where necessary
- Train multiple classification algorithms
- Evaluate and compare models using standard metrics
- Perform hyperparameter optimization
- Save the best models for later use

In the following sections, we first describe the **Steel Fault Detection** Machine Learning project (Section 2) and its separate optimization project (Section 3), and then we summarize the **Cardiovascular Disease** Machine Learning work from the ml-project (Section 4).

---

## 2. Steel Fault Detection – Machine Learning (Current Project)

### 2.1 Data Pipeline

The Machine Learning project for steel plates uses the engineered dataset produced by the Data Mining project:

- Input file: `Project_2_MachineLearning/data/processed/steel_plates_engineered.csv`
- Source: `03_feature_engineering_EN/TR` in `Project_1_DataMining`
- Features: 27 original + 23 engineered = 50 total
- Target: fault class (multi-class classification)

The data pipeline can be summarized as:

```
Project_1_DataMining           →   Project_2_MachineLearning
01_data_exploration            →   01_model_training
02_data_preprocessing          →   02_model_evaluation
03_feature_engineering (50 features)
          ↓
steel_plates_engineered.csv → models and metrics
```

### 2.2 Model Training (01_model_training)

In `01_model_training_EN/TR`, we trained and compared **8 classification models** on the engineered steel dataset:

1. **Logistic Regression** – Linear baseline model
2. **Decision Tree** – Rule-based, interpretable model
3. **Random Forest** – Ensemble of decision trees
4. **K-Nearest Neighbors (KNN)** – Instance-based model
5. **Support Vector Machine (SVM)** – Margin-based classifier
6. **Naive Bayes** – Probabilistic classifier
7. **Neural Network (MLPClassifier)** – Multi-layer perceptron
8. **Gradient Boosting** – Boosted trees

**Workflow:**

- Load `steel_plates_engineered.csv`
- Separate features `X` and target `y`
- Create **train/test split** with stratification to preserve class distribution
- Apply `StandardScaler` for algorithms sensitive to scale (LR, KNN, SVM, NN)
- Train each model with reasonable hyperparameters
- Evaluate models using:
  - Accuracy
  - Precision, Recall, F1-Score
  - Cross-validation where appropriate
- Collect all results into a comparison table (`model_comparison_results.csv`)
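
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses a synthetic dataset in place of `steel_plates_engineered.csv`, assumes seven fault classes, trains only two of the eight models, and uses weighted averaging for the multi-class metrics. All of these choices are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered steel dataset
# (assumption: 50 features, 7 classes; the real data is loaded from CSV).
X, y = make_classification(n_samples=700, n_features=50, n_informative=15,
                           n_classes=7, n_clusters_per_class=1, random_state=42)

# Stratified split keeps the class distribution similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training data only, then transform both sets.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Each entry: (estimator, whether it is trained on scaled features).
models = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), True),
    "Random Forest": (RandomForestClassifier(n_estimators=100, random_state=42), False),
}

rows = []
for name, (model, needs_scaling) in models.items():
    X_tr, X_te = (X_train_s, X_test_s) if needs_scaling else (X_train, X_test)
    model.fit(X_tr, y_train)
    y_pred = model.predict(X_te)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0)
    rows.append({"model": name,
                 "accuracy": accuracy_score(y_test, y_pred),
                 "precision": precision, "recall": recall, "f1": f1})

# Comparison table; in the project this is written to model_comparison_results.csv.
results = pd.DataFrame(rows).sort_values("f1", ascending=False).reset_index(drop=True)
print(results)
```

Fitting the scaler on the training split only avoids leaking test-set statistics into training.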

**Model saving:**

At the end of training, we created a `models/` directory and saved:

- `*_model.pkl` – all trained models
- `scaler.pkl` – fitted `StandardScaler`
- `label_encoder.pkl` – encoder for class labels
- `model_comparison_results.csv` – metrics for all models
- `best_model.pkl` – best-performing model selected by our chosen metric

This makes the ML project **reproducible and deployable**.
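
A minimal sketch of this save-and-reload step is shown below. It assumes `joblib` is used for serialization (the report only states that `.pkl` files are written), uses a synthetic dataset with hypothetical class names, and writes to a temporary directory rather than the project's `models/` folder.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic data with hypothetical string labels (assumption for illustration).
X, y_int = make_classification(n_samples=200, n_features=10, n_informative=5,
                               n_classes=3, random_state=0)
class_names = np.array(["fault_a", "fault_b", "fault_c"])
y_raw = class_names[y_int]

# Fit the three artifacts that need to travel together.
label_encoder = LabelEncoder().fit(y_raw)
y = label_encoder.transform(y_raw)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

# Save into a models/ directory (here inside a temporary folder).
models_dir = Path(tempfile.mkdtemp()) / "models"
models_dir.mkdir()
joblib.dump(model, models_dir / "best_model.pkl")
joblib.dump(scaler, models_dir / "scaler.pkl")
joblib.dump(label_encoder, models_dir / "label_encoder.pkl")

# Later: reload everything and predict human-readable labels for new samples.
loaded_model = joblib.load(models_dir / "best_model.pkl")
loaded_scaler = joblib.load(models_dir / "scaler.pkl")
loaded_encoder = joblib.load(models_dir / "label_encoder.pkl")

new_samples = X[:5]
predicted_ids = loaded_model.predict(loaded_scaler.transform(new_samples))
predicted_labels = loaded_encoder.inverse_transform(predicted_ids)
print(predicted_labels)
```

Saving the scaler and label encoder alongside the model matters because a reloaded model expects inputs transformed exactly as they were during training.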

### 2.3 Model Evaluation (02_model_evaluation)

In `02_model_evaluation_EN/TR`, we performed a **deeper analysis** of selected models (especially the best ones, such as Random Forest):

- Generated confusion matrices
- Calculated per-class precision, recall, and F1-scores
- Investigated which classes are harder to predict
- Visualized results using plots in the `figures/` directory

This evaluation step connects the raw metrics (Accuracy, F1, etc.) to **practical interpretation** for fault detection: which defect types are often confused, and where the model performs best or worst.
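
As an illustration of this kind of analysis, the sketch below computes a confusion matrix and per-class scores, then identifies the class with the lowest recall and the most frequent misclassification. It runs on synthetic four-class data (an assumption; the real notebook uses the steel dataset and its fault labels), and plotting is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic multi-class data standing in for the steel dataset (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)

# Per-class recall is the diagonal divided by the row totals.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
hardest_class = int(np.argmin(per_class_recall))

# Largest off-diagonal cell: the most common (true, predicted) confusion.
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
true_cls, pred_cls = np.unravel_index(np.argmax(off_diagonal), off_diagonal.shape)

print(cm)
print(f"Hardest class (lowest recall): {hardest_class}")
print(f"Most frequent confusion: true {true_cls} predicted as {pred_cls}")
```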

---

## 3. Steel Fault Optimization – Model Optimization Project

The optimization phase is implemented as a **separate project**: `Project_3_Optimization`.

### 3.1 Goal

Compare different hyperparameter optimization methods for Random Forest on the engineered steel dataset:

- **Grid Search** – Exhaustive search over a predefined parameter grid
- **Random Search** – Random sampling from parameter distributions

### 3.2 Workflow (01_model_optimization)

- Load `Project_3_Optimization/data/processed/steel_plates_engineered.csv`
- Split into train/test with stratification
- Apply `StandardScaler` where necessary
- Define parameter spaces:
  - For Grid Search: explicit lists of values (e.g., `n_estimators`, `max_depth`, `min_samples_split`)
  - For Random Search: distributions over ranges (e.g., `randint` for `n_estimators`)
- Run 5-fold cross-validation for each optimization method
- Measure:
  - Best cross-validation score
  - Test score on hold-out set
  - Total computation time
- Visualize search results (e.g., scatter plots of mean CV score vs. configuration index)
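
The comparison described above can be sketched with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The parameter values, the `accuracy` scoring choice, the number of random iterations, and the synthetic dataset are all assumptions chosen to keep the example small; they are not the values used in the project.

```python
import time

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic stand-in for the engineered steel dataset (assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

# Grid Search: explicit lists of values (2 x 2 x 2 = 8 configurations).
param_grid = {"n_estimators": [20, 40],
              "max_depth": [5, None],
              "min_samples_split": [2, 5]}

# Random Search: distributions over ranges, sampled n_iter times.
param_dist = {"n_estimators": randint(20, 60),
              "max_depth": [5, 10, None],
              "min_samples_split": randint(2, 8)}

searches = {
    "grid": GridSearchCV(RandomForestClassifier(random_state=7), param_grid,
                         cv=5, scoring="accuracy"),
    "random": RandomizedSearchCV(RandomForestClassifier(random_state=7), param_dist,
                                 n_iter=6, cv=5, scoring="accuracy", random_state=7),
}

summary = {}
for name, search in searches.items():
    start = time.perf_counter()
    search.fit(X_train, y_train)          # 5-fold CV for every configuration
    elapsed = time.perf_counter() - start
    summary[name] = {
        "best_cv_score": search.best_score_,
        "test_score": search.score(X_test, y_test),  # refit best model on hold-out
        "seconds": elapsed,
        "best_params": search.best_params_,
    }

for name, info in summary.items():
    print(name, info)
```

Grid Search cost grows multiplicatively with each added parameter list, while Random Search cost is fixed by `n_iter`, which is why comparing computation time alongside scores is informative.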

### 3.3 Saving Optimized Models

At the end of optimization, we saved:

- `random_forest_optimized_grid.pkl` – best model from Grid Search
- `random_forest_optimized_random.pkl` – best model from Random Search
- `best_optimized_model.pkl` – best overall optimized model
- `optimized_model_comparison.csv` – comparison of methods and scores

This project is conceptually similar to **Phase 3: Hyperparameter Tuning** in the ml-project, but here it is isolated as a dedicated **Optimization** project.

---

## 4. Cardiovascular Disease – Machine Learning (ml-project)

In the original **ml-project**, the Machine Learning part used a **cardiovascular disease dataset** (70,000 records). After Data Mining (EDA, preprocessing, and feature engineering), we trained and compared several models.

### 4.1 Data Preparation

- Input: `cardio_engineered.csv` (cleaned and feature-engineered dataset)
- Train/test split:
  - 80% train, 20% test
  - Stratified by target (disease vs no disease)
- `StandardScaler` applied to continuous features for distance- and margin-based models
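
One way to scale only the continuous columns is scikit-learn's `ColumnTransformer`, sketched below. The column names (`age_years`, `bmi`, `smoker`, `cardio`) and the random data are hypothetical placeholders, not the actual schema of `cardio_engineered.csv`, and the project may have implemented this step differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature dataset; column names are assumptions.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age_years": rng.normal(53, 7, n),   # continuous
    "bmi": rng.normal(27, 5, n),         # continuous
    "smoker": rng.integers(0, 2, n),     # binary flag, left unscaled
    "cardio": rng.integers(0, 2, n),     # target: disease vs no disease
})

X, y = df.drop(columns="cardio"), df["cardio"]

# 80% train / 20% test, stratified by the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale continuous columns; pass the remaining columns through unchanged.
continuous = ["age_years", "bmi"]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), continuous)], remainder="passthrough")

X_train_p = preprocess.fit_transform(X_train)  # fit on train only
X_test_p = preprocess.transform(X_test)

print(X_train_p[:3])
```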

### 4.2 Models Trained

We trained 6 classification algorithms:

1. Logistic Regression
2. Decision Tree
3. Random Forest
4. K-Nearest Neighbors (KNN)
5. Support Vector Machine (SVM)
6. Naive Bayes

We evaluated each model using:

- Accuracy
- Precision
- Recall
- F1-Score
- ROC-AUC

The **best model** was **Random Forest**, achieving approximately:

- Accuracy ≈ 73.76%
- ROC-AUC ≈ 0.80
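
For reference, the five metrics listed above can be computed for a binary classifier as in the sketch below. It uses synthetic data, so the printed numbers are unrelated to the cardiovascular results reported here. Note that ROC-AUC is computed from predicted probabilities rather than hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic binary dataset standing in for cardio_engineered.csv (assumption).
X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3)

model = RandomForestClassifier(n_estimators=100, random_state=3).fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels for threshold metrics
y_score = model.predict_proba(X_test)[:, 1]  # positive-class probability for ROC-AUC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```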

### 4.3 Optimization and Cross-Validation

In the ml-project, we also performed:

- **Grid Search** for Random Forest and Logistic Regression
- Partial (resource-limited) Grid Search for SVM
- 5-fold cross-validation to estimate model stability
- Feature importance analysis for Random Forest
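
The last two items can be illustrated with a short sketch: 5-fold cross-validation to gauge stability (mean and spread of fold scores), followed by impurity-based feature importances from a fitted Random Forest. The dataset and feature names are synthetic placeholders, and the shuffled `StratifiedKFold` is an assumption about how the folds were built.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary data with placeholder feature names (assumption).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=5)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=5)

# 5-fold stratified cross-validation: one accuracy score per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Impurity-based importances from a model fitted on all data; they sum to 1.
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

A small standard deviation across folds suggests the model's performance does not depend heavily on which records land in the training set.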

These steps correspond to what we later implemented as a separate **Optimization** project for steel.

---

## 5. Comparison and Key Takeaways (Machine Learning)

Despite using different datasets and domains (industrial vs medical), the Machine Learning workflow is very similar in both projects.

### 5.1 Similarities

- **Multiple algorithms** evaluated (LR, DT, RF, KNN, SVM, NB; plus NN and Gradient Boosting for steel)
- **Train/test split** with stratification
- **Scaling** with `StandardScaler` where appropriate
- Use of **standard metrics** (Accuracy, Precision, Recall, F1; plus ROC-AUC for the binary cardio task)
- **Random Forest** and other tree-based ensembles performed best overall
- **Optimization** using Grid Search / Random Search and cross-validation

### 5.2 Differences

- **Domain:**
  - Steel project: industrial **fault type** prediction (multi-class)
  - Cardio project: medical **disease presence** prediction (binary)

- **Feature types:**
  - Steel: mainly **geometric** and **intensity** features engineered from images
  - Cardio: **clinical** and **lifestyle** features, plus engineered medical indicators (BMI, MAP, etc.)

- **Project structure:**
  - ml-project: Data Mining, ML, and Optimization all in one project (with phases in a single academic report)
  - Steel_Fault_Project_v2: three separate but connected projects: Data Mining, Machine Learning, Optimization

### 5.3 Lessons Learned

From the Machine Learning perspective, we learned that:

- Good **data preparation and feature engineering** (from the Data Mining project) are crucial for strong ML performance
- **Tree-based ensemble methods** (especially Random Forest) are robust baselines for both industrial and medical classification problems
- **Hyperparameter optimization** (Grid Search, Random Search) provides consistent but sometimes modest improvements over well-chosen default settings
- **Cross-validation** is essential for reliable performance estimates and model selection

---

## 6. Course Learning Outcomes (Machine Learning)

Through these projects, we achieved the following learning outcomes for the **Machine Learning and Pattern Recognition** course:

- Implementing and comparing multiple classification algorithms on real datasets
- Applying proper **train/test splitting** and **scaling** strategies
- Using a wide range of **evaluation metrics** (Accuracy, Precision, Recall, F1, ROC-AUC)
- Performing **hyperparameter tuning** with `GridSearchCV` and `RandomizedSearchCV`
- Understanding **bias-variance trade-offs** across models
- Saving trained models and building **reproducible ML pipelines**

Together with the Data Mining work, this Machine Learning project for steel faults mirrors the structure and rigor of the original **ml-project**, but in an industrial setting instead of healthcare.


## Executive Summary

This academic report summarizes the **Machine Learning** work from **two full projects**:

1. **Steel Plates Fault Detection – Machine Learning Project** (`Steel_Fault_Project` / `Project_2_MachineLearning`)
2. **Cardiovascular Disease Prediction – Machine Learning Part** of the original `ml-project`

In the **Steel Fault** project, we used the engineered dataset (`steel_plates_engineered.csv`, ~50 features) produced by Data Mining and trained **8 classification models** (Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors, Support Vector Machine, Naive Bayes, Neural Network, Gradient Boosting). We performed stratified train/test splitting, feature scaling where needed, compared models using Accuracy, Precision, Recall, and F1, and saved all trained models plus the best model and scaler for later use. A separate Optimization project then applied Grid Search and Random Search to further tune Random Forest.

In the **Cardiovascular** ml-project, we used the cleaned and engineered medical dataset (`cardio_engineered.csv`) to train multiple algorithms (Logistic Regression, Decision Tree, Random Forest, KNN, SVM, Naive Bayes), evaluated them with Accuracy, F1, and ROC-AUC, and applied Grid Search hyperparameter tuning (especially for Random Forest and Logistic Regression). The best model (Random Forest) achieved around **73–74% accuracy** with ROC-AUC ≈ 0.80.

These two Machine Learning projects show that the same pipeline (**train/test split, scaling, multi-model comparison, hyperparameter optimization, and model saving**) can be applied successfully to both an **industrial** fault detection problem and a **medical** disease prediction problem.

---
