# Iris Species Classification

## Project Overview

The **Iris dataset** is one of the most famous datasets in the field of machine learning. It was originally used in R.A. Fisher's classic 1936 paper *The Use of Multiple Measurements in Taxonomic Problems*. This dataset contains 150 samples from three different species of Iris flowers: *Iris-setosa*, *Iris-versicolor*, and *Iris-virginica*, with four features recorded for each sample.

The objective of this project is to build a classification model that predicts the species of an Iris flower based on the given features, using machine learning techniques.

## Objective

The goal of this project is to:

1. Understand and visualize the dataset.
2. Train a classification model to accurately predict the species of Iris flowers based on the four features.
3. Evaluate the performance of different classification models using metrics such as accuracy, precision, recall, and F1-score.

## Dataset Description

The dataset includes the following columns:

- **Id**: Unique identifier for each sample.
- **SepalLengthCm**: Sepal length of the flower in centimeters.
- **SepalWidthCm**: Sepal width of the flower in centimeters.
- **PetalLengthCm**: Petal length of the flower in centimeters.
- **PetalWidthCm**: Petal width of the flower in centimeters.
- **Species**: The species of the flower (*Iris-setosa*, *Iris-versicolor*, or *Iris-virginica*).

### Summary:

- **Number of Observations**: 150
- **Number of Features**: 4 input features, plus 1 target variable (`Species`)
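The figures above can be sanity-checked in a few lines. The sketch below is an assumption-laden stand-in: it uses the copy of Iris bundled with scikit-learn (`load_iris`) rather than this project's CSV, so its column names (for example `sepal length (cm)`) differ from `SepalLengthCm` and it has no `Id` column.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the project's CSV.
iris = load_iris(as_frame=True)
features = iris.data    # measurement columns only
target = iris.target    # integer-encoded species (0, 1, 2)

print(features.shape)                                  # (150, 4)
print(list(iris.target_names))                         # ['setosa', 'versicolor', 'virginica']
print(target.value_counts().sort_index().to_dict())    # {0: 50, 1: 50, 2: 50}
```

The three classes are balanced at 50 samples each, which is why plain accuracy is a reasonable headline metric for this problem.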

### Problem Statement

- **Model Training**: Train a machine learning model on the data so that it can predict the species of an Iris flower.
- **Model Evaluation**: Evaluate the performance of the trained model using metrics such as accuracy, precision, recall, and F1 score.
- **Model Optimization**: Optimize the trained model using cross-validation and hyperparameter tuning so that it predicts the species more accurately.
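The optimization step is not shown in this notebook. One possible shape for it, as a minimal sketch, is a grid search with stratified cross-validation. The parameter grid, the fold count, and the use of scikit-learn's bundled Iris data are all assumptions made for illustration, not settings taken from this project.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Assumption: bundled Iris data in place of Iris_cleaned.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Hypothetical search space; adjust to taste.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.1, 1],
    "kernel": ["rbf", "linear"],
}

# 5-fold stratified CV on the training portion only, so the test set stays unseen.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```

Tuning against cross-validation folds rather than the test set keeps the final test score an honest estimate.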



### Load Libraries

In [1]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.model_selection import train_test_split

# Model and Evaluation Metrics
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Settings

In [2]:
# Warning
warnings.filterwarnings("ignore")
# Path
data_path = "../data"
model_path = "../models"
csv_path = os.path.join(data_path, "Iris_cleaned.csv")

### Load Data

In [3]:
df = pd.read_csv(csv_path)

In [4]:
# Check data
df.head()

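| Unnamed: 0 | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |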


### Preprocessing

In [5]:
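# Separate the input and output features.
# "Unnamed: 0" is the saved row index, not a flower measurement, so it is excluded from X.
feature_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
X = df[feature_cols]
y = df["Species"]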

In [6]:
# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
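To make the effect of `test_size=0.4` concrete, the sketch below reproduces the split on scikit-learn's bundled Iris data (an assumption, used in place of the project's CSV) and prints the resulting sizes. The `stratify=y` variant is an optional extra that the notebook does not use; it is shown only to illustrate how class balance in the test set can be enforced.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data in place of Iris_cleaned.csv.
X, y = load_iris(return_X_y=True)

# Same settings as the notebook cell: 40% held out, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)
print(len(X_train), len(X_test))  # 90 60

# Optional variant (not used in the notebook): keep class proportions equal.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
print(sorted(Counter(y_test_strat.tolist()).items()))  # [(0, 20), (1, 20), (2, 20)]
```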

### Model Training and Evaluation

In [7]:
# Define a function to train the model and evaluate
def train_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)

    # Make prediction on training and testing data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Print model evaluation scores
    print("=" * 60)
    print("EVALUATION METRICS FOR TRAINING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred): 0.2f}")
    print(f"Precision: {precision_score(y_train, y_train_pred, average='weighted'): 0.2f}")
    print(f"Recall: {recall_score(y_train, y_train_pred, average='weighted'): 0.2f}")
    print(f"F1: {f1_score(y_train, y_train_pred, average='weighted'): 0.2f}")
    print("=" * 60)
    print("EVALUATION METRICS FOR TESTING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred): 0.2f}")
    print(f"Precision: {precision_score(y_test, y_test_pred, average='weighted'): 0.2f}")
    print(f"Recall: {recall_score(y_test, y_test_pred, average='weighted'): 0.2f}")
    print(f"F1: {f1_score(y_test, y_test_pred, average='weighted'): 0.2f}")

In [9]:
# Try with SVM Classifier
svc = SVC()
train_evaluate(svc)

EVALUATION METRICS FOR TRAINING
Accuracy:  0.96
Precision:  0.96
Recall:  0.96
F1:  0.96
EVALUATION METRICS FOR TESTING
Accuracy:  1.00
Precision:  1.00
Recall:  1.00
F1:  1.00
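The notebook imports `pickle` and defines `model_path`, but the step that saves the trained model is not shown. A minimal sketch of what such a step might look like follows. The file name `svc_iris.pkl`, the temporary directory, and the use of scikit-learn's bundled Iris data are hypothetical choices made so the example runs on its own.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Assumption: bundled Iris data in place of Iris_cleaned.csv.
X, y = load_iris(return_X_y=True)
model = SVC().fit(X, y)

# Hypothetical location and file name; the notebook only defines model_path = "../models".
model_dir = tempfile.mkdtemp()
model_file = os.path.join(model_dir, "svc_iris.pkl")

# Serialize the fitted estimator to disk.
with open(model_file, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm it reproduces the original predictions.
with open(model_file, "rb") as f:
    restored = pickle.load(f)

print((restored.predict(X) == model.predict(X)).all())  # True
```

Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them, and only from trusted sources.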


### Conclusion

The evaluation metrics for the Iris species classification using the SVM classifier indicate excellent performance, particularly on the test set, where the model achieves perfect scores. The results are discussed in detail below.

#### Training Metrics:

- **Accuracy (0.96)**: The model correctly classifies **96%** of the training data. This is a strong performance, indicating that the model has learned the patterns in the training data well.
- **Precision, Recall, F1 Score (all 0.96)**: Each of these weighted-average metrics is **0.96**, which suggests that the model's predictions for each species are usually correct (high precision) and that it finds most of the samples of each species (high recall). The F1 score, the harmonic mean of precision and recall, reflects the same balance.
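It may help to see how the `average='weighted'` scores are assembled from per-class values. The sketch below works through the arithmetic with NumPy on a made-up 3-class confusion matrix; the counts are purely illustrative and are not this model's results.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [20, 0, 0],
    [0, 18, 2],
    [0, 1, 19],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)            # number of true samples per class
predicted_totals = cm.sum(axis=0)   # number of predictions per class

precision = true_positives / predicted_totals
recall = true_positives / support
f1 = 2 * precision * recall / (precision + recall)

# Weighted average: each class contributes in proportion to its support.
weights = support / support.sum()
weighted_precision = float(np.sum(weights * precision))
weighted_recall = float(np.sum(weights * recall))
weighted_f1 = float(np.sum(weights * f1))
accuracy = float(true_positives.sum() / cm.sum())

print(f"precision={weighted_precision:.3f} recall={weighted_recall:.3f} "
      f"f1={weighted_f1:.3f} accuracy={accuracy:.3f}")
```

Support-weighted recall always equals accuracy, because the per-class supports cancel out in the sum. That is why the accuracy and recall lines in the output are identical.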

#### Testing Metrics:

- **Accuracy (1.00)**: The model achieves **100%** accuracy on the test set, meaning it correctly classified every test example.
- **Precision, Recall, F1 Score (all 1.00)**: Precision, recall, and F1 scores of **1.00** suggest that the model perfectly identified every species in the test data without any false positives or false negatives. The model is making flawless predictions on the unseen data.

#### Explanation of Performance:

- **High Training Accuracy (96%) with Perfect Test Accuracy (100%):** The training accuracy shows that the SVM classifier fits the training data well, capturing the distinctions between the Iris species. The test accuracy (**100%**) being slightly higher than the training accuracy (**96%**) is not a concern in itself: with a 40% hold-out of 150 observations, the test set has only 60 samples, so a gap of this size can come from the particular random split. On this split, the model generalizes well to unseen data.
- **Perfect Metrics on the Test Set:** Achieving perfect precision, recall, and F1 scores on a test set is unusual in general but plausible for a simple problem like Iris species classification. The dataset is small and well-behaved: there is no missing data, *Iris-setosa* is linearly separable from the other two species, and *Iris-versicolor* and *Iris-virginica* overlap only slightly, which makes it easier for the SVM model to achieve perfect results.
- **No Overfitting or Underfitting:** The model shows no signs of overfitting or underfitting on this split. Overfitting occurs when a model performs well on training data but poorly on testing data, which is not the case here. Underfitting occurs when the model performs poorly on both sets, which is also not the case. The small gap between training and test accuracy (96% vs. 100%) suggests that the model has learned the decision boundaries between species without memorizing the training data.
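Two of the points above can be probed directly. The sketch below is one possible way to do so, assuming scikit-learn's bundled Iris data in place of the project's CSV. It first checks how well a linear boundary separates the species, then compares training and validation accuracy across five stratified folds instead of a single split. The fold count and the `C=1000` setting are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Assumption: bundled Iris data in place of Iris_cleaned.csv.
X, y = load_iris(return_X_y=True)


def linear_fit_accuracy(features, labels):
    """Training accuracy of a linear SVM with a large C (close to hard-margin)."""
    clf = SVC(kernel="linear", C=1000).fit(features, labels)
    return clf.score(features, labels)


# Check 1: which class pairs can a single hyperplane separate?
setosa_vs_rest = linear_fit_accuracy(X, (y == 0).astype(int))
not_setosa = y != 0
versicolor_vs_virginica = linear_fit_accuracy(X[not_setosa], y[not_setosa])
print(f"setosa vs rest:          {setosa_vs_rest:.2f}")
print(f"versicolor vs virginica: {versicolor_vs_virginica:.2f}")

# Check 2: train vs. validation accuracy over 5 stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    SVC(), X, y, cv=cv, scoring="accuracy", return_train_score=True
)
train_mean = scores["train_score"].mean()
test_mean = scores["test_score"].mean()
test_std = scores["test_score"].std()
print(f"train accuracy: {train_mean:.3f}")
print(f"validation accuracy: {test_mean:.3f} +/- {test_std:.3f}")
```

A linear fit reaches 100% only for setosa versus the rest. Averaging over folds also gives a steadier picture of the train/test gap than one 60-sample hold-out.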