# **Environment Setup**

This section sets up the notebook. We load the essential libraries and set a random seed for reproducibility, establishing a working environment for the subsequent tasks.

In [1]:
# Common
import keras
import numpy as np

# Datasets
import pandas as pd
from sklearn.datasets import load_iris

# Data Visualization
import plotly.express as px

# Data Processing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Classification Models
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Model Evaluation
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

In [2]:
# Setting up random seed for reproducibility
np.random.seed(42)

# **Data Loading**

Let's initiate the **notebook** by **importing the dataset** into our environment. We'll utilize the **Iris dataset** available within the **Scikit-learn library**. This dataset will serve as the foundation for **data processing, visualization, and subsequent analytical processes.**

In [3]:
# Loading processed data
data = load_iris()
features = data['data']
targets = data['target']
class_names = data['target_names']
feature_names = data['feature_names']

# Converting into the data frame
df = pd.DataFrame({
    feature_names[0]:features[:, 0],
    feature_names[1]:features[:, 1],
    feature_names[2]:features[:, 2],
    feature_names[3]:features[:, 3],
    'label':[class_names[label] for label in targets]
})

# Quick look
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


Upon inspecting the **DataFrame**, we observe four **primary features: sepal length, sepal width, petal length, and petal width.** Additionally, there are **three distinct classes** of **iris plants: Setosa, Versicolor, and Virginica.**

In [4]:
n_samples = df.shape[0]
print(f"Total Number of Samples: {n_samples}")

Total Number of Samples: 150


The dataset comprises 150 samples, which might be considered small for some tasks, but for our objective of constructing a machine learning or predictive model, this size suffices.


In total, the DataFrame holds 750 values: 150 samples, each with five attributes. Four of these attributes are the features used for modeling, while the fifth is the target label.
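The count above can be checked directly. The following is a small sketch that rebuilds a frame of the same shape from `load_iris`; the name `iris_df` is used only for illustration and is not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild a frame with four feature columns plus a label column
iris = load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["label"] = iris["target_names"][iris["target"]]

n_rows, n_cols = iris_df.shape
total_values = iris_df.size                 # rows * columns
feature_values = n_rows * (n_cols - 1)      # excluding the label column

print(f"Rows: {n_rows}, columns: {n_cols}")
print(f"Total values: {total_values}")
print(f"Feature values only: {feature_values}")
```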

In [5]:
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
label                0
dtype: int64

The dataset contains no missing values, which reduces the need for extensive preprocessing. However, applying a standard scaler might still be beneficial.

Scaling often helps, particularly for deep learning models and for scale-sensitive machine learning models such as SVMs and logistic regression; tree-based models are largely unaffected by it. Standardization puts all features on a common scale (zero mean, unit variance), which can aid the model's training process and overall performance.
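To make the effect of standardization concrete, here is a minimal sketch that scales the four Iris features and prints the column means and standard deviations before and after. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris()["data"]

# Standardize each column: subtract its mean, divide by its standard deviation
X_std = StandardScaler().fit_transform(X)

print("Means before:", X.mean(axis=0).round(2))
print("Stds before: ", X.std(axis=0).round(2))
print("Means after: ", X_std.mean(axis=0).round(2))
print("Stds after:  ", X_std.std(axis=0).round(2))
```

After scaling, every column should have a mean close to 0 and a standard deviation close to 1, so no single feature dominates purely because of its units.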

# **Exploratory Data Analysis**

Since our dataset arrives preprocessed, skipping the preprocessing steps streamlines our analysis. We'll dive straight into data analysis, focusing on the four features and their correlations, both amongst themselves and notably, with the target variable.

But before that, we need to make sure that our data is not imbalanced.

In [6]:
# Class Distribution
class_dis = df.label.value_counts()

# Visualize the class distribution
class_dis_pie_fig = px.pie(
    names = class_dis.index,
    values = class_dis.values,
    hole = 0.4,
    title = "Class Distribution Donut Plot"
)
class_dis_pie_fig.show()

class_dis_bar_fig = px.bar(
    y = class_dis.index,
    x = class_dis.values,
    color = class_dis.index,
    title = "Class Distribution Bar Plot"
)
class_dis_bar_fig.show()

Excellent news! Examining the class distribution reveals a well-balanced dataset, showcasing an equal distribution of samples across classes. Each class represents precisely 33.3% of the data, equating to one-third of the dataset.

With 50 samples allocated to each class, totaling 150 samples, this uniform distribution ensures parity among the classes and prevents any class imbalance issues within our dataset.
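The same balance can be confirmed numerically, without relying on the charts. This is a small sketch that rebuilds the label column from `load_iris`; the names `labels`, `counts`, and `shares` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
labels = pd.Series(iris["target_names"][iris["target"]], name="label")

# Absolute counts and relative shares per class
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

print(pd.DataFrame({"count": counts, "share": shares.round(3)}))
```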

In [7]:
# Histogram for Sepal length.
hist_fig = px.histogram(
    df, x="sepal length (cm)",
    text_auto = True,
    color = "label",
    barmode = "group",
    title = "Sepal Length Histogram"
)
hist_fig.show()

# Box Plot for Sepal length.
box_fig = px.box(
    df, x="sepal length (cm)",
    color = "label",
    title = "Sepal Length Box Plot"
)
box_fig.show()

# Violin Plot for Sepal length.
violin_fig = px.violin(
    df, x="sepal length (cm)",
    color = "label",
    title = "Sepal Length Violin Plot"
)
violin_fig.show()

Examining the histograms, box plot, and violin plot individually reveals insightful patterns within the dataset.

Observing the histogram for sepal length, the values range from roughly 4 to 8 cm and look approximately bell-shaped. Each class on its own also follows a roughly normal distribution, but when the classes are overlaid, their ranges overlap.

The box plot confirms the pattern seen in the histogram: the boxes of neighboring classes overlap. As sepal length increases, the most likely class changes: lower values tend towards Setosa, intermediate values towards Versicolor, and higher values towards Virginica. This ordering is visible in the distinct median values across classes.

Similarly, the violin plot echoes these characteristics, highlighting the absence of subgroups within the data but reaffirming the previously observed distribution patterns and overlaps among the classes.
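The ordering of the classes along sepal length can also be read from a per-class summary table. A minimal sketch, assuming a frame built the same way as `df` (here called `iris_df` for illustration):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["label"] = iris["target_names"][iris["target"]]

# Per-class range and central tendency for sepal length
summary = (
    iris_df.groupby("label")["sepal length (cm)"]
    .agg(["min", "median", "mean", "max"])
)
print(summary)
```

Comparing each class's `min` and `max` against its neighbors shows where the ranges overlap, while the `median` column reflects the center line drawn in each box.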

In [8]:
# Histogram for Sepal width.
hist_fig = px.histogram(
    df, x="sepal width (cm)",
    text_auto = True,
    color = "label",
    barmode = "group",
    title = "Sepal width Histogram"
)
hist_fig.show()

# Box Plot for Sepal width.
box_fig = px.box(
    df, x="sepal width (cm)",
    color = "label",
    title = "Sepal width Box Plot"
)
box_fig.show()

# Violin Plot for Sepal width.
violin_fig = px.violin(
    df, x="sepal width (cm)",
    color = "label",
    title = "Sepal width Violin Plot"
)
violin_fig.show()

The analysis of sepal width unveils an interesting pattern akin to sepal length. It exhibits a distribution tending towards normalcy; however, classifying based on values becomes notably challenging due to heightened overlaps among the classes.

All three visualizations (violin plot, box plot, and histogram) highlight this extensive overlap. The violin plot shows no clear subgroups within the data, while the box plot and histogram are consistent with a roughly normal distribution. Even so, separating the classes on sepal width alone remains difficult because the class medians lie close together.
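The visual impression of normality can be probed with a formal test. The sketch below uses SciPy's Shapiro-Wilk test per class; SciPy is not used elsewhere in this notebook, so treat this as an optional check rather than part of the original analysis.

```python
import pandas as pd
from scipy.stats import shapiro
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["label"] = iris["target_names"][iris["target"]]

# Shapiro-Wilk test per class: a small p-value is evidence against normality
p_values = {}
for label, group in iris_df.groupby("label"):
    statistic, p_value = shapiro(group["sepal width (cm)"])
    p_values[label] = p_value
    print(f"{label:<12} W = {statistic:.3f}, p = {p_value:.3f}")
```

With only 50 samples per class, the test has limited power, so a large p-value means only that normality is not rejected, not that it is proven.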

In [9]:
# Histogram for petal length.
hist_fig = px.histogram(
    df, x="petal length (cm)",
    text_auto = True,
    color = "label",
    barmode = "group",
    title = "Petal length Histogram"
)
hist_fig.show()

# Box Plot for petal length.
box_fig = px.box(
    df, x="petal length (cm)",
    color = "label",
    title = "Petal length Box Plot"
)
box_fig.show()

# Violin Plot for petal length.
violin_fig = px.violin(
    df, x="petal length (cm)",
    color = "label",
    title = "Petal length Violin Plot"
)
violin_fig.show()

The petal length feature presents a distinctive characteristic distinguishing it from the others. Notably, there's a clear demarcation in values among the classes, primarily facilitating the separation of Setosa from the other two classes.

Setosa's petal length typically lies between 1 and 2 cm, while Versicolor spans roughly 3 to 5.5 cm and Virginica roughly 4 to 8 cm. This pronounced difference allows for a clear separation of Setosa from the rest.

However, for Versicolor and Virginica, there's a slight overlap visible in the histogram. Although this overlap exists, the number of data points within this region is minimal, resulting in lower uncertainty during classification.
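One way to put a number on this overlap is to place a single cut on petal length and count how many Versicolor and Virginica samples land on the wrong side. The cut used here (halfway between the two class medians) is a hypothetical choice for illustration, not a rule from the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["label"] = iris["target_names"][iris["target"]]

col = "petal length (cm)"
pair = iris_df[iris_df["label"] != "setosa"]

# Hypothetical cut: halfway between the two class medians
medians = pair.groupby("label")[col].median()
threshold = medians.mean()

# Predict Virginica above the cut, Versicolor at or below it
predicted = pair[col].gt(threshold).map({True: "virginica", False: "versicolor"})
n_wrong = int((predicted != pair["label"]).sum())

print(f"Threshold: {threshold:.2f} cm")
print(f"Misplaced samples: {n_wrong} of {len(pair)}")
```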

In [10]:
# Histogram for petal width.
hist_fig = px.histogram(
    df, x="petal width (cm)",
    text_auto = True,
    color = "label",
    barmode = "group",
    title = "Petal width Histogram"
)
hist_fig.show()

# Box Plot for petal width.
box_fig = px.box(
    df, x="petal width (cm)",
    color = "label",
    title = "Petal width Box Plot"
)
box_fig.show()

# Violin Plot for petal width.
violin_fig = px.violin(
    df, x="petal width (cm)",
    color = "label",
    title = "Petal width Violin Plot"
)
violin_fig.show()

Intriguingly, a notable observation arises with petal width, akin to petal length. Once more, Setosa stands distinct, easily discernible from the other classes.

However, within the Setosa class there is pronounced skewness. Petal width is recorded to one decimal place, so the samples pile up on a handful of discrete values, which show up as separate clusters in the plots. The few samples in the long upper tail may be outliers or uncommon cases.

Despite this, the counts at the most common values remain considerably high. The skewness is also visible in the box plot, where the median sits very close to the first quartile, signifying a deviation from a symmetric distribution.
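The skewness described here can be quantified with pandas. A minimal sketch, again using an illustrative `iris_df` built from `load_iris`:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])
iris_df["label"] = iris["target_names"][iris["target"]]

setosa_pw = iris_df.loc[iris_df["label"] == "setosa", "petal width (cm)"]

# How many samples fall on each distinct measured value
print(setosa_pw.value_counts().sort_index())

# Sample skewness: positive values indicate a longer right tail
skewness = setosa_pw.skew()
q1 = setosa_pw.quantile(0.25)
median = setosa_pw.median()
print(f"Skewness: {skewness:.2f}, Q1: {q1:.2f}, median: {median:.2f}")
```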

In [11]:
# Visual Understanding of the Statistical inferences
scatter_plot = px.scatter(
    df,
    x = "sepal length (cm)", y = "sepal width (cm)",
    color = "label",
    title = "Sepal length vs Sepal width",
    symbol = "label",
    marginal_x = "box",
    marginal_y = "box",
    height = 800,
    width = 1400
)
scatter_plot.show()

In [12]:
# Visual Understanding of the Statistical inferences
density_plot = px.density_contour(
    df,
    x = "sepal length (cm)", y = "sepal width (cm)",
    color = "label",
    title = "Sepal length vs Sepal width",
    marginal_x = "box",
    marginal_y = "box",
    height = 800,
    width = 1400
)
density_plot.show()

Analyzing the relationship between sepal length and sepal width through a scatter plot reveals an interesting pattern. Setosa stands distinctly apart from the other classes in terms of both these attributes.

Setosa's points sit towards the left of the plot (shorter sepals) and towards the top (wider sepals), making the class notably separable from the other two.

However, for Versicolor and Virginica, there's evident overlap, noticeable in both the marginal plots and the scatter plot itself. This overlap underscores the challenges in delineating boundaries between Versicolor and Virginica based on these attributes.

In [13]:
# Visual Understanding of the Statistical inferences
scatter_plot = px.scatter(
    df,
    x = "petal length (cm)", y = "petal width (cm)",
    color = "label",
    title = "Petal length vs Petal width",
    symbol = "label",
    marginal_x = "box",
    marginal_y = "box",
    height = 800,
    width = 1400
)
scatter_plot.show()

In [14]:
# Visual Understanding of the Statistical inferences
density_contour_plot = px.density_contour(
    df,
    x = "petal length (cm)", y = "petal width (cm)",
    color = "label",
    title = "Petal length vs Petal width",
    marginal_x = "box",
    marginal_y = "box",
    height = 800,
    width = 1400
)
density_contour_plot.show()

It appears that using just two features provides ample information for effective performance. Particularly, when considering petal width and petal length, all three classes exhibit notable separability.

Setosa distinctly stands apart, remarkably separated from the other classes. Versicolor and Virginica showcase a smoother blend between their boundaries. Though there exist some outliers in both classes that lie close to the boundary, overall, the transition region between these classes fades smoothly, presenting a gradual merging rather than abrupt separations.
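To illustrate how far the two petal features alone can go, the sketch below fits a shallow decision tree on just petal length and petal width and prints its split rules. The depth limit of 2 and the choice of a decision tree are assumptions made for this illustration; the score is computed on the same data used for fitting, so it describes separability rather than generalization.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
petal_names = iris["feature_names"][2:]   # petal length, petal width
X_petal = iris["data"][:, 2:]
y = iris["target"]

# A depth-2 tree can use at most two levels of threshold splits
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_petal, y)

print(export_text(tree, feature_names=petal_names))

train_accuracy = tree.score(X_petal, y)
print(f"Accuracy on the fitted data: {train_accuracy:.3f}")
```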

In [15]:
# Visual Understanding of the Statistical inferences
scatter_plot = px.scatter(
    df,
    x = "petal length (cm)", y = "sepal length (cm)",
    color = "label",
    title = "Petal length vs Sepal length",
    symbol = "label",
    marginal_x = "box",
    marginal_y = "box",
    height = 800,
    width = 1400
)
scatter_plot.show()

In [16]:
# Visual Understanding of the Statistical inferences
density_contour_plot = px.density_contour(
    df,
    x = "petal length (cm)", y = "sepal length (cm)",
    color = "label",
    title = "Petal length vs Sepal length",
    marginal_x = "box",
    marginal_y = "box",
    height = 800,
    width = 1400
)
density_contour_plot.show()

This plot, resembling the previous one, illustrates the relationship between sepal length and petal length. Just like before, this pair of features appears promising for classification purposes.

Setosa remains distinctly discernible, while Versicolor and Virginica exhibit a smoother transition. However, in this instance, the number of points around the boundaries or outliers seems slightly higher compared to the previous scatter plot. Nonetheless, these two features still present a viable option for classification purposes.

The density plot offers insights not distinctly visible in the scatter plot. It becomes evident that the density contours for Versicolor and Virginica overlap considerably, indicating a high degree of similarity between these classes. Intriguingly, the Virginica class appears to comprise two subgroups, further adding complexity to the delineation between these classes.

In [17]:
# Visual Understanding of the Statistical inferences
scatter_plot = px.scatter(
    df,
    x = "petal width (cm)", y = "sepal width (cm)",
    color = "label",
    title = "Petal width vs Sepal width",
    symbol = "label",
    marginal_x = "box",
    marginal_y = "box",
    height = 800,
    width = 1400
)
scatter_plot.show()

In [18]:
# Visual Understanding of the Statistical inferences
density_plot = px.density_contour(
    df,
    x = "petal width (cm)", y = "sepal width (cm)",
    color = "label",
    title = "Petal width vs Sepal width",
    marginal_x = "box",
    marginal_y = "box",
    height = 800,
    width = 1400
)
density_plot.show()

To my surprise, petal width and sepal width show a similar pattern.

In [19]:
# Visual Understanding of the Statistical inferences
scatter_plot = px.scatter_3d(
    df,
    x = "petal width (cm)", y = "sepal width (cm)", z = "petal length (cm)",
    color = "label",
    title = "Petal width vs Sepal width vs Petal length",
    symbol = "label",
)
scatter_plot.show()

In [20]:
# Visual Understanding of the Statistical inferences
scatter_plot = px.scatter_3d(
    df,
    x = "petal width (cm)", y = "sepal width (cm)", z = "sepal length (cm)",
    color = "label",
    title = "Petal width vs Sepal width vs Sepal length",
    symbol = "label",
)
scatter_plot.show()

In the 3D scatter plots, the Setosa class is distinctly separable, with clear boundaries. However, Versicolor and Virginica overlap at a few data points. This overlap could reduce classification accuracy, since distinguishing Versicolor from Virginica is harder than isolating Setosa in the regions of feature space where the two classes lie close together.

In [21]:
# Calculate the Spearman Correlation
# Drop non-numerical columns before calculating correlation
corr = df.drop('label', axis=1).corr(method="spearman")

# Visualize the Correlation Heatmap
corr_heatmap = px.imshow(corr, text_auto=True, title="Spearman Correlation")
corr_heatmap.show()

The relationships observed in the plots align with the correlation matrix. Because the label column was dropped before computing it, the matrix describes how the features relate to one another across the whole dataset, without splitting the data by class; it does not measure correlation with the target label directly.
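If a rough view of how each feature relates to the target is wanted, one option is to correlate the features with the integer class codes from `load_iris`. This is only a sketch: treating the codes 0, 1, 2 as ordered is an assumption that happens to be reasonable here but does not hold for categorical targets in general.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])

# Integer class codes: 0 = setosa, 1 = versicolor, 2 = virginica
target_codes = pd.Series(iris["target"], name="target")

# Spearman correlation of each feature column with the class codes
corr_with_target = iris_df.corrwith(target_codes, method="spearman")
print(corr_with_target.sort_values(ascending=False))
```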

# **ML Models**

Prior to advancing further, a crucial step involves meticulous data preprocessing to ensure compatibility with our machine learning models.

This involves two steps: standardizing the features, whose scales differ, and partitioning the dataset into separate training and testing sets for model evaluation. Standardization often helps scale-sensitive models such as SVMs and logistic regression.

In [22]:
# Applying Standard Scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

In [23]:
# Splitting data into training and testing set
x_train, x_test, y_train, y_test = train_test_split(
    X_scaled, targets,
    shuffle = True,
    random_state = 42,
    train_size = 0.9,
    test_size = 0.1
)

With preprocessing complete, the data is standardized and split, and we can move on to training and testing our predictive models.
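Note that the cells above fit the scaler on the full dataset before splitting, which lets test-set statistics influence the scaling. A common alternative, sketched here under the assumption that an SVM is the model of interest, is to split first and wrap the scaler and model in a scikit-learn pipeline so that the scaler is fitted on the training portion only. The names `X_tr`, `X_te`, and `pipe` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Split the raw features first, using the same proportions as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.9, shuffle=True, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses it for X_te
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

On a dataset this small and clean the difference is likely minor, but the pattern carries over unchanged to cross-validation and larger projects.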

In [24]:
def metrics(true, pred):
    """Return macro precision, macro recall, macro F1-score, and accuracy."""
    p = precision_score(true, pred, average='macro')
    r = recall_score(true, pred, average='macro')
    f1 = f1_score(true, pred, average='macro')
    acc = accuracy_score(true, pred)
    return p, r, f1, acc

In [25]:
# Record all the training and testing scores
model_names = []

train_precisions = []
test_precisions = []

train_recalls = []
test_recalls = []

train_f1s = []
test_f1s = []

train_accuracy_score = []
test_accuracy_score = []

In [26]:
# Initialize and train the models
models = {
    "SVM": SVC(),
    "XGBoost": XGBClassifier(),
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "LogisticRegression": LogisticRegression()
}

for model_name, model in models.items():
    # Training the model
    model.fit(x_train, y_train)

    # Predictions
    train_pred = model.predict(x_train)
    test_pred = model.predict(x_test)

    # Calculating metrics
    train_p, train_r, train_f1, train_acc = metrics(y_train, train_pred)
    test_p, test_r, test_f1, test_acc = metrics(y_test, test_pred)

    # Appending metrics to lists
    model_names.append(model_name)
    train_precisions.append(train_p)
    test_precisions.append(test_p)
    train_recalls.append(train_r)
    test_recalls.append(test_r)
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)
    train_accuracy_score.append(train_acc)
    test_accuracy_score.append(test_acc)

In [27]:
# # Print the appended results for each model
# for i in range(len(model_names)):
#     print("Model Name :", model_names[i])
#     print("Train Precision :", train_precisions[i])
#     print("Test Precision  :", test_precisions[i])
#     print("Train Recall  :", train_recalls[i])
#     print("Test Recall  :", test_recalls[i])
#     print("Train F1-score  :", train_f1s[i])
#     print("Test F1-score  :", test_f1s[i])
#     print("Train Accuracy  :", train_accuracy_score[i])
#     print("Test Accuracy  :", test_accuracy_score[i])
#     print("\n")

Given the number of models we're comparing and the number of metrics to compute, it's more efficient to run these evaluations in a single loop rather than computing them individually for each model.

Should you prefer a straightforward raw format for comparison, you can easily uncomment the previous code. However, for a more insightful and visual analysis of the models, let's proceed further for a visual comparison.
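As a design alternative to keeping nine parallel lists, the scores can be collected as one dictionary per model and turned into a DataFrame in a single step. The sketch below uses two of the models and a hypothetical helper, `score_split`, which is not part of the notebook; it trains on unscaled features purely to stay short.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.9, random_state=42)


def score_split(y_true, y_pred, split):
    """Hypothetical helper: the four metrics for one split, keyed like 'Train Precision'."""
    return {
        f"{split} Precision": precision_score(y_true, y_pred, average="macro"),
        f"{split} Recall": recall_score(y_true, y_pred, average="macro"),
        f"{split} F1-score": f1_score(y_true, y_pred, average="macro"),
        f"{split} Accuracy": accuracy_score(y_true, y_pred),
    }


candidates = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    row = {"Name": name}
    row.update(score_split(y_tr, model.predict(X_tr), "Train"))
    row.update(score_split(y_te, model.predict(X_te), "Test"))
    rows.append(row)

# One row per model, one column per metric
results = pd.DataFrame(rows)
print(results.round(3).T)
```

Adding a metric then means touching only the helper, rather than declaring and appending to two more lists.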

In [28]:
# Create a DataFrame for model evaluations
model_evals = pd.DataFrame(data={
    "Name": model_names,
    "Train Precision": train_precisions,
    "Test Precision": test_precisions,
    "Train Recall": train_recalls,
    "Test Recall": test_recalls,
    "Train F1-score": train_f1s,
    "Test F1-score": test_f1s,
    "Train Accuracy": train_accuracy_score,
    "Test Accuracy": test_accuracy_score
})

In [29]:
# Train Precision Bar Graph
train_precision_bar = px.bar(
    model_evals,
    x = "Name", y = "Train Precision",
    title = "Train Precision Bar Graph",
    color="Name"
)
train_precision_bar.update_layout(showlegend=False)
train_precision_bar.show()

# Test Precision Bar Graph
test_precision_bar = px.bar(
    model_evals,
    x = "Name", y = "Test Precision",
    title = "Test Precision Bar Graph",
    color="Name"
)
test_precision_bar.update_layout(showlegend=False)
test_precision_bar.show()

Considering the compact size of our dataset, most models exhibit nearly 100% precision in both training and testing segments, indicating a potential risk of overfitting.

However, my focus lies on three models that deviate from this pattern: the Support Vector Machine, Gaussian Naive Bayes, and Logistic Regression. While these models do not achieve 100% precision in training, they do achieve perfect precision in testing. This might suggest that they are slightly more robust, although with only 15 test samples the evidence is limited.

In [30]:
# Train Recall Bar Graph
train_Recall_bar = px.bar(
    model_evals,
    x = "Name", y = "Train Recall",
    title = "Train Recall Bar Graph",
    color="Name"
)
train_Recall_bar.update_layout(showlegend=False)
train_Recall_bar.show()

# Test Recall Bar Graph
test_Recall_bar = px.bar(
    model_evals,
    x = "Name", y = "Test Recall",
    title = "Test Recall Bar Graph",
    color="Name"
)
test_Recall_bar.update_layout(showlegend=False)
test_Recall_bar.show()

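These plots look almost identical to the precision plots above. With balanced classes and very few misclassifications, macro-averaged precision and recall tend to end up nearly equal.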
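The similarity between precision and recall can be traced back to the confusion matrix: precision divides the correct predictions by column totals, recall by row totals. The sketch below fits Gaussian Naive Bayes on the full dataset, for illustration only, and derives both metrics by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Fit and predict on the same data, purely to obtain a confusion matrix
y_pred = GaussianNB().fit(X, y).predict(X)

cm = confusion_matrix(y, y_pred)   # rows: true class, columns: predicted class
correct = np.diag(cm)

precision_per_class = correct / cm.sum(axis=0)   # correct / predicted as class
recall_per_class = correct / cm.sum(axis=1)      # correct / actually in class

print(cm)
print("Precision per class:", precision_per_class.round(3))
print("Recall per class:   ", recall_per_class.round(3))
```

When the few errors are spread fairly evenly between two classes, the row and column totals are almost the same, so the two metrics nearly coincide.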

In [31]:
# Train F1-score Bar Graph
train_f1_score_bar = px.bar(
    model_evals,
    x = "Name", y = "Train F1-score",
    title = "Train F1-score Bar Graph",
    color="Name"
)
train_f1_score_bar.update_layout(showlegend=False)
train_f1_score_bar.show()

# Test F1-score Bar Graph
test_f1_score_bar = px.bar(
    model_evals,
    x = "Name", y = "Test F1-score",
    title = "Test F1-score Bar Graph",
    color="Name"
)
test_f1_score_bar.update_layout(showlegend=False)
test_f1_score_bar.show()

In [32]:
# Train Accuracy Bar Graph
train_accuracy_bar = px.bar(
    model_evals,
    x = "Name", y = "Train Accuracy",
    title = "Train Accuracy Bar Graph",
    color="Name"
)
train_accuracy_bar.update_layout(showlegend=False)
train_accuracy_bar.show()

# Test Accuracy Bar Graph
test_accuracy_bar = px.bar(
    model_evals,
    x = "Name", y = "Test Accuracy",
    title = "Test Accuracy Bar Graph",
    color="Name"
)
test_accuracy_bar.update_layout(showlegend=False)
test_accuracy_bar.show()

All the graphs appear to be **uniform across the models**, making it **challenging to single out a best model**. This conundrum stems from **two key reasons.**

Firstly, the **models generally perform well**, which might suggest that **any model could suffice.**

However, the **second reason complicates matters.** The **dataset's size** is **exceptionally small**, making it **difficult to ascertain one model's superiority over another.**

The **scatter plots** revealed **overlaps between Virginica and Versicolor**, so models achieving **100% precision, recall, or accuracy** might be **overfitting, particularly in these overlapping regions.**
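With a test set of only 15 samples, a single split gives a coarse estimate. One common way to get a steadier comparison is stratified k-fold cross-validation, sketched below for four of the models. The choice of five folds, macro F1 as the score, and the fixed seeds are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Five stratified folds: each fold keeps the 1:1:1 class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "SVM": SVC(),
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

cv_means = {}
for name, model in candidates.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
    cv_means[name] = scores.mean()
    print(f"{name:<20} mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The spread across folds gives a sense of how much of the difference between models is noise, which a single 15-sample test set cannot show.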