In [1]:
# Step 1: Import necessary libraries
import pandas as pd

# Step 2: Load the dataset
df = pd.read_csv('/kaggle/input/irissss/iris_synthetic.csv')

# Step 3: Display first few rows
df.head()


Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.248357,3.125246,3.085502,0.96866,Iris-setosa
1,4.930868,3.173224,3.219909,1.477571,Iris-versicolor
2,5.323844,2.659988,3.873647,0.507137,Iris-virginica
3,5.761515,3.116127,3.805185,1.252023,Iris-setosa
4,4.882923,3.146536,3.489549,0.734871,Iris-versicolor


## Check for missing values

Encode the Species (target column) into numeric labels for the Naïve Bayes algorithm

In [2]:
# Step 1: Check for missing values
print(df.isnull().sum())

# Step 2: Encode the target labels (Species)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Species'] = le.fit_transform(df['Species'])  # Converts species to 0,1,2

# Step 3: Show updated DataFrame
df.head()


SepalLength    0
SepalWidth     0
PetalLength    0
PetalWidth     0
Species        0
dtype: int64


Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.248357,3.125246,3.085502,0.96866,0
1,4.930868,3.173224,3.219909,1.477571,1
2,5.323844,2.659988,3.873647,0.507137,2
3,5.761515,3.116127,3.805185,1.252023,0
4,4.882923,3.146536,3.489549,0.734871,1
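
To make the encoding step concrete, here is a minimal pure-Python sketch of the mapping that `LabelEncoder` produces: it sorts the distinct labels and numbers them from 0. The helper name `encode_labels` and the sample list are illustrative assumptions, not part of the notebook.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order (as LabelEncoder does)."""
    classes = sorted(set(labels))
    mapping = {name: code for code, name in enumerate(classes)}
    return [mapping[name] for name in labels], classes

# Species values taken from the first five rows shown above.
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica",
           "Iris-setosa", "Iris-versicolor"]
codes, classes = encode_labels(species)
print(codes)    # [0, 1, 2, 0, 1]
print(classes)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

In the notebook itself, `le.classes_` holds the same sorted list, and `le.inverse_transform` maps the numeric codes back to species names.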


## Train Naïve Bayes Classifier
We'll:

Split the dataset into training and testing sets.

Train a Gaussian Naïve Bayes model.

Make predictions on the test set.
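Naïve Bayes is a probabilistic model: it estimates the probability of each species given the features (measurements) and predicts the species with the highest probability. It is called "naïve" because it assumes that the features are independent of one another given the species. The Gaussian variant used here also assumes that each feature is normally distributed within a species.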
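
The idea can be sketched from scratch in a few lines. The code below is a simplified illustration, not scikit-learn's implementation: the function names, the toy measurements, and the small `1e-9` variance floor are all assumptions made for this sketch. It estimates a prior, a mean, and a variance per class and feature, then picks the class with the highest log-posterior.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate per-class prior, feature means, and feature variances."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    params = {}
    for label, members in grouped.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # The small constant keeps each variance strictly positive (assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*members), means)]
        params[label] = (n / len(rows), means, variances)
    return params

def log_posterior(params, row):
    """Unnormalised log P(class | row): log prior plus a sum of per-feature log densities."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        score = math.log(prior)
        # The "naive" assumption: per-feature log-likelihoods are simply added.
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(params, row):
    """Return the class with the highest log-posterior."""
    scores = log_posterior(params, row)
    return max(scores, key=scores.get)

# Toy, made-up measurements (two features, two classes), not the notebook's dataset.
toy_X = [[1.0, 0.2], [1.2, 0.3], [0.9, 0.1], [4.0, 1.5], [4.2, 1.4], [3.9, 1.6]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_params = fit_gaussian_nb(toy_X, toy_y)
print(predict(toy_params, [1.1, 0.2]))
print(predict(toy_params, [4.1, 1.5]))
```

Working in log space avoids numerical underflow when many small per-feature densities are multiplied together.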



In [3]:
# Step 1: Split features and target
X = df.drop('Species', axis=1)   # Features
y = df['Species']                # Target

# Step 2: Train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train the Naïve Bayes model
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = model.predict(X_test)

# Step 5: Display predictions
print("Predictions:", y_pred)


Predictions: [2 1 1 0 1 2 2 2 0 0 1 2 0 0 1 2 1 0 0 2 2 1 2 1 0 2 0 0 0 0 0 0 2 2 0 2 0
 2 0 1 1 0 1 2 1]


## Generate the confusion matrix.

Manually extract TP, FP, TN, FN.

Compute Accuracy, Error rate, Precision, and Recall.



In [4]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Step 1: Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# For multi-class problems, TP/FP/TN/FN are defined per class (one class vs. the rest).
# Macro averaging computes each metric per class, then takes the unweighted mean:

# Step 2: Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Step 3: Error rate
error_rate = 1 - accuracy
print("Error Rate:", error_rate)

# Step 4: Precision
precision = precision_score(y_test, y_pred, average='macro')  # macro: avg across classes
print("Precision:", precision)

# Step 5: Recall
recall = recall_score(y_test, y_pred, average='macro')
print("Recall:", recall)


Confusion Matrix:
 [[10  1  4]
 [ 5  6  5]
 [ 3  5  6]]
Accuracy: 0.4888888888888889
Error Rate: 0.5111111111111111
Precision: 0.4851851851851852
Recall: 0.49007936507936506


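## First row: Actual Setosa (row 1)
Rows are actual classes and columns are predicted classes, so the first row [10 1 4] describes the 15 flowers that are actually Setosa.

10: The model correctly predicted 10 Setosa flowers as Setosa (True Positive for Setosa).

1: The model incorrectly predicted 1 Setosa flower as Versicolor (False Negative for Setosa).

4: The model incorrectly predicted 4 Setosa flowers as Virginica (False Negative for Setosa).

False Positives for Setosa sit in the first column, below the diagonal: 5 Versicolor and 3 Virginica flowers were predicted as Setosa.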

Setosa is our positive class (flowers we want to identify).

Not Setosa is the negative class (all other types of flowers).


1. True Positive (TP)
True Positive means the model correctly predicted Setosa as Setosa.

Example: You have 10 Setosa flowers, and your model correctly predicts all 10 as Setosa.

2. False Positive (FP)
False Positive means the model incorrectly predicted Setosa when the flower was actually Not Setosa (it made a mistake).

Example: You have 5 non-Setosa flowers, and the model mistakenly predicts them as Setosa.

3. True Negative (TN)
True Negative means the model correctly predicted Not Setosa as Not Setosa.

Example: You have 15 non-Setosa flowers, and your model correctly predicts 15 of them as Not Setosa.

4. False Negative (FN)
False Negative means the model incorrectly predicted Not Setosa when the flower was actually Setosa.

Example: You have 10 Setosa flowers, and the model mistakenly predicts 3 of them as Not Setosa.
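
The four counts defined above can be read off the confusion matrix for each class in turn, treating that class as positive and the other two as negative. The sketch below hard-codes the matrix printed earlier; the helper name `one_vs_rest_counts` is hypothetical, and the class order (0 = Setosa, 1 = Versicolor, 2 = Virginica) assumes the alphabetical encoding shown in the encoded DataFrame.

```python
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm = [[10, 1, 4],
      [5, 6, 5],
      [3, 5, 6]]

def one_vs_rest_counts(matrix, k):
    """Return (TP, FP, FN, TN) treating class k as positive and all others as negative."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # predicted k, actually another class
    fn = sum(matrix[k]) - tp                  # actually k, predicted another class
    tn = total - tp - fp - fn                 # everything else
    return tp, fp, fn, tn

for k, name in enumerate(["Setosa", "Versicolor", "Virginica"]):
    tp, fp, fn, tn = one_vs_rest_counts(cm, k)
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

For Setosa this gives TP=10, FP=8, FN=5, TN=22, which sums to the 45 flowers in the test set.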


Accuracy
Accuracy is the proportion of all predictions that the model got right.

Formula:

Accuracy = Number of Correct Predictions / Total Number of Predictions


Error Rate
Error Rate is the proportion of incorrect predictions made by the model. It is the complement of accuracy.

Formula:

Error Rate = 1 − Accuracy
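
As a quick check, both formulas can be applied by hand to the confusion matrix printed earlier. This is a small sketch with the matrix hard-coded; the correct predictions are the diagonal entries.

```python
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm = [[10, 1, 4],
      [5, 6, 5],
      [3, 5, 6]]

correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal: 10 + 6 + 6
total = sum(sum(row) for row in cm)               # all test flowers
accuracy = correct / total
error_rate = 1 - accuracy

print(correct, total)
print(accuracy, error_rate)
```

With 22 correct predictions out of 45, this matches the accuracy of about 0.489 and the error rate of about 0.511 reported by scikit-learn above.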


Precision
Precision measures how many of the predicted positive cases (for each class) were actually correct. In simple terms, it’s about how many of the predicted flowers of a certain species are actually that species.

Formula:

Precision = True Positives / (True Positives + False Positives)
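
To see where the macro-averaged precision comes from, the formula can be applied to each class and the results averaged. The sketch below hard-codes the confusion matrix printed earlier; `precision_per_class` is a hypothetical helper, and it assumes every class is predicted at least once so that no column sum is zero.

```python
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm = [[10, 1, 4],
      [5, 6, 5],
      [3, 5, 6]]

def precision_per_class(matrix):
    """Precision for each class: diagonal entry divided by its column sum (TP + FP)."""
    n = len(matrix)
    result = []
    for k in range(n):
        predicted_k = sum(matrix[i][k] for i in range(n))
        result.append(matrix[k][k] / predicted_k)
    return result

per_class_precision = precision_per_class(cm)
macro_precision = sum(per_class_precision) / len(per_class_precision)
print(per_class_precision)
print(macro_precision)
```

The per-class values are 10/18, 6/12, and 6/15, and their unweighted mean is about 0.485, in line with the `average='macro'` result above.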


Recall
Recall (also called sensitivity) measures how many of the actual positive cases (for each class) were correctly predicted. In other words, it answers: How many of the actual Setosa flowers did the model correctly identify as Setosa?

Formula:

Recall = True Positives / (True Positives + False Negatives)
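
The same hand calculation works for recall, this time dividing each diagonal entry by its row sum. Again this is a sketch with the confusion matrix hard-coded; `recall_per_class` is a hypothetical helper, and it assumes every class appears at least once in the test set.

```python
# Confusion matrix copied from the output above (rows = actual, columns = predicted).
cm = [[10, 1, 4],
      [5, 6, 5],
      [3, 5, 6]]

def recall_per_class(matrix):
    """Recall for each class: diagonal entry divided by its row sum (TP + FN)."""
    return [matrix[k][k] / sum(matrix[k]) for k in range(len(matrix))]

per_class_recall = recall_per_class(cm)
macro_recall = sum(per_class_recall) / len(per_class_recall)
print(per_class_recall)
print(macro_recall)
```

The per-class values are 10/15, 6/16, and 6/14, so the model finds about two thirds of the actual Setosa flowers, and the unweighted mean is about 0.490, in line with the `average='macro'` result above.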


