Resources used for completion of this task:

    - Coursera. (n.d.). Medical Insurance Premium Prediction with Machine Learning [Online course].
      Retrieved February 10, 2025,
      from https://www.coursera.org/learn/medical-insurance-premium-prediction-with-machine-learning
    - Coursera. (n.d.). Logistic Regression with NumPy and Python [Online course].
      Retrieved February 10, 2025,
      from https://www.coursera.org/learn/logistic-regression-numpy-python
    - Liu, Y. (Hayden). (2020). Python machine learning by example: Unlock machine learning best practices with
      real-world use cases (3rd ed.). Packt Publishing.


In [45]:
# Import the libraries used in this notebook
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler



data = pd.read_csv('iris.csv')
data.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


Encode your dependent variable y such that Iris-setosa is encoded
as 0, and Iris-versicolor and Iris-virginica are both encoded as 1.
Here, 0 corresponds to the Iris-setosa class, and 1 corresponds to
the not-Iris-setosa class.

In [46]:
# Check the unique values in the Species column
data['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [47]:
# Encode Species as 0 (Iris-setosa) or 1 (any other species)
# Equivalent alternative:
# data['Species'] = data['Species'].map({'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 1})
data['Species'] = (data['Species'] != 'Iris-setosa').astype(int)
data['Species'].value_counts(normalize=True)


Species
1    0.666667
0    0.333333
Name: proportion, dtype: float64

In [48]:
# Drop the Id column, which carries no predictive information
data = data.drop(columns=['Id'])
data.head()

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0            5.1           3.5            1.4           0.2        0
1            4.9           3.0            1.4           0.2        0
2            4.7           3.2            1.3           0.2        0
3            4.6           3.1            1.5           0.2        0
4            5.0           3.6            1.4           0.2        0


Split the data into a training and test set.

In [49]:
# Assign the independent variables to X and the dependent variable to y
X = data.drop(columns=['Species']).to_numpy()
y = data['Species'].to_numpy()


In [50]:
# Split the data into 80% training and 20% test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training data:", X_train.shape, y_train.shape)
print("Test data:", X_test.shape, y_test.shape)

Training data: (120, 4) (120,)
Test data: (30, 4) (30,)
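
The split above is a plain random split. With classes in a 1:2 ratio, it can be useful to keep that ratio in both sets. scikit-learn's train_test_split supports this through its stratify parameter. The sketch below only illustrates the idea in plain Python; the function name stratified_split and the label list built from the 50/100 class counts are assumptions for illustration, not part of the notebook.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    # Shuffle within each class, then carve off the test share
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical label vector with the same class counts as the encoded Species column
labels = [0] * 50 + [1] * 100
train_idx, test_idx = stratified_split(labels)
test_class_counts = [sum(1 for i in test_idx if labels[i] == c) for c in (0, 1)]
print(len(train_idx), len(test_idx), test_class_counts)
```

With these counts, the test set holds 10 samples of class 0 and 20 of class 1, mirroring the 1:2 ratio of the full data.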


Use the LogisticRegression class from scikit-learn to build and train a logistic regression model on the training dataset. Then, use the trained model to predict the outcomes on the test dataset.

In [51]:
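# Standardize the features: fit the scaler on the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_standard = scaler.fit_transform(X_train)
X_test_standard = scaler.transform(X_test)

# Fit a logistic regression model and predict on the test set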
model = LogisticRegression()
model.fit(X_train_standard, y_train)
y_pred = model.predict(X_test_standard)

print("Predictions on test data:", y_pred)

Predictions on test data: [1 0 1 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 0 1 1 1 1 1 0 0]
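
To make the two steps above less of a black box, here is a minimal sketch of what standardization and a logistic regression prediction compute. The weights, bias, and feature values are hypothetical and chosen only for illustration; they are not the coefficients learned by the model above. The z-score uses the population standard deviation, which is what StandardScaler does.

```python
import math

def standardize(column):
    """Return z-scores of a list of values, using the population standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in column]

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Compute the linear score, turn it into a probability, and threshold it."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability > threshold)

z_scores = standardize([4.0, 5.0, 6.0])

# Hypothetical model parameters for two standardized features
weights, bias = [1.5, -0.5], 0.2
prob_a, label_a = predict_label([1.0, 0.0], weights, bias)
prob_b, label_b = predict_label([-2.0, 1.0], weights, bias)
print(z_scores, (prob_a, label_a), (prob_b, label_b))
```

The first sample has a positive linear score, so it is assigned label 1; the second has a negative score and is assigned label 0.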


Use scikit-learn to generate a confusion matrix that compares the
predicted labels to the actual labels (gold labels).

In [41]:
# Map numeric labels back to species names for the confusion matrix
classes = ['Iris-setosa', 'Non Iris-setosa']

# Generate the confusion matrix (use encoded labels 0 and 1)
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])

# Convert the confusion matrix to a DataFrame (only include the relevant classes)
cm_df = pd.DataFrame(conf_matrix, columns=classes, index=classes)

# Show the confusion matrix
print("Confusion Matrix:")
print(cm_df)



Confusion Matrix:
                 Iris-setosa  Non Iris-setosa
Iris-setosa               10                0
Non Iris-setosa            0               20
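
For reference, a confusion matrix is just a tally of (actual, predicted) pairs, with rows for actual labels and columns for predicted labels, as in the output above. The sketch below shows that tally in plain Python on a small made-up example; the helper name confusion_counts and the toy labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Count (actual, predicted) pairs: rows are actual labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Toy labels (not the notebook's data): 0 = Iris-setosa, 1 = not Iris-setosa
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]
toy_matrix = confusion_counts(y_true_toy, y_pred_toy)
print(toy_matrix)  # [[2, 1], [1, 2]]
```

Off-diagonal entries are the errors: here one class-0 sample was predicted as 1, and one class-1 sample was predicted as 0.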


Model Performance Analysis

Upon reviewing the confusion matrix, we can analyze the model's precision and recall, treating Iris-setosa as the positive class:

- Precision: Of all samples predicted as Iris-setosa, the fraction that truly are Iris-setosa. Precision is high when the model rarely misclassifies Iris-versicolor or Iris-virginica as Iris-setosa, that is, when it is conservative and predicts Iris-setosa only when confident.

- Recall: Of all true Iris-setosa samples, the fraction the model identifies. Recall is high when the model misses few true Iris-setosa cases. However, this can come at the cost of lower precision if the model is too lenient and classifies non-Iris-setosa samples as Iris-setosa.

Because Iris-setosa is the minority class (one third of the samples), a model biased toward the dominant class (Iris-versicolor and Iris-virginica combined) would tend to show lower recall but higher precision for Iris-setosa: it might miss some true Iris-setosa cases, but its Iris-setosa predictions would usually be correct. That is not what happens here. The confusion matrix shows all 10 Iris-setosa and all 20 non-Iris-setosa test samples classified correctly, so precision and recall are both 1.0 on this test set.
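
The precision/recall trade-off described above can be made concrete with a small worked example. The "conservative" and "lenient" counts below are hypothetical and only illustrate the two failure modes; the "observed" counts are the ones read off the confusion matrix in this notebook, with Iris-setosa as the positive class.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical conservative model: predicts Iris-setosa rarely, misses 3 of 10
conservative = precision_recall(tp=7, fp=0, fn=3)

# Hypothetical lenient model: finds all 10, but also flags 5 non-setosa samples
lenient = precision_recall(tp=10, fp=5, fn=0)

# Counts from the confusion matrix above: no errors in either direction
observed = precision_recall(tp=10, fp=0, fn=0)

print("conservative:", conservative)
print("lenient:", lenient)
print("observed:", observed)
```

The conservative case has perfect precision but recall of 0.7, the lenient case has perfect recall but precision of about 0.67, and the observed case is perfect on both.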


Instead of using sklearn’s built-in function, write your own code to
calculate accuracy, precision, and recall. Once you've calculated these
metrics manually, compare your results with those obtained from
scikit-learn to see if they align.

In [52]:
# Calculate metrics manually
TP = conf_matrix[0, 0]  # True Positives (Iris-setosa correctly predicted as Iris-setosa)
FP = conf_matrix[1, 0]  # False Positives (Non Iris-setosa incorrectly predicted as Iris-setosa)
FN = conf_matrix[0, 1]  # False Negatives (Iris-setosa incorrectly predicted as Non Iris-setosa)
TN = conf_matrix[1, 1]  # True Negatives (Non Iris-setosa correctly predicted as Non Iris-setosa)

# Accuracy calculation
accuracy_manual = (TP + TN) / (TP + FP + FN + TN)

# Precision calculation
precision_manual = TP / (TP + FP) if (TP + FP) > 0 else 0

# Recall calculation
recall_manual = TP / (TP + FN) if (TP + FN) > 0 else 0

# Display the manually calculated metrics
print("Manual Calculations:")
print("Accuracy (manual):", accuracy_manual)
print("Precision (manual):", precision_manual)
print("Recall (manual):", recall_manual)

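# Compare with sklearn's built-in calculations.
# pos_label=0 makes sklearn treat Iris-setosa (encoded as 0) as the positive
# class, matching the manual calculation; the default is pos_label=1.
print("\nComparison with sklearn:")
print("Accuracy (sklearn):", accuracy_score(y_test, y_pred))
print("Precision (sklearn):", precision_score(y_test, y_pred, pos_label=0))
print("Recall (sklearn):", recall_score(y_test, y_pred, pos_label=0))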

Manual Calculations:
Accuracy (manual): 1.0
Precision (manual): 1.0
Recall (manual): 1.0

Comparison with sklearn:
Accuracy (sklearn): 1.0
Precision (sklearn): 1.0
Recall (sklearn): 1.0
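
Both sets of metrics agree here because the classifier makes no errors on the test set. In general, precision and recall depend on which class is treated as positive, and scikit-learn's precision_score and recall_score default to pos_label=1. The sketch below uses made-up labels (not the notebook's data) and a hypothetical helper, metrics_for, to show how the two choices can differ when there is a misclassification.

```python
def metrics_for(y_true, y_pred, positive):
    """Return (precision, recall) treating the given label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy labels with a single error: one class-0 sample predicted as class 1
y_true_toy = [0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 1, 1]

setosa_positive = metrics_for(y_true_toy, y_pred_toy, positive=0)
other_positive = metrics_for(y_true_toy, y_pred_toy, positive=1)
print("class 0 as positive:", setosa_positive)  # (1.0, 0.5)
print("class 1 as positive:", other_positive)   # (0.8, 1.0)
```

A single error gives different numbers under the two conventions, so a manual calculation and a library call only align when they use the same positive class.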
