# __Velon Murugathas__ 8938776
## Lab 6 - Logistic Regression

### Using scikit-learn, train a binary logistic regression model on the Iris dataset. Use all four features and define only 2 labels: virginica and non-virginica.

In [15]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris()                                                                     # Loading the Iris dataset
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

df['target'] = np.where(df['target'] == 2, 'virginica', 'non-virginica')                        # Create binary labels virginica and non-virginica

X = df.drop('target', axis=1)                                                                   # Splitting the data into features X and labels y
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)       # Splitting the data into training and testing sets

scaler = StandardScaler()                                                                       # Standardizing the features (scaler is fitted on the training set only)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(solver='liblinear')                                                  # Creating and training the logistic regression model
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                                                                  # Make predictions on the test data

accuracy = accuracy_score(y_test, y_pred)                                                       # Evaluate the model's accuracy
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 1.00


### Evaluating the failure modes

In [16]:
misclassified_indices = np.where(y_pred != y_test.to_numpy())[0]                            # Positional indices where the predicted labels do not match the actual labels

misclassified_instances = X_test[misclassified_indices]                                     # Extracting the (scaled) feature data of the misclassified instances

misclassified_actual_labels = y_test.iloc[misclassified_indices]                            # Extracting the actual labels of the misclassified instances

misclassified_predicted_labels = y_pred[misclassified_indices]                              # Extracting the predicted labels of the misclassified instances

print("Misclassified Instances:")

misclassified_data = pd.DataFrame({'Actual Label': misclassified_actual_labels, 'Predicted Label': misclassified_predicted_labels})     # Creating a DataFrame to display the actual and predicted labels for misclassified instances

print(misclassified_data)                                                                   # Printing the DataFrame containing the actual and predicted labels of the misclassified instances


Misclassified Instances:
Empty DataFrame
Columns: [Actual Label, Predicted Label]
Index: []


__Explanation__

The test set was checked for instances that the model got wrong, and none were found. The "Misclassified Instances" output is an empty table, which means the model made no mistakes on the 30 test instances. With no errors to inspect, this split exposes no failure modes: the model separated "virginica" from "non-virginica" perfectly on this particular test set.
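
When there are no misclassifications to inspect, one way to still probe for likely failure modes is to look at the test instances whose predicted probability is closest to the 0.5 decision threshold. The following is a minimal sketch, assuming the same split, scaling, and solver as above; the names `X_train_s`, `X_test_s`, `p_virginica`, `margin`, and `closest` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the binary problem with the same settings as the notebook
iris = datasets.load_iris()
X = iris.data
y = np.where(iris.target == 2, 'virginica', 'non-virginica')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(solver='liblinear').fit(X_train_s, y_train)

# Column of predict_proba that corresponds to the 'virginica' class
virginica_col = list(model.classes_).index('virginica')
p_virginica = model.predict_proba(X_test_s)[:, virginica_col]

# Distance from the 0.5 threshold; small values are the "closest calls"
margin = np.abs(p_virginica - 0.5)
closest = np.argsort(margin)[:5]

for i in closest:
    print(f"test row {i}: P(virginica)={p_virginica[i]:.3f}, actual={y_test[i]}")
```

Instances with a small margin are the ones most likely to flip under a different split or a small change in the data, so they are reasonable candidates to examine even when accuracy is 1.00.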

### Shared properties for this case

In [17]:

correctly_classified_indices = np.where(y_pred == y_test.to_numpy())[0]                 # Positional indices of the correctly classified test instances
correctly_classified_instances = X_test[correctly_classified_indices]                   # Scaled feature rows of those instances

if len(correctly_classified_indices) > 0:                                               # To check if there are any correctly classified instances
    mean_correctly_classified = np.mean(correctly_classified_instances, axis=0)         # Calculating the mean and standard deviation of feature values for correctly classified instances
    std_correctly_classified = np.std(correctly_classified_instances, axis=0)
    
    print("Mean Feature Values - Correctly Classified:")
    print(mean_correctly_classified)
    print("\nStandard Deviation of Feature Values - Correctly Classified:")
    print(std_correctly_classified)
else:
    print("No correctly classified instances found.")


Mean Feature Values - Correctly Classified:
[ 0.20824055 -0.04844444  0.08977889  0.10678803]

Standard Deviation of Feature Values - Correctly Classified:
[1.01274741 0.84570921 1.03743674 1.0642496 ]


__Explanation__

Because every test instance was classified correctly, these statistics describe the entire standardized test set rather than a distinctive subset of it. The mean feature values are approximately [0.208, -0.048, 0.090, 0.107], and the standard deviations are approximately [1.013, 0.846, 1.037, 1.064]. Values close to 0 and 1 are expected: `StandardScaler` was fitted on the training set, so a test set drawn from the same data ends up centred near 0 with roughly unit spread. The numbers therefore show that the test set resembles the training set, but because they pool both classes, they do not by themselves reveal which feature characteristics separate "virginica" from "non-virginica".
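
The statistics above are computed on standardized features and pooled over both classes. To see what the two classes look like individually, a minimal sketch, assuming the same `test_size` and `random_state` as above, computes per-class feature means on the test set in the original units (cm). The names `train_df`, `test_df`, and `class_means` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Rebuild the labelled DataFrame with unscaled features
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = np.where(iris.target == 2, 'virginica', 'non-virginica')

# Same split parameters as the notebook, so the same rows land in the test set
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Mean of each feature within each class of the test set
class_means = test_df.groupby('target').mean()
print(class_means.round(2))
```

Comparing the two rows of `class_means` shows which features differ most between the classes, which the pooled means cannot show.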

### Accuracy and Confusion Matrix

In [18]:
accuracy = accuracy_score(y_test, y_pred)                               # Calculating the accuracy
print(f"Accuracy: {accuracy:.2f}")

confusion = confusion_matrix(y_test, y_pred)                            # Creating a confusion matrix
print("Confusion Matrix:")
print(confusion)

Accuracy: 1.00
Confusion Matrix:
[[19  0]
 [ 0 11]]


__Explanation__

The binary logistic regression model achieves an accuracy of 1.00 on the 30-instance test set, meaning every test instance was classified correctly. The confusion matrix confirms this. Its rows are the actual labels and its columns are the predicted labels, both in the order "non-virginica", "virginica". All 19 "non-virginica" instances and all 11 "virginica" instances lie on the diagonal, and both off-diagonal counts are 0, so there are no false positives or false negatives.
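
To make the row and column order of the confusion matrix explicit, the matrix can be wrapped in a labelled DataFrame. This is a minimal sketch, assuming the same split, scaling, and solver as above; `label_order`, `cm`, and `cm_df` are illustrative names, and passing `labels=` simply pins the order rather than relying on the default sorted order.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the binary problem with the same settings as the notebook
iris = datasets.load_iris()
X = iris.data
y = np.where(iris.target == 2, 'virginica', 'non-virginica')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(solver='liblinear').fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

# Fix the label order explicitly so rows and columns are unambiguous
label_order = ['non-virginica', 'virginica']
cm = confusion_matrix(y_test, y_pred, labels=label_order)

# Rows are actual labels, columns are predicted labels
cm_df = pd.DataFrame(
    cm,
    index=[f"actual {name}" for name in label_order],
    columns=[f"predicted {name}" for name in label_order],
)
print(cm_df)
```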