
In [51]:
import os
import numpy as np
import pandas as pd
import kagglehub
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.model_selection import train_test_split

In [52]:
# Download and load dataset
path = kagglehub.dataset_download("himanshunakrani/iris-dataset")
data_df = pd.read_csv(os.path.join(path, "iris.csv"))
data_df.head()
x = data_df['species'].unique()
print(x)

['setosa' 'versicolor' 'virginica']


In [53]:
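# Implement a train test split
# (your code here)
# Separate the feature columns from the target column
features = data_df.drop(columns=['species'])
target = data_df['species']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)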


In [54]:
# Implement the following functions
def accuracy(y_true, y_pred):
    # (your code here)
    # Ratio of correct predictions to total predictions.
    # Summing the booleans counts how many comparisons are True.
    correct = sum(y_t == y_p for y_t, y_p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred):
    # For multiclass, calculate precision for each class and average
    classes = np.unique(y_true)
    precisions = []

    for cls in classes:
        true_positives = sum((y_t == cls and y_p == cls) for y_t, y_p in zip(y_true, y_pred))
        predicted_positives = sum(y_p == cls for y_p in y_pred)
        precision_cls = true_positives / predicted_positives if predicted_positives != 0 else 0
        precisions.append(precision_cls)

    return np.mean(precisions)

def recall(y_true, y_pred):
    # For multiclass, calculate recall for each class and average
    classes = np.unique(y_true)
    recalls = []

    for cls in classes:
        true_positives = sum((y_t == cls and y_p == cls) for y_t, y_p in zip(y_true, y_pred))
        actual_positives = sum(y_t == cls for y_t in y_true)
        recall_cls = true_positives / actual_positives if actual_positives != 0 else 0
        recalls.append(recall_cls)

    return np.mean(recalls)

def f1_score(y_true, y_pred):
    # Harmonic mean of the macro-averaged precision and recall.
    # Note: this is not the same as averaging per-class F1 scores (macro F1).
    prec = precision(y_true, y_pred)
    rec = recall(y_true, y_pred)
    return 2 * (prec * rec) / (prec + rec) if (prec + rec) != 0 else 0

def performance_metrics(y_true, y_pred):
    # A single function to print all performance metrics
    # (your code here)
    print("Accuracy:", accuracy(y_true, y_pred))
    print("Precision:", precision(y_true, y_pred))
    print("Recall:", recall(y_true, y_pred))
    print("F1 Score:", f1_score(y_true, y_pred))
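
As a quick sanity check of the macro-averaged definitions above, the following sketch works through a small hand-made example in plain Python. The toy labels and the helper `per_class_counts` are hypothetical, invented for illustration, and are not part of the exercise. It also shows that the harmonic mean of macro precision and macro recall (what `f1_score` above returns) need not equal the mean of the per-class F1 scores.

```python
# Toy labels, invented for illustration only.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

def per_class_counts(y_true, y_pred, cls):
    """Return (true positives, predicted positives, actual positives) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    predicted = sum(p == cls for _, p in pairs)
    actual = sum(t == cls for t, _ in pairs)
    return tp, predicted, actual

classes = sorted(set(y_true))
precisions, recalls, f1s = [], [], []
for cls in classes:
    tp, predicted, actual = per_class_counts(y_true, y_pred, cls)
    p = tp / predicted if predicted else 0.0
    r = tp / actual if actual else 0.0
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)

# Macro averages: every class counts equally, regardless of its size.
macro_precision = sum(precisions) / len(classes)
macro_recall = sum(recalls) / len(classes)

# Two different ways to summarise F1 for a multiclass problem.
f1_from_macro = 2 * macro_precision * macro_recall / (macro_precision + macro_recall)
macro_f1 = sum(f1s) / len(classes)

print("Macro precision:", macro_precision)
print("Macro recall:", macro_recall)
print("F1 from macro precision/recall:", f1_from_macro)
print("Mean of per-class F1:", macro_f1)
```

On this toy input the two F1 summaries differ, so it is worth stating which one is being reported when comparing against another library's numbers.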

In [88]:
# Train a logistic regression model and print all performance metrics on the training set
# (your code here)
# Split the data into training and testing sets
X = data_df.drop(columns=['species'])  # Assuming 'species' is the target column
y = data_df['species']

# Use train_test_split to split data (though only training data will be evaluated here)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
model = LogisticRegression(max_iter=200)  # max_iter is set higher to ensure convergence
model.fit(X_train, y_train)

# Make predictions on the training set
y_pred_train = model.predict(X_train)

# Print all performance metrics for the training set using the functions you defined
print("Performance Metrics on the Training Set:")
performance_metrics(y_train, y_pred_train)  # Test


Performance Metrics on the Training Set:
Accuracy: 0.975
Precision: 0.9761904761904763
Recall: 0.975609756097561
F1 Score: 0.9759000297530497


In [98]:
# Implement k-fold cross validation
x = data_df.drop(columns = ['species'])
y = data_df['species']

def k_fold_cross_validation(X, y, k):
    # Rows are split in file order without shuffling, so on a dataset sorted
    # by class a validation fold may contain classes missing from training.
    # Integer division: any remainder rows are never used for validation.
    fold_size = len(X) // k

    # Manually splitting data into k folds
    for i in range(k):
        # Creating validation and training sets for each fold
        # (.iloc makes the positional slicing explicit)
        start = i * fold_size
        end = start + fold_size
        X_val = X.iloc[start:end]
        y_val = y.iloc[start:end]

        # Combine the other folds for training
        X_train = pd.concat([X.iloc[:start], X.iloc[end:]], axis=0)
        y_train = pd.concat([y.iloc[:start], y.iloc[end:]], axis=0)

        # Initialize and train the logistic regression model
        model = LogisticRegression(max_iter=200)
        model.fit(X_train, y_train)

        # Make predictions on both the training and validation sets
        y_pred_train = model.predict(X_train)
        y_pred_val = model.predict(X_val)

        # Print performance metrics for both training and validation sets
        print(f"Fold {i + 1} Performance Metrics:")

        print("Training Set Metrics:")
        performance_metrics(y_train, y_pred_train)  # Training set metrics

        print("Validation Set Metrics:")
        performance_metrics(y_val, y_pred_val)  # Validation set metrics
        print()  # Blank line for readability
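
One detail of the function above: because the fold size is computed as `len(X) // k`, any remainder rows are never used for validation when `len(X)` is not divisible by `k` (for example, with 150 rows and `k=4`, the last 2 rows always stay in the training set). A minimal sketch of one way to spread the remainder across the first folds is shown below. `fold_bounds` is a hypothetical helper, not part of the exercise.

```python
def fold_bounds(n_samples, k):
    """Return k (start, end) index pairs that cover range(n_samples) exactly once."""
    base, extra = divmod(n_samples, k)
    bounds = []
    start = 0
    for i in range(k):
        # The first `extra` folds each take one additional row.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

print(fold_bounds(150, 4))
```

Each `(start, end)` pair could then replace the `start` and `end` values computed inside the loop, so that every row lands in exactly one validation fold.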




In [111]:
# Train a logistic regression model and print all performance metrics on the training set AND validation set for each fold
k_fold_cross_validation(x, y, k=10)

Fold 1 Performance Metrics:
Training Set Metrics:
Accuracy: 0.9703703703703703
Precision: 0.9738247863247862
Recall: 0.9733333333333333
F1 Score: 0.9735789978089187
Validation Set Metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0

Fold 2 Performance Metrics:
Training Set Metrics:
Accuracy: 0.9703703703703703
Precision: 0.9738247863247862
Recall: 0.9733333333333333
F1 Score: 0.9735789978089187
Validation Set Metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0

Fold 3 Performance Metrics:
Training Set Metrics:
Accuracy: 0.9703703703703703
Precision: 0.9738247863247862
Recall: 0.9733333333333333
F1 Score: 0.9735789978089187
Validation Set Metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0

Fold 4 Performance Metrics:
Training Set Metrics:
Accuracy: 0.9703703703703703
Precision: 0.9719973009446695
Recall: 0.9683333333333334
F1 Score: 0.9701618577650114
Validation Set Metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0

Fold 5 Performance M

In [109]:
# Try different values of K. How do the performance metrics vary based on different values of K?

# How do you interpret the observations from the previous question?


Training Set Class Distribution:
species
versicolor    41
setosa        40
virginica     39
Name: count, dtype: int64
Validation Set Class Distribution:


NameError: name 'y_val' is not defined

At k = 3 the model does not perform well: the validation accuracy, precision, recall, and F1 score all drop to 0. This could be because there is not enough data to make an accurate prediction. A more likely cause is that the rows are not shuffled before splitting: if the file is ordered by species, each of the three validation folds holds a single species that is absent from the training folds, so the model can never predict it.

At k = 5 there is little variation in accuracy and precision across folds, on either the training or the validation set, which leads me to believe the model is performing well and that this may be a suitable number of folds for this dataset.

At k = 10 there is not much variation either. However, many of the validation metrics come out at 100%, which makes me skeptical of the results. This could be because the model is overfitting and not learning the general patterns that would apply to new data, although with only 15 rows per validation fold, perfect scores are also easier to reach.
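
One way to investigate the k = 3 result is to look at which species end up in each validation fold. The sketch below rebuilds only the label column, under the assumption that the file lists 50 rows per species in sorted order (an assumption about the CSV, not something verified here), and compares contiguous folds with a simple stratified assignment. `contiguous_folds`, `stratified_folds`, and `class_counts` are hypothetical helpers written for this illustration.

```python
import random
from collections import Counter

# Assumed layout: 50 rows per species, sorted by species (not verified against the CSV).
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

def contiguous_folds(n_samples, k):
    """Validation index lists built like slicing start:end without shuffling."""
    fold_size = n_samples // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]

def stratified_folds(labels, k, seed=42):
    """Shuffle indices within each class, then deal them round-robin into k folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def class_counts(folds, labels):
    """Count the species present in each fold."""
    return [Counter(labels[i] for i in fold) for fold in folds]

contiguous = class_counts(contiguous_folds(len(labels), 3), labels)
stratified = class_counts(stratified_folds(labels, 3), labels)

for name, counts in (("contiguous", contiguous), ("stratified", stratified)):
    for fold_number, fold_counts in enumerate(counts, start=1):
        print(name, "fold", fold_number, dict(fold_counts))
```

Under that assumption, every contiguous fold contains a single species, so the model is validated on a class it never saw during training, while the stratified folds each contain all three. scikit-learn's `StratifiedKFold` offers this kind of split directly.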

