<a href="https://colab.research.google.com/github/achilela/streamlit-agent/blob/main/CMM560_Experiment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Workflow**

Set the Google Colab runtime type to GPU, then confirm that a GPU is available with the `!nvidia-smi` command.

1. Load the SURFACE dataset for classification and split it into train and test sets
   * Define paths to the datasets, including the path to the tumor and
   * Display one random image from each of the training and test sets
   * Convert class names (surface-positive, surface-negative) to numerical labels
   * Display two images with their class names and integer labels

2. Load an image processor to process images
   * Apply data normalization
   * Convert images into tensors

3. Apply data augmentation strategies
   * Write a function to apply random flip/rotate augmentations to the training, validation, and test data

4. Choose any two of the non-neural network classifiers from the list below:

   * **Support Vector Machines (SVM)**: SVMs find a hyperplane that best separates data points into different classes. They work well for both linear and non-linear problems.
   * **Decision Trees and Random Forests**: Decision trees split data based on feature values to create a tree-like structure. Random forests combine multiple decision trees to improve accuracy and reduce overfitting.
   * **Naive Bayes Classifier (NB)**: NB is based on Bayes' theorem and assumes that features are conditionally independent. It's particularly useful for text classification and spam filtering.
   * **K-Nearest Neighbors (KNN)**: KNN classifies data points based on the majority class of their nearest neighbors. It's simple but effective for both regression and classification tasks.
   * **Ensemble methods (e.g., AdaBoost, Gradient Boosting)**: These combine multiple base classifiers to create a stronger model. AdaBoost and Gradient Boosting are popular ensemble techniques.

5. Implement classification accuracy metrics
   * Write a function that computes model accuracy from the predictions and ground-truth labels and reports precision, recall, F1-score, runtime, ROC, and a confusion matrix
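
Step 5 could be sketched roughly as follows. This is a minimal illustration, not the notebook's own implementation: the function name `evaluate_classifier`, the `positive_label` argument, the label names, and the synthetic two-cluster data are all assumptions made for the example. ROC is summarized here as the area under the curve, and the runtime reported is prediction time only.

```python
import time

import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)
from sklearn.naive_bayes import GaussianNB


def evaluate_classifier(model, X_test, y_test, positive_label):
    """Score a fitted binary classifier and time its predictions (hypothetical helper)."""
    start = time.perf_counter()
    y_pred = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, pos_label=positive_label, average="binary", zero_division=0
    )

    # ROC AUC needs a continuous score: use class probabilities when the model
    # exposes them, otherwise fall back to the decision function (e.g. a plain SVC).
    if hasattr(model, "predict_proba"):
        column = list(model.classes_).index(positive_label)
        scores = model.predict_proba(X_test)[:, column]
    else:
        scores = model.decision_function(X_test)
        # decision_function is oriented toward classes_[1]; flip it if needed.
        if model.classes_[1] != positive_label:
            scores = -scores
    y_true_binary = (np.asarray(y_test) == positive_label).astype(int)

    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc_score(y_true_binary, scores),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=model.classes_),
        "predict_seconds": predict_seconds,
    }


# Synthetic, well-separated clusters stand in for image features (assumption).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-5, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y_demo = np.array(["negative"] * 20 + ["positive"] * 20)

demo_model = GaussianNB().fit(X_demo, y_demo)
report = evaluate_classifier(demo_model, X_demo, y_demo, positive_label="positive")
for name, value in report.items():
    print(name, value, sep=": ")
```

Returning a dictionary keeps the results easy to collect into one table when several classifiers are compared.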


In [None]:
!nvidia-smi

Tue Jul 23 16:31:05 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install datasets

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive



1.   Define paths to the datasets, including the paths to the SURFACE and UNDERWATER datasets

2.   Display one random image from each of the training and test sets

3.   Convert class names (positive, negative) to numerical labels

4.   Display two images with their class names and integer labels
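
The class-name conversion in item 3 can be illustrated with a small mapping. This is a sketch under assumptions: the names `label_to_id`, `id_to_label`, and `encode_labels` are hypothetical, and the class names are taken to be exactly `negative` and `positive`.

```python
# Assumed class names; in the notebook they come from the sub-folder names.
class_names = ["positive", "negative"]

# Sort first so the mapping does not depend on directory listing order.
label_to_id = {name: idx for idx, name in enumerate(sorted(class_names))}
id_to_label = {idx: name for name, idx in label_to_id.items()}


def encode_labels(labels):
    """Map a sequence of class-name strings to integer labels."""
    return [label_to_id[label] for label in labels]


sample_labels = ["positive", "negative", "positive"]
encoded = encode_labels(sample_labels)
print(encoded)
print([id_to_label[i] for i in encoded])
```

Keeping the inverse mapping makes it straightforward to show both the class name and the integer label when displaying images.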


2. Choose at least three non-neural classifiers from the list above and, for each classifier:
   * write code to classify the images as Positive or Negative
   * provide benchmark code that reports the results of all three chosen models using precision, recall, and F1-score
   * reflect on which model performs best, and justify its cost-effectiveness for this case and for possible implementations




In [None]:

import os
import numpy as np
import cv2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score

# Define the dataset paths
train_dir = '/content/drive/MyDrive/data/Surface/train'
test_dir = '/content/drive/MyDrive/data/Surface/test'

# Function to load images and labels
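def load_images_from_folder(folder):
    """Load every readable image under folder/<class_name>/ as a flattened grayscale vector."""
    images = []
    labels = []
    # Sort directory listings so the sample order is reproducible across runs
    for class_name in sorted(os.listdir(folder)):
        class_folder = os.path.join(folder, class_name)
        if os.path.isdir(class_folder):
            for filename in sorted(os.listdir(class_folder)):
                img_path = os.path.join(class_folder, filename)
                img = cv2.imread(img_path)  # Returns None for unreadable or non-image files
                if img is not None:
                    img = cv2.resize(img, (256, 256))  # Resize image to 256x256
                    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # Convert to grayscale
                    images.append(img.flatten())  # Flatten to a 65536-element vector
                    labels.append(class_name)  # The folder name is the class label
    return np.array(images), np.array(labels)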

# Load training and test data
train_images, train_labels = load_images_from_folder(train_dir)
test_images, test_labels = load_images_from_folder(test_dir)

# Split the training data into a training and validation set
X_train, X_val, y_train, y_val = train_test_split(train_images, train_labels, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
test_images = scaler.transform(test_images)

# Train and evaluate SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
val_predictions_svm = svm_classifier.predict(X_val)
test_predictions_svm = svm_classifier.predict(test_images)
print("SVM Validation Accuracy:", accuracy_score(y_val, val_predictions_svm))
print("SVM Validation Classification Report:\n", classification_report(y_val, val_predictions_svm))
print("SVM Test Accuracy:", accuracy_score(test_labels, test_predictions_svm))
print("SVM Test Classification Report:\n", classification_report(test_labels, test_predictions_svm))

# Train and evaluate RandomForest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
val_predictions_rf = rf_classifier.predict(X_val)
test_predictions_rf = rf_classifier.predict(test_images)
print("RandomForest Validation Accuracy:", accuracy_score(y_val, val_predictions_rf))
print("RandomForest Validation Classification Report:\n", classification_report(y_val, val_predictions_rf))
print("RandomForest Test Accuracy:", accuracy_score(test_labels, test_predictions_rf))
print("RandomForest Test Classification Report:\n", classification_report(test_labels, test_predictions_rf))

# Train and evaluate NaiveBayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
val_predictions_nb = nb_classifier.predict(X_val)
test_predictions_nb = nb_classifier.predict(test_images)
print("NaiveBayes Validation Accuracy:", accuracy_score(y_val, val_predictions_nb))
print("NaiveBayes Validation Classification Report:\n", classification_report(y_val, val_predictions_nb))
print("NaiveBayes Test Accuracy:", accuracy_score(test_labels, test_predictions_nb))
print("NaiveBayes Test Classification Report:\n", classification_report(test_labels, test_predictions_nb))
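
The workflow lists random flip/rotate augmentation, which the cell above does not apply. A minimal NumPy sketch is shown here, assuming flattened square grayscale images like those produced by `load_images_from_folder`; the function names `augment_image` and `augment_flattened` and the tiny demo array are hypothetical.

```python
import numpy as np


def augment_image(image, rng):
    """Randomly flip and rotate one 2-D grayscale image."""
    if rng.random() < 0.5:
        image = np.fliplr(image)  # Horizontal flip
    if rng.random() < 0.5:
        image = np.flipud(image)  # Vertical flip
    # Rotate by a random multiple of 90 degrees; square images keep their shape.
    image = np.rot90(image, k=int(rng.integers(0, 4)))
    return np.ascontiguousarray(image)


def augment_flattened(images, side, seed=0):
    """Augment a batch of flattened square images with shape (n, side * side)."""
    rng = np.random.default_rng(seed)
    augmented = [
        augment_image(row.reshape(side, side), rng).flatten() for row in images
    ]
    return np.array(augmented)


# Two tiny 4x4 "images" stand in for the 256x256 inputs (assumption).
demo = np.arange(2 * 16, dtype=np.uint8).reshape(2, 16)
augmented = augment_flattened(demo, side=4)
print(augmented.shape)
```

With the notebook's data, a call such as `augment_flattened(X_train, side=256)` would need to run before `StandardScaler` is applied, since the scaler expects pixel positions to mean the same thing across samples.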

In [25]:
from sklearn.model_selection import cross_val_score

svm_classifier = SVC(kernel='linear')
svm_scores = cross_val_score(svm_classifier, train_images, train_labels, cv=5)
print("SVM 5-Fold Cross-Validation Scores:", svm_scores)
print("SVM 5-Fold Cross-Validation Mean Accuracy:", np.mean(svm_scores))

SVM 5-Fold Cross-Validation Scores: [0.84745763 0.85875706 0.84745763 0.8700565  0.83050847]
SVM 5-Fold Cross-Validation Mean Accuracy: 0.8508474576271187


In [26]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf_classifier, train_images, train_labels, cv=5)
print("RandomForest 5-Fold Cross-Validation Scores:", rf_scores)
print("RandomForest 5-Fold Cross-Validation Mean Accuracy:", np.mean(rf_scores))

RandomForest 5-Fold Cross-Validation Scores: [0.88700565 0.8700565  0.88700565 0.89265537 0.89830508]
RandomForest 5-Fold Cross-Validation Mean Accuracy: 0.8870056497175142


In [27]:
nb_classifier = GaussianNB()
nb_scores = cross_val_score(nb_classifier, train_images, train_labels, cv=5)
print("NaiveBayes 5-Fold Cross-Validation Scores:", nb_scores)
print("NaiveBayes 5-Fold Cross-Validation Mean Accuracy:", np.mean(nb_scores))

NaiveBayes 5-Fold Cross-Validation Scores: [0.82485876 0.73446328 0.64971751 0.73446328 0.61581921]
NaiveBayes 5-Fold Cross-Validation Mean Accuracy: 0.711864406779661
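
The cross-validation cells above pass `train_images` as loaded, without the `StandardScaler` step used in the earlier cell. One way to keep scaling inside each fold is to wrap the scaler and classifier in a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the image features, so its scores are not comparable to the results reported above; the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (train_images, train_labels).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=42
)

# The scaler is refit on the training part of every fold, so no statistics
# from the held-out fold leak into the model.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

The same pattern should apply to `RandomForestClassifier` and `GaussianNB` by swapping the final pipeline step.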
