<a href="https://colab.research.google.com/github/WDSEatBNL/Intro-to-Machine-Learning-and-AI-Code/blob/master/Machine_Learning_SciKit_larger_set.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import the Python libraries used in this notebook.

These include `skimage` for image loading, color conversion, and HOG feature extraction, `matplotlib.pyplot` for plotting, `numpy` for numerical operations, `joblib` for saving/loading Python objects, `os` for interacting with the operating system, `collections.Counter` for counting labels, `sklearn` modules for machine learning (`SGDClassifier`, `StandardScaler`, `confusion_matrix`, and the `BaseEstimator`/`TransformerMixin` base classes), and `seaborn` for the confusion-matrix heatmap.

In [None]:
import skimage.color  # binds `skimage` and loads the color submodule used by skimage.color.rgb2gray
import matplotlib.pyplot as plt
import numpy as np
import joblib
import os
from collections import Counter
from skimage import io
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from skimage.feature import hog
from sklearn.metrics import confusion_matrix
import seaborn as sns

Use the `!git clone` command to download a GitHub repository named 'Intro-to-Machine-Learning-and-AI-Files'. This repository is expected to contain the image datasets used for training and testing the machine learning model.

In [None]:
!git clone https://github.com/WDSEatBNL/Intro-to-Machine-Learning-and-AI-Files

Define a function `load_and_preprocess_images` that reads image files, extracts their labels (based on subfolder names), and stores the image data, filenames, and labels into a dictionary. It then saves this dictionary to a `.pkl` file using `joblib`.

The function is used twice: once for the 'bigdata' (training) set and once for the 'test' set.

Finally, the cell prints the number of images in each dataset and the per-label counts for the training set, then builds the `X_train`, `y_train`, `X_test`, and `y_test` NumPy arrays.
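
The folder-name-as-label convention can be tried out without the real dataset. The following is a minimal sketch, assuming a throwaway directory tree with made-up file names; `labels_from_folders` is a hypothetical helper, not part of the notebook.

```python
import tempfile
from collections import Counter
from pathlib import Path


def labels_from_folders(base_dir):
    """Return one label per file, taken from the name of the file's subfolder."""
    labels = []
    for class_dir in sorted(Path(base_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files at the top level
        labels.extend(class_dir.name for _ in sorted(class_dir.iterdir()))
    return labels


# Build a tiny temporary tree (hypothetical file names) to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    for label, n_files in [('bird', 2), ('cat', 3), ('dog', 1)]:
        class_dir = Path(tmp) / label
        class_dir.mkdir()
        for i in range(n_files):
            (class_dir / f'{label}_{i}.jpg').touch()
    (Path(tmp) / 'README.txt').touch()  # top-level file that should be skipped
    demo_labels = labels_from_folders(tmp)

demo_counts = Counter(demo_labels)
print(demo_counts)
```

Sorting the directory listings is a choice made here for reproducible ordering; `os.listdir` by itself returns entries in arbitrary order.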

In [None]:
def load_and_preprocess_images(base_dir, pkl_filename):
    data = dict()
    data['label'] = []
    data['filename'] = []
    data['data'] = []

    for subdir in os.listdir(base_dir):
        current_path = os.path.join(base_dir, subdir)
        if not os.path.isdir(current_path):
            continue

        for filename in os.listdir(current_path):
            filepath = os.path.join(current_path, filename)
            image = io.imread(filepath)
            data['label'].append(subdir)
            data['filename'].append(filename)
            data['data'].append(image)

    joblib.dump(data, pkl_filename)
    return data

pklname = "bigdata.pkl"
pklname_test = "testdata.pkl"

data_path = r'./Intro-to-Machine-Learning-and-AI-Files/bigdata'
data = load_and_preprocess_images(data_path, pklname)

test_path = r'./Intro-to-Machine-Learning-and-AI-Files/test'
test_data = load_and_preprocess_images(test_path, pklname_test)

base_name = 'bigdata'
width = 288

print('number of training images: ', len(data['data']))
print(Counter(data['label']))
print('number of test images: ', len(test_data['data']))

labels = np.unique(data['label'])

X_train = np.array(data['data'])
y_train = np.array(data['label'])
X_test = np.array(test_data['data'])
y_test = np.array(test_data['label'])
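
Because the dictionaries are written to `.pkl` files, a later session could reload them instead of re-reading every image. Here is a small sketch of that round trip, assuming a made-up two-image dictionary and a temporary file rather than the real `bigdata.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in with the same keys as the notebook's dictionary.
demo = {
    'label': ['bird', 'cat'],
    'filename': ['a.jpg', 'b.jpg'],
    'data': [np.zeros((4, 4, 3), dtype=np.uint8),
             np.ones((4, 4, 3), dtype=np.uint8)],
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.pkl')
    joblib.dump(demo, demo_path)       # same call the loader function uses
    restored = joblib.load(demo_path)  # reads the dictionary back into memory

# Check that labels and pixel data survived the round trip unchanged.
roundtrip_ok = (
    restored['label'] == demo['label']
    and all(np.array_equal(a, b)
            for a, b in zip(restored['data'], demo['data']))
)
print(roundtrip_ok)
```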

Define two custom `Transformer` classes: `RGB2GrayTransformer` converts RGB images to grayscale, and `HogTransformer` extracts Histogram of Oriented Gradients (HOG) features from images.

These transformers are then instantiated along with a `StandardScaler`. The training and test image data (`X_train`, `X_test`) are preprocessed sequentially: first converted to grayscale, then HOG features are extracted, and finally the features are scaled. The scaler is fitted on the training features only and then reused, unchanged, on the test features.

An `SGDClassifier` model is then initialized and trained on the preprocessed training data (`X_train_prepared`).
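
To get a feel for how many features HOG produces per image, the arithmetic can be sketched in plain Python. This assumes the flattened output has one histogram per cell per block, with blocks sliding one cell at a time; `hog_feature_length` is a hypothetical helper and the 56×56 image size is an illustrative assumption, not the size of the dataset's images.

```python
def hog_feature_length(image_shape, orientations=9,
                       pixels_per_cell=(14, 14), cells_per_block=(2, 2)):
    """Estimate the length of a flattened HOG descriptor for one grayscale image."""
    # Whole cells that fit along each axis (leftover pixels are dropped).
    n_cells_row = image_shape[0] // pixels_per_cell[0]
    n_cells_col = image_shape[1] // pixels_per_cell[1]
    # Blocks slide over the cell grid one cell at a time.
    n_blocks_row = n_cells_row - cells_per_block[0] + 1
    n_blocks_col = n_cells_col - cells_per_block[1] + 1
    # Each block contributes one orientation histogram per cell it contains.
    return (n_blocks_row * n_blocks_col
            * cells_per_block[0] * cells_per_block[1] * orientations)


demo_length = hog_feature_length((56, 56))
print(demo_length)  # 3 * 3 blocks * 2 * 2 cells * 9 orientations = 324
```

Larger `pixels_per_cell` values shrink the feature vector quickly, which is one reason the setting affects both training time and accuracy.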

In [None]:
class RGB2GrayTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return np.array([skimage.color.rgb2gray(img) for img in X])

class HogTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, y=None, orientations=9,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(3, 3), block_norm='L2-Hys'):
        self.y = y
        self.orientations = orientations
        self.pixels_per_cell = pixels_per_cell
        self.cells_per_block = cells_per_block
        self.block_norm = block_norm

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):

        def local_hog(X):
            return hog(X,
                       orientations=self.orientations,
                       pixels_per_cell=self.pixels_per_cell,
                       cells_per_block=self.cells_per_block,
                       block_norm=self.block_norm)

        # Compute one HOG feature vector per image and stack them row-wise.
        return np.array([local_hog(img) for img in X])

grayify = RGB2GrayTransformer()
hogify = HogTransformer(pixels_per_cell=(14, 14), cells_per_block=(2,2), orientations=9, block_norm='L2-Hys')
scalify = StandardScaler()

X_train_gray = grayify.fit_transform(X_train)
X_train_hog = hogify.fit_transform(X_train_gray)
X_train_prepared = scalify.fit_transform(X_train_hog)

sgd_clf = SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)
sgd_clf.fit(X_train_prepared, y_train)

X_test_gray = grayify.transform(X_test)
X_test_hog = hogify.transform(X_test_gray)
X_test_prepared = scalify.transform(X_test_hog)
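
The same fit-on-train, transform-on-test sequence could also be expressed with scikit-learn's `Pipeline`, which keeps the steps together in one object. The sketch below is an illustration on synthetic data, not the notebook's method: the random 8×8 "images", the `flatten_images` step, and the class names `'a'` and `'b'` are all assumptions standing in for the grayscale and HOG steps.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two classes of 8x8 "images" that differ in mean brightness.
X_demo = np.concatenate([rng.normal(0.0, 1.0, size=(20, 8, 8)),
                         rng.normal(3.0, 1.0, size=(20, 8, 8))])
y_demo = np.array(['a'] * 20 + ['b'] * 20)


def flatten_images(X):
    """Turn a stack of 2-D images into one feature row per image."""
    return X.reshape(len(X), -1)


demo_pipeline = Pipeline([
    ('flatten', FunctionTransformer(flatten_images)),  # placeholder feature step
    ('scale', StandardScaler()),
    ('classify', SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)),
])

# fit() runs fit_transform on each preprocessing step, then fits the classifier.
demo_pipeline.fit(X_demo, y_demo)
demo_accuracy = demo_pipeline.score(X_demo, y_demo)  # accuracy on the training data
print(demo_accuracy)
```

With this pattern, calling `predict` on new images applies every preprocessing step automatically, so the scaler cannot accidentally be refitted on test data.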

Use the trained `SGDClassifier` (`sgd_clf`) to make predictions on the preprocessed test data (`X_test_prepared`).

Then print the true labels (`y_test`), a boolean array indicating whether each prediction matches the true label, and the array of predicted labels (`y_pred`).

In [None]:
y_pred = sgd_clf.predict(X_test_prepared)
print(y_test)
print(np.array(y_pred == y_test))
print(y_pred)

Show up to the first 30 test images, each titled with its predicted category.

In [None]:
fig = plt.figure(figsize=(10, 10))

for i in range(len(y_pred)):
    if i >= 30:
        break
    ax = plt.subplot(6, 5, i + 1)
    plt.imshow(X_test[i])
    plt.title(y_pred[i])
    plt.axis("off")
plt.tight_layout()
plt.show()

Print the model's accuracy as the percentage of test images classified correctly.

In [None]:
print('Percentage correct: ', 100*np.sum(y_pred == y_test)/len(y_test))
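
The percentage above is the fraction of matching labels multiplied by 100, which is what `sklearn.metrics.accuracy_score` reports as a fraction. A small sketch with made-up labels (not results from this model) shows the two agree:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for four images; one of the four predictions is wrong.
demo_true = np.array(['bird', 'cat', 'dog', 'cat'])
demo_pred = np.array(['bird', 'dog', 'dog', 'cat'])

percent_manual = 100 * np.mean(demo_pred == demo_true)         # mean of booleans
percent_sklearn = 100 * accuracy_score(demo_true, demo_pred)   # library equivalent
print(percent_manual, percent_sklearn)
```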

Generate and display a confusion matrix to evaluate the performance of the classification model in more detail.

Use `sklearn.metrics.confusion_matrix` to compute the matrix, normalize it so that each cell is a proportion of all test samples (`normalize='all'`), and then visualize it as a heatmap using `seaborn.heatmap`.

The heatmap plots true labels (rows) against predicted labels (columns) for each class ('bird', 'cat', 'dog'), helping to identify which classes the model confuses.

In [None]:
# Pass labels explicitly so the matrix rows/columns match the heatmap tick labels.
cm = confusion_matrix(y_test, y_pred, labels=labels, normalize='all')
cm_rounded = np.around(cm, decimals=2)
fig, ax_cm = plt.subplots(figsize=(8, 6))
sns.heatmap(cm_rounded, annot=True, annot_kws={"size": 20}, cbar=False, cmap='Blues', xticklabels=labels, yticklabels=labels, ax=ax_cm)
ax_cm.set_ylabel('True Values', fontsize=20)
ax_cm.set_xlabel('Predicted Values', fontsize=20)
ax_cm.set_title('Confusion Matrix', fontsize=20)
ax_cm.tick_params(axis='x', labelsize=20)
ax_cm.tick_params(axis='y', labelsize=20)
plt.tight_layout()
plt.show()
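
The choice of normalization changes how the cells should be read. The sketch below uses six made-up labels (not results from this model) to contrast `normalize='all'`, where every cell is a share of all samples, with `normalize='true'`, where each row sums to 1 and the diagonal is the per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: two images per class.
demo_true = ['bird', 'bird', 'cat', 'cat', 'dog', 'dog']
demo_pred = ['bird', 'cat', 'cat', 'cat', 'dog', 'bird']
demo_labels = ['bird', 'cat', 'dog']

# Every cell divided by the total number of samples; the whole matrix sums to 1.
cm_all = confusion_matrix(demo_true, demo_pred,
                          labels=demo_labels, normalize='all')

# Every cell divided by its row total; each row (true class) sums to 1.
cm_true = confusion_matrix(demo_true, demo_pred,
                           labels=demo_labels, normalize='true')

print(np.around(cm_all, 2))
print(np.around(cm_true, 2))
```

With unbalanced classes, row-wise normalization can be easier to interpret, because a rare class is not visually dwarfed by a common one.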