# This Jupyter notebook is 1 of 5 in a series on building an AI model that detects skin cancer and deploying that model through a web application

# *Online Dermatologists:* 📱 🌐 Diagnosing Skin Cancer through a Web Application

1 in 3 cancer patients have a form of skin cancer, making it the most prevalent form of cancer in the world. In the US alone, over 9,000 patients are diagnosed on a daily basis. Skin cancer hits rural and impoverished communities especially hard, as without access to professional healthcare workers and equipment, many cases go undetected, and proper care can't be administered in time. Let's try to help out by creating a web app that anyone can access through a phone or laptop to have those suspicious moles checked out with ML!

In this project, we will be diagnosing skin lesion images for signs of skin cancer. To perform this task, we'll be working with an array of machine learning methods and models. We'll also be developing a web app to deploy our machine learning models! From there, we'll employ some unsupervised ML techniques for data visualization and perform skin cancer image segmentation in addition to classification!

The general outline for this project is as follows:
*   Notebook 1: Exploring Skin Cancer data and developing basic ML models with Computer Vision
*   Notebooks 2 and 3: Developing more advanced ML models and deploying ML to a web app
*   Notebook 4: Checking for bias in ML models performing skin cancer diagnosis
*   Notebook 5: Exploring more advanced ML methods for skin cancer diagnosis and lesion segmentation

In this notebook we'll be:
*   Understanding our dataset
*   Performing data preprocessing
*   Learning how to manipulate images with OpenCV
*   Artificially increasing our dataset's size
*   Creating basic ML models with our dataset

# Understanding our Dataset

Our dataset contains over 10,000 skin lesion images that fall into one of seven classes. These classes are melanocytic nevus, melanoma, benign keratosis, basal cell carcinoma, actinic keratosis, dermatofibroma, and vascular lesions.

*   Melanocytic Nevus is the medical term used to denote a mole that originates from the melanocytes in the skin. These are harmless artifacts found on the skin.

*   Melanoma is a very serious form of skin cancer that originates from melanocytes, the cells in the skin that produce melanin.

*   Benign Keratoses or Seborrheic Keratoses are skin artifacts that are not cancerous but form due to aging.

*   Basal Cell Carcinoma is a common form of skin cancer that originates from the basal cells. These cells replace the skin cells that die off.

*   Actinic Keratoses and Intraepithelial Carcinoma (Bowen's Disease) are skin lesions that develop due to old age and sun exposure. These lesions are considered to be "pre-cancerous" and can become cancerous.

*   Dermatofibroma are harmless skin bumps that form due to an overgrowth of various skin cells.

*   Vascular lesions are skin artifacts often referred to as birthmarks. These lesions appear due to clustering of blood vessels.

![alt text](https://workshop2018.isic-archive.com/images/task3.png)

Our images are sourced from the HAM10000 dataset, which is publicly available. Each image contains RGB data and is of the pixel dimensions 800 x 600. The images in the dataset are collected from a dermoscope, a tool that is used by dermatologists to image skin lesions. A dermoscope enhances images by providing magnification and adequate lighting.

![alt text](https://upload.wikimedia.org/wikipedia/commons/e/e6/Dermatoscope1.JPG)

In [None]:
#@title Run this to download data and prepare our environment! { display-mode: "form" }
from google.colab.output import eval_js

import time
start_time = time.time()

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tqdm.notebook import tqdm

import keras
from keras import backend as K
from tensorflow.keras.layers import *
from keras.models import Sequential
from keras.layers import Dense, Conv2D
from keras.layers import Activation, MaxPooling2D, Dropout, Flatten, Reshape
from keras.wrappers.scikit_learn import KerasClassifier

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os
import random
from PIL import Image
import gdown

import argparse
import numpy as np
from keras.layers import Conv2D, Input, BatchNormalization, LeakyReLU, ZeroPadding2D, UpSampling2D
from keras.layers.merge import add, concatenate
from keras.models import Model
import struct
from google.colab.patches import cv2_imshow
from copy import deepcopy
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_auc_score
from sklearn.base import BaseEstimator

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from keras.applications.mobilenet import MobileNet

!pip install hypopt
from hypopt import GridSearch

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering

!pip install -U opencv-contrib-python
import cv2

!pip install tensorflowjs
import tensorflowjs as tfjs

from google.colab import files

import requests, io, zipfile

# Prepare data

# os.makedirs returns None, so we call it only for its side effect
os.makedirs('images_1', exist_ok=True)
os.makedirs('images_2', exist_ok=True)
os.makedirs('images_all', exist_ok=True)

metadata_path = 'metadata.csv'
image_path_1 = 'images_1.zip'
image_path_2 = 'images_2.zip'
images_rgb_path = 'hmnist_8_8_RGB.csv'

!wget -O metadata.csv 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20(Healthcare%20B)%20Skin%20Cancer%20Diagnosis/metadata.csv'
!wget -O images_1.zip 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20(Healthcare%20B)%20Skin%20Cancer%20Diagnosis/images_1.zip'
!wget -O images_2.zip 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20(Healthcare%20B)%20Skin%20Cancer%20Diagnosis/images_2.zip'
!wget -O hmnist_8_8_RGB.csv 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20(Healthcare%20B)%20Skin%20Cancer%20Diagnosis/hmnist_8_8_RGB.csv'
!unzip -q -o images_1.zip -d images_1
!unzip -q -o images_2.zip -d images_2

!pip install patool
import patoolib

import os.path
from os import path

from distutils.dir_util import copy_tree

fromDirectory = 'images_1'
toDirectory = 'images_all'

copy_tree(fromDirectory, toDirectory)

fromDirectory = 'images_2'
toDirectory = 'images_all'

copy_tree(fromDirectory, toDirectory)

print("Downloaded Data")

# Preparing Our Dataset for Analysis

In [None]:
IMG_WIDTH = 100
IMG_HEIGHT = 75

We'll start off by separating our dataset into the `X` and `y` variables. `X` represents our input data (images), and `y` represents our data's labels (skin lesion classification). Each image is scaled down to be 100 px by 75 px to reduce the memory footprint. We'll also create a variable `X_gray`, that is the grayscale equivalent of our `X` variable.

One reason for performing these grayscale transformations could be to reduce bias in a classifier. This could prevent the ML model from becoming dependent on the color of the skin, as opposed to the features present in the actual skin cancer lesion. Another reason could lie with the need to reduce the dimensionality of our dataset for the simple ML classifiers we'll train later on. The less complex the training data is, the less likely our models are to overfit on it. By performing this grayscale operation, we're reducing the RGB values for each pixel into one grayscale value from 0 to 255.
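
To make the "three values into one" idea concrete, here is a minimal NumPy sketch of a weighted grayscale conversion. It assumes the standard luminance weights (0.299 for red, 0.587 for green, 0.114 for blue) and OpenCV's blue-green-red channel order; `bgr_to_gray` is a hypothetical helper written for illustration, not a function from the notebook.

```python
import numpy as np

# Assumption: pixels are stored in OpenCV's B, G, R channel order.
BGR_WEIGHTS = np.array([0.114, 0.587, 0.299])

def bgr_to_gray(img_bgr):
    """Collapse an (H, W, 3) BGR image into an (H, W) grayscale image."""
    gray = img_bgr.astype(np.float64) @ BGR_WEIGHTS
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

demo = np.zeros((75, 100, 3), dtype=np.uint8)
demo[:, :50] = (255, 255, 255)   # left half: white
demo[:, 50:] = (255, 0, 0)       # right half: pure blue in (B, G, R) order
gray_demo = bgr_to_gray(demo)
print(gray_demo.shape)           # one value per pixel instead of three
```

Each image shrinks from 75 x 100 x 3 values to 75 x 100 values, which is the dimensionality reduction described above.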

As there are over 10,000 images, this code segment may take a few minutes to run.

In [None]:
X = []
X_gray = []
y = []

In [None]:
#@title Run this to initialize our X, X_gray, and y variables { display-mode: "form" }
metadata = pd.read_csv(metadata_path)
metadata['category'] = metadata['dx'].replace({'akiec': 0, 'bcc': 1, 'bkl': 2, 'df': 3, 'mel': 4, 'nv': 5, 'vasc': 6,})


for i in tqdm(range(len(metadata))):
  image_meta = metadata.iloc[i]
  path = os.path.join(toDirectory, image_meta['image_id'] + '.jpg')
  img = cv2.imread(path,cv2.IMREAD_COLOR)
  img = cv2.resize(img,(IMG_WIDTH,IMG_HEIGHT))

  img_g = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
  X_gray.append(img_g)

  X.append(img)
  y.append(image_meta['category'])

X_gray = np.array(X_gray)
X = np.array(X)
y = np.array(y)

Let's take a look at an example of what our data looks like! Try exploring the dataset at different indices of our `X` variable!

In [None]:
cv2_imshow(X[0])

Let's take a look at the shape of our updated `X`, `X_gray`, and `y` variables

In [None]:
print(X_gray.shape)
print(X.shape)
print(y.shape)

It looks like we've got a total of 10,015 images in our dataset. Plotting a graph of the distribution of labels found in the dataset can help us determine if we need to balance the data.

In [None]:
#@title Run this to plot the distribution of our dataset { display-mode: "form" }
objects = ('akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc')
y_pos = np.arange(len(objects))
occurances = []

for obj in objects:
  occurances.append(np.count_nonzero(obj == metadata['dx']))

print(occurances)

plt.bar(y_pos, occurances, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Samples')
plt.title('Distribution of Classes Within Dataset')

plt.show()

This bar chart clearly informs us that our dataset is very unbalanced. There are far more nevi samples than there are samples of any other class.
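
Why does imbalance matter? Here is a small sketch using made-up labels (not the real class counts) that shows how a model that always predicts the most common class can still look accurate. The array `toy_labels` and its proportions are assumptions chosen purely for illustration.

```python
import numpy as np

# Made-up labels for illustration only; these are not the real dataset counts.
toy_labels = np.array([5] * 80 + [4] * 12 + [2] * 8)

counts = np.bincount(toy_labels, minlength=7)   # samples per class, indexed by class id
majority_class = int(np.argmax(counts))          # the most common class
baseline_accuracy = counts[majority_class] / counts.sum()

print(counts)
print(majority_class, baseline_accuracy)
```

A classifier that ignores its input and always answers with the majority class would reach this baseline accuracy, so any accuracy we report later should be compared against it.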

For the sake of reducing execution and training time, we'll be cutting down the size of our dataset. To observe the full performance of our model and see the complete extent of our visualizations, we can comment out the following lines and re-run the notebook. However, note that some code blocks will take much longer to run.

We can choose between the following two methods to reduce our dataset size, and we only run one of the two code blocks. The first option reduces the dataset size far more than the second option. Specify which option you would like to proceed with by setting the value of the variable `option`.
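
The idea behind the first option, capping every class at a fixed number of samples, can be sketched in a few lines of NumPy. This is a simplified illustration on toy labels; `cap_per_class` is a hypothetical helper and not the code used in the cells below.

```python
import numpy as np

def cap_per_class(labels, cap):
    """Return sorted indices that keep at most `cap` samples of each class."""
    labels = np.asarray(labels)
    keep = []
    for cls in np.unique(labels):
        # Keep the first `cap` occurrences of this class
        keep.extend(np.flatnonzero(labels == cls)[:cap].tolist())
    return np.array(sorted(keep), dtype=int)

toy_y = np.array([0, 0, 0, 1, 1, 2])
kept = cap_per_class(toy_y, cap=2)
print(kept, np.bincount(toy_y[kept]))
```

Classes with fewer samples than the cap are left untouched, so the result is only approximately balanced.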

In [None]:
sample_cap = 142
option = 1

In [None]:
#@title Option 1: Run this to reduce dataset size. This method caps each class at *sample_cap* samples. { display-mode: "form" }
if (option == 1):
  objects = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
  class_totals = [0,0,0,0,0,0,0]
  iter_samples = [0,0,0,0,0,0,0]
  indicies = []

  for i in range(len(X)):
    class_totals[y[i]] += 1

  print("Initial Class Samples")
  print(class_totals)

  for i in range(len(X)):
    if iter_samples[y[i]] != sample_cap:
      indicies.append(i)
      iter_samples[y[i]] += 1

  X = X[indicies]
  X_gray = X_gray[indicies]

  y = y[indicies]

  class_totals = [0,0,0,0,0,0,0]

  for i in range(len(X)):
    class_totals[y[i]] += 1

  print("Modified Class Samples")
  print(class_totals)
else:
  print("This option was not selected")

In [None]:
#@title Option 2: Run this to reduce dataset size. This method only reduces the number of *nv* samples to be the same amount as the number of samples found in the second most prevalent class. { display-mode: "form" }
if (option == 2):
  objects = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
  class_totals = [0,0,0,0,0,0,0]

  for i in range(len(X)):
    class_totals[y[i]] += 1

  print("Initial Class Samples")
  print(class_totals)

  largest_index = class_totals.index(max(class_totals))
  class_totals[largest_index] = 0

  second_largest_val = max(class_totals)

  indicies = []
  iter = 0
  for i in range(len(X)):
    if y[i] == largest_index:
      if iter != second_largest_val:
        indicies.append(i)
        iter += 1
      else:
        continue
    else:
      indicies.append(i)

  X = X[indicies]
  X_gray = X_gray[indicies]

  y = y[indicies]

  # Recount after filtering so the printed totals reflect the reduced dataset
  class_totals = [0,0,0,0,0,0,0]

  for i in range(len(X)):
    class_totals[y[i]] += 1

  print("Modified Class Samples")
  print(class_totals)
else:
  print("This option was not selected")

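After running Option 1 above, no class has more than `sample_cap` samples, so our dataset is far less imbalanced. (Option 2 only trims the *nv* class, so the other classes remain uneven.) With roughly balanced classes, accuracy becomes a more reasonable metric for performance.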

# OpenCV Image Manipulation

Having a large and rich dataset allows our model to be exposed to different types of images and in turn, perform better when given images to classify.

Consider professional images taken with proper medical equipment by a dermatologist. These images are more likely to be clear and in focus than those taken by an amateur with a cell phone camera. However, as both types of images are likely to be sent to our ML model for classification, it's important that we prepare our model for both situations.

One method of increasing our dataset's size is called *data augmentation*. Through data augmentation, we take existing images from our dataset, and duplicate a version of that image with an image transformation applied to it. This process can be repeated multiple times, and the dataset size can increase ten-fold or greater. Well, what does this mean in practice?

Let's explore this further with the example of this *Jaguar* sports car.

In [None]:
#@title Run this to download our Jaguar car image! { display-mode: "form" }
!wget -O jaguar.jpeg 'https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20(Healthcare%20B)%20Skin%20Cancer%20Diagnosis/jaguar.jpeg'

In [None]:
jaguar = cv2.imread("jaguar.jpeg")
cv2_imshow(jaguar)

This is a very crisp and detailed image. But what if in the future, after the model has been trained on numerous images like this, it's presented with an image of lower quality? What if the camera wasn't completely in focus?

One solution is to artificially generate these lower quality pictures ourselves. This way we could expose the model to lower quality images and prepare it in case it receives similar images in the future.

To manipulate and edit images, we'll be exploring the use of the OpenCV library. OpenCV is a very powerful image processing library that has the capability to transform images in numerous ways.

Here are two functions in OpenCV:



*   `cv2.resize(image,(new_width,new_height))` resizes an image.
*   `cv2.blur(image,(kernel_size,kernel_size))` blurs an image. The `kernel_size` argument indicates how wide and how high the window that smoothes the image is. The larger the kernel size, the more intense the blur.

Also, we can use `cv2_imshow()` to view our images in Colab. If not in a Colab notebook environment, we can use `cv2.imshow()`.
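
To see why a larger kernel gives a stronger blur, here is a rough NumPy sketch of a box (averaging) filter. It assumes an odd kernel size and pads the border by repeating edge pixels, which is a simplification and differs from OpenCV's default border handling; `box_blur` is a hypothetical helper for illustration only.

```python
import numpy as np

def box_blur(gray, k):
    """Average every pixel with its k x k neighborhood (k is assumed to be odd)."""
    pad = k // 2
    # Assumption: replicate edge pixels at the border; OpenCV's default border mode differs.
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    height, width = gray.shape
    out = np.zeros((height, width), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + height, dx:dx + width]
    return out / (k * k)

impulse = np.zeros((7, 7))
impulse[3, 3] = 225.0
mild = box_blur(impulse, 3)    # the bright pixel is spread over a 3 x 3 window
strong = box_blur(impulse, 5)  # a 5 x 5 window spreads it further and dims it more
print(mild.max(), strong.max())
```

The single bright pixel keeps less of its intensity as the window grows, which is what we perceive as a more intense blur.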

In [None]:
# Blur
blur_jaguar = cv2.blur(jaguar,(4,4))
cv2_imshow(blur_jaguar)

# Resize
small_jaguar = cv2.resize(jaguar,(455,256))
normal_jaguar = cv2.resize(small_jaguar,(910,511))
cv2_imshow(normal_jaguar)

Let's say that our classifier was distinguishing between Ferraris and Jaguars, and in most of the training data, the Ferraris were red and the Jaguars were black. The classifier could then learn to rely on a car's color rather than its shape, and misclassify a black Ferrari as a Jaguar.

OpenCV uses the `BGR` channel order as opposed to the traditional `RGB` order for its images. This means that the first element of each pixel is the blue channel, the second element is the green channel, and the third element is the red channel. The function `cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)` converts an image to grayscale, while the function `cv2.flip(image,i)` flips an image. When `i` is `0`, `1`, or `-1`, the image is flipped around the x-axis, the y-axis, or both axes respectively. We can try creating some images that could reduce the likelihood of the aforementioned errors occurring:
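
Since OpenCV images are NumPy arrays, the channel order and the three flip modes can be illustrated with plain array slicing. This is a small sketch on a made-up 2 x 3 image, meant only to show which axis each operation reverses.

```python
import numpy as np

# A tiny 2 x 3 "BGR" image whose values make every pixel easy to identify
img = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

rgb = img[:, :, ::-1]        # reverse the channel axis: BGR -> RGB
flip_x = img[::-1, :, :]     # around the x-axis (top <-> bottom), like flip code 0
flip_y = img[:, ::-1, :]     # around the y-axis (left <-> right), like flip code 1
flip_both = img[::-1, ::-1]  # around both axes, like flip code -1

print(img[0, 0], rgb[0, 0])
```

Reversing the last axis is also a handy way to convert an OpenCV image for libraries that expect RGB, such as matplotlib.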

In [None]:
# Grayscale
jaguar_bw = cv2.cvtColor(jaguar,cv2.COLOR_BGR2GRAY)
cv2_imshow(jaguar_bw)

# Flip
jaguar_flip = cv2.flip(jaguar,0)
cv2_imshow(jaguar_flip)

Another image transformation we can implement is a *zoom*.
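
One way to think about a zoom is as a crop around the center of the image followed by a resize back to the original dimensions. Here is a minimal sketch of the cropping step in NumPy, assuming that a zoom of `z` keeps the central `1 - z` fraction of each dimension; `center_zoom_crop` is a hypothetical helper.

```python
import numpy as np

def center_zoom_crop(img, zoom):
    """Crop the central (1 - zoom) fraction of each dimension."""
    height, width = img.shape[:2]
    center_row, center_col = height // 2, width // 2
    # Half of the crop window's extent on either side of the center
    half_rows = int((1 - zoom) * height / 2)
    half_cols = int((1 - zoom) * width / 2)
    return img[center_row - half_rows:center_row + half_rows,
               center_col - half_cols:center_col + half_cols]

toy = np.arange(8 * 12).reshape(8, 12)
crop = center_zoom_crop(toy, zoom=0.5)
print(crop.shape)
```

Passing the cropped array to `cv2.resize` with the original width and height then produces the zoomed image.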

In [None]:
#Zoom into our image
zoom = 0.33

centerX,centerY=int(jaguar.shape[0]/2),int(jaguar.shape[1]/2)
# Half-extent of the crop window: (1 - zoom) of each dimension, split evenly around the center
radiusX,radiusY= int((1-zoom)*jaguar.shape[0]/2),int((1-zoom)*jaguar.shape[1]/2)

minX,maxX=centerX-radiusX,centerX+radiusX
minY,maxY=centerY-radiusY,centerY+radiusY

cropped = jaguar[minX:maxX, minY:maxY]
zoom_img = cv2.resize(cropped, (jaguar.shape[1], jaguar.shape[0]))
cv2_imshow(zoom_img)

We've now explored many different image operations we can perform using OpenCV. Let's head back to our skin cancer image dataset, and apply what we've learned there!

# Data Augmentation

Although our dataset is very expansive with over 10,000 images, we can generate more samples so that our model is prepared to cope with a more varied dataset. Through data augmentation, we can perform random operations such as a flip, blur, or zoom on existing images, to create new image samples. It's important to note that these data augmentation procedures should only be applied to the training dataset.
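
The order of operations matters here: we split first and augment second, so that no transformed copy of a test image leaks into training. Below is a small sketch of that ordering on random toy arrays; the shapes, the 60/40 split, and the mirror-only augmentation are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 4, 6), dtype=np.uint8)  # 10 toy grayscale images
labels = np.arange(10) % 2

# 1. Split first
order = rng.permutation(len(images))
train_idx, test_idx = order[:6], order[6:]
X_tr, y_tr = images[train_idx], labels[train_idx]
X_te, y_te = images[test_idx], labels[test_idx]

# 2. Augment the training slice only: append a left-right mirrored copy of each image
X_tr_aug = np.concatenate([X_tr, X_tr[:, :, ::-1]])
y_tr_aug = np.concatenate([y_tr, y_tr])

print(X_tr_aug.shape, X_te.shape)
```

The test slice keeps its original size and content, so it still measures performance on images the model has never seen in any form.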


**Note:** For color images, every transformation except converting to grayscale and thresholding/segmentation should work well with our dataset.

Let's first complete our test/train split for both our grayscale image data and our color image data.

In [None]:
X_gray_train, X_gray_test, y_train, y_test = train_test_split(X_gray, y, test_size=0.4, random_state=101)

Let's also perform a test/train split for `X` and `y`: the color image data and the respective labels. We need to create `X_train, X_test, y_train, y_test`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

We'll now iterate through all the images in the training slice of our dataset and create a duplicate with a random transformation, doubling our training dataset's size. In this code block, we'll randomly decide to flip the image across the y-axis or apply a 33% zoom.

In [None]:
X_augmented = []
X_gray_augmented = []

y_augmented = []

for i in tqdm(range(len(X_train))):
  transform = random.randint(0,1)
  if (transform == 0):
    # Flip the image across the y-axis
    X_augmented.append(cv2.flip(X_train[i],1))
    X_gray_augmented.append(cv2.flip(X_gray_train[i],1))
    y_augmented.append(y_train[i])
  else:
    # Zoom 33% into the image
    zoom = 0.33

    centerX,centerY=int(IMG_HEIGHT/2),int(IMG_WIDTH/2)
    # Half-extent of the crop window: (1 - zoom) of each dimension, split evenly around the center
    radiusX,radiusY= int((1-zoom)*IMG_HEIGHT/2),int((1-zoom)*IMG_WIDTH/2)

    minX,maxX=centerX-radiusX,centerX+radiusX
    minY,maxY=centerY-radiusY,centerY+radiusY

    cropped = (X_train[i])[minX:maxX, minY:maxY]
    new_img = cv2.resize(cropped, (IMG_WIDTH, IMG_HEIGHT))
    X_augmented.append(new_img)

    cropped = (X_gray_train[i])[minX:maxX, minY:maxY]
    new_img = cv2.resize(cropped, (IMG_WIDTH, IMG_HEIGHT))
    X_gray_augmented.append(new_img)

    y_augmented.append(y_train[i])

# The augmented samples are merged into the training set in the next cell

In [None]:
#@title Run this to Combine Augmented Data with Existing Samples { display-mode: "form" }
X_augmented = np.array(X_augmented)
X_gray_augmented = np.array(X_gray_augmented)

y_augmented = np.array(y_augmented)

X_train = np.vstack((X_train,X_augmented))
X_gray_train = np.vstack((X_gray_train,X_gray_augmented))

y_train = np.append(y_train,y_augmented)

Let's view the shape of our training variables after data augmentation.

In [None]:
print(X_gray_train.shape)
print(X_train.shape)
print(y_train.shape)

We can try performing two additional image transformations with OpenCV for data augmentation!

In [None]:
X_augmented = []
X_gray_augmented = []

y_augmented = []

for i in tqdm(range(len(X_train))):
  transform = random.randint(0,1)
  if (transform == 0):

    # Resize the image by half on each dimension, and resize back to original
    # dimensions

    small_image = cv2.resize(X_train[i],(IMG_WIDTH//2,IMG_HEIGHT//2))
    normal_image = cv2.resize(small_image,(IMG_WIDTH,IMG_HEIGHT))

    small_grayscale_image = cv2.resize(X_gray_train[i],(IMG_WIDTH//2,IMG_HEIGHT//2))
    normal_grayscale_image = cv2.resize(small_grayscale_image,(IMG_WIDTH,IMG_HEIGHT))

    X_augmented.append(normal_image)
    X_gray_augmented.append(normal_grayscale_image)
    y_augmented.append(y_train[i])
  else:

    # Blur the image with a 4 x 4 kernel

    X_augmented.append(cv2.blur(X_train[i],(4,4)))
    X_gray_augmented.append(cv2.blur(X_gray_train[i],(4,4)))
    y_augmented.append(y_train[i])

In [None]:
#@title Run this to Combine Augmented Data with Existing Samples { display-mode: "form" }
X_augmented = np.array(X_augmented)
X_gray_augmented = np.array(X_gray_augmented)

y_augmented = np.array(y_augmented)

X_train = np.vstack((X_train,X_augmented))
X_gray_train = np.vstack((X_gray_train,X_gray_augmented))

y_train = np.append(y_train,y_augmented)

# Creating Basic Machine Learning Models

Now that we've implemented data augmentation into our pipeline and artificially generated more samples for our dataset, let's test out various ML models.



Let's start off by creating a K Nearest Neighbors model.
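
Before using scikit-learn's implementation, it can help to see the core idea of K Nearest Neighbors: find the `k` training samples closest to a new point and let them vote. This is a bare-bones sketch on made-up 2D points, assuming Euclidean distance; `knn_predict` is a hypothetical helper, and ties go to the smallest label.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(train_X - query, axis=1)  # Euclidean distance to every sample
    nearest = np.argsort(distances)[:k]                  # indices of the k closest samples
    votes = np.bincount(train_y[nearest])                # count labels among those neighbors
    return int(np.argmax(votes))

toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([0.3, 0.3]), k=3))
print(knn_predict(toy_X, toy_y, np.array([4.9, 5.0]), k=3))
```

Notice that there is no real "training" step: the model simply stores the data, which is why prediction can be slow on large, high-dimensional datasets like ours.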

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)

Scikit-learn takes feature vectors as data samples (1D arrays). However, images have at least 2 dimensions.

Let's perform an operation known as *image flattening* with our grayscale image data. In this operation, we reshape each image into a one-dimensional array of length 7500 instead of a matrix with 75 rows and 100 columns.
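
Here is a quick sketch of what flattening does to a single pixel, using a toy batch of blank images. It assumes row-major order (NumPy's default), where the pixel at row `r` and column `c` ends up at position `r * 100 + c`.

```python
import numpy as np

batch = np.zeros((4, 75, 100), dtype=np.uint8)  # 4 toy grayscale images: 75 rows x 100 columns
batch[0, 1, 2] = 200                             # mark one pixel so we can track it

flat = batch.reshape(batch.shape[0], -1)         # one row of 7500 features per image
restored = flat.reshape(-1, 75, 100)             # the reshape is lossless and reversible

print(flat.shape, flat[0, 1 * 100 + 2])
```

Flattening keeps every pixel value but discards the explicit notion of which pixels are neighbors, which is one reason simple models struggle with images.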

In [None]:
X_g_train_flat = X_gray_train.reshape(X_gray_train.shape[0],-1)
X_g_test_flat = X_gray_test.reshape(X_gray_test.shape[0],-1)
print (X_g_train_flat.shape)
print (X_g_test_flat.shape)

Let's train our models on our flattened grayscale images! Once again, due to the size of our dataset, training may take a few minutes for each model.

In [None]:
knn.fit(X_g_train_flat, y_train)

A common way to measure our model's performance uses the Receiver Operating Characteristic (ROC) curve, which shows the relationship between our model's true positive rate and false positive rate. This metric is especially useful in our scenario since, unlike accuracy, it doesn't depend on balanced classes in our dataset.

An ROC curve plots the true positive rate against the false positive rate as we vary the positive/negative threshold for a classifier. The AUC, or Area Under the Curve, is the metric we use. The greater the area (that is, the closer the curve lies to the top left), the better the model. A model that guesses randomly would follow the 45 degree line.
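
The AUC also has a handy interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. Here is a minimal sketch of that view for the binary case, assuming labels of 0 and 1 and counting ties as half; `binary_roc_auc` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def binary_roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs that the scores rank correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=np.float64)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

print(binary_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # one pair is ranked wrongly
print(binary_roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))    # scores carry no information
```

A model whose scores carry no information lands at 0.5, matching the 45 degree line, while perfect ranking gives 1.0.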



We'll define a function called `model_stats()` that prints the model's performance. Specifically, it will print the model's name, its accuracy, and its ROC AUC value. Before we create our function, there's one more thing to cover.

When we calculate the ROC AUC score, we have to compare the true test labels, `y_test`, against the predicted probabilities for each class for every sample. What does this mean? Let's take a look at an example!

If a model was predicting class `0` in a classifier, the one-hot representation of this would be `[1, 0, 0, 0]`. The probabilistic representation of this could be `[0.4, 0.2, 0.2, 0.2]`. All the probability values in the array add up to `1`. Each element of this array represents the probability of that specific class being predicted by the classifier. For example, the probability of class `0` being predicted is represented by the value in the zeroth element of the array. In this case that would be `0.4`.
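
The relationship between the two representations can be sketched in NumPy, using the probabilities from the example above: the predicted class is the position of the largest probability, and the one-hot vector marks that position.

```python
import numpy as np

proba = np.array([0.4, 0.2, 0.2, 0.2])                     # probabilities from the example above
predicted_class = int(np.argmax(proba))                     # class with the highest probability
one_hot = np.eye(len(proba), dtype=int)[predicted_class]    # 1 at that class, 0 elsewhere

print(predicted_class, one_hot, proba.sum())
```

The one-hot vector throws away how confident the model was, which is why ROC AUC is computed from the probabilities instead.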

We can calculate these probability arrays using the `predict_proba()` function.

Now we can write `model_stats()`. To calculate ROC AUC scores, we can use the `roc_auc_score()` function.

In [None]:
def model_stats(name, y_test, y_pred, y_pred_proba):
  cm = confusion_matrix(y_test, y_pred)

  print(name)

  accuracy = accuracy_score(y_test,y_pred)
  print ("The accuracy of the model is " + str(round(accuracy,5)))

  roc_score = roc_auc_score(y_test, y_pred_proba, multi_class='ovo')

  print ("The ROC AUC Score of the model is " + str(round(roc_score,5)))

  return cm

Let's run the function and observe the performance of our K Nearest Neighbors model.

In [None]:
y_pred = knn.predict(X_g_test_flat)
y_pred_proba = knn.predict_proba(X_g_test_flat)

knn_cm = model_stats("K Nearest Neighbors",y_test,y_pred,y_pred_proba)

There seems to be a big discrepancy between our accuracy and ROC AUC scores. Why is that? Let's take a look at some plots of the confusion matrices. Let's create a function called `plot_cm()` that we will use to plot the confusion matrices.

In [None]:
def plot_cm(name, cm):
  classes = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']

  df_cm = pd.DataFrame(cm, index = [i for i in classes], columns = [i for i in classes])
  df_cm = df_cm.round(5)

  plt.figure(figsize = (12,8))
  sns.heatmap(df_cm, annot=True, fmt='g')
  plt.title(name + " Model Confusion Matrix")
  plt.xlabel("Predicted Label")
  plt.ylabel("True Label")
  plt.show()

Let's run our new function for our KNN classifier. Remember that we have seven classes, so an accuracy that seems horribly low (like 50%) isn't as bad as it might appear!

In [None]:
plot_cm("K Nearest Neighbors",knn_cm)

It seems that while many nevi images were accurately classified, many images of other classes were incorrectly classified as nevi. When a dataset is imbalanced, accuracy is misleading, because a model can score well simply by favoring the most common class. In addition, a ROC AUC score close to 0.5 indicates that the model is barely able to discriminate between the classes.
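
To see how a confusion matrix exposes this kind of failure, here is a small sketch with made-up predictions for three classes, where the model leans heavily toward class `2`. The helper `confusion_counts` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # add one count per (true, predicted) pair
    return cm

true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
pred = np.array([2, 2, 1, 2, 2, 2, 2, 2])   # the model favors class 2

cm = confusion_counts(true, pred, 3)
accuracy = np.trace(cm) / cm.sum()           # correct predictions lie on the diagonal
recall = np.diag(cm) / cm.sum(axis=1)        # per-class fraction of samples found

print(cm)
print(accuracy, recall)
```

The overall accuracy looks acceptable, but the per-class recall reveals that class `0` is never predicted correctly.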

Let's try modifying our KNN model's architecture and hyperparameters to increase our model's performance. We can use a library called *hypopt* to automate this process through a *grid search*. We'll automatically try out many possible hyperparameters for our machine learning algorithm to see which give the best performance.
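
A grid search simply tries every combination of the values we list. Here is a sketch of how those combinations can be enumerated with the standard library; the grid values are illustrative, and this only counts the candidates rather than training any model.

```python
from itertools import product

# Illustrative grid: each key is a hyperparameter, each list holds values to try
grid = {
    "n_neighbors": [2, 3, 4, 5],
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
}

names = list(grid)
combos = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

print(len(combos))   # number of models a grid search would have to train
print(combos[0])
```

The number of combinations is the product of the list lengths, so adding even one more hyperparameter can multiply the search time.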

Before we perform a grid search, we need to create an additional slice of our dataset. As of now, we have 60% of the data allocated for training, and 40% for testing. We'll now create a new slice of our dataset called the validation dataset. The validation dataset will be used for model testing during the grid search and will comprise 50% of our testing set, or 20% of the entire dataset.
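
The resulting 60/20/20 split can be sketched with index arithmetic. This toy version assumes 100 samples and a simple random permutation; in the notebook itself the split is done with `train_test_split`.

```python
import numpy as np

n_samples = 100
order = np.random.default_rng(101).permutation(n_samples)

n_train = (n_samples * 60) // 100   # 60% for training
n_holdout = n_samples - n_train     # the remaining 40%
n_val = n_holdout // 2              # half of the holdout becomes validation

train_idx = order[:n_train]
val_idx = order[n_train:n_train + n_val]
test_idx = order[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the three slices disjoint means the test set stays untouched while the validation set is used to choose hyperparameters.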

Let's create the new variables `X_gray_test, X_gray_val, y_g_test, y_g_val`. Let's also create `X_gray_val_flat` and `X_gray_test_flat`, our flattened arrays.

In [None]:
X_gray_test, X_gray_val, y_g_test, y_g_val = train_test_split(X_gray_test, y_test, test_size=0.5, random_state=101)

X_gray_test_flat = np.reshape(X_gray_test,(X_gray_test.shape[0],X_gray_test.shape[1]*X_gray_test.shape[2]))
X_gray_val_flat = np.reshape(X_gray_val,(X_gray_val.shape[0],X_gray_val.shape[1]*X_gray_val.shape[2]))

In [None]:
X_gray_test.shape

In the variable `param_grid` we can specify which parameters in our KNN Classifier we want to modify.

In [None]:
param_grid = {
              'n_neighbors' :     [2, 3, 4, 5],
              'weights' :          ['uniform', 'distance'],
              'algorithm' :        ['ball_tree', 'kd_tree', 'brute']
             }

Let's initialize and fit our grid search optimizer. This can take a while!

In [None]:
gs_knn = GridSearch(model=KNeighborsClassifier(),param_grid=param_grid)

gs_knn.fit(X_g_train_flat.astype(np.float32), y_train.astype(np.float32),
       X_gray_val_flat.astype(np.float32), y_g_val.astype(np.float32),verbose=1)

Now, the model will be trained with the best hyperparameters. Let's try evaluating its performance:

In [None]:
y_pred = gs_knn.predict(X_gray_test_flat)
y_pred_proba = gs_knn.predict_proba(X_gray_test_flat)
gs_knn_cm = model_stats("Grid Search KNN",y_g_test,y_pred,y_pred_proba)

Let's also plot the confusion matrix.

In [None]:
plot_cm("Grid Search KNN",gs_knn_cm)

Seems like the grid search didn't improve the model's performance. It could be that this ML model is unable to handle the dimensionality of our dataset.

Let's try out some of these other ML models as well. Perhaps some of these would perform better.

In [None]:
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

That's a wrap for this notebook! In the next notebook, we'll create more complex ML models (which will hopefully work better!) and finally deploy our models to a web app.