Hey! In this notebook, we are going to test our models on either the DDI dataset or the SynthDerm dataset to assess bias. We will test a model over the entire dataset, and then separately for each aggregated Fitzpatrick skin type. We are looking for a difference in metrics across skin types, which would suggest that the model is biased.

When testing on the entire dataset, we generate the following metrics:

1. Classification Report
2. ROC-AUC
3. Confusion Matrix

When analyzing the dataset across each skin type, we will generate the following metrics:

1. Accuracy
2. ROC-AUC
3. Classification Report
4. F1-Scores
5. Confusion Matrix

Lastly, we will visually analyze the model to get a more in-depth view of the types of images where our model is failing (ex: it never classifies images with hair correctly).

**Public Release Notes**:

1. Some code has been omitted and some variables have been set to "None". This is done in the interest of privacy. If you are attempting to run this code on your own and encounter any trouble, send me an email and I will assist you however I can.

# **Installs and Imports**

As usual, we will just import the necessary libraries before we start.

In [1]:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas as pd
from PIL import Image

import numpy as np

import tensorflow as tf
from tensorflow import keras

import sklearn
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

import shutil

from glob import glob
import io

print(tf.version.VERSION)

2.13.0


# **Import a Model**

The first thing that we are going to do is import a model. We have saved all of our models on Google Drive, so we will first mount our drive and then load the model from its file path.

If the model's file path is not saved on drive, you will need to first upload the model's file from your local machine (disregard if you are running this locally).

In [None]:
# Mount your google drive
from google.colab import drive
drive.mount('/content/drive')

Next, just go through the drive directory and find the file that you want to upload. In this case, we are uploading the model we are going to use for testing. When you have found it, right click the file and copy the path.

In [None]:
# !!!! SET THIS TO MATCH YOUR DRIVE !!!!
model_path = None

# Load the model
model = tf.keras.models.load_model(model_path)
model.summary()

Now that we have a model, we can go ahead and prepare our dataset.

# **Loading A Dataset**

Here you can load whichever dataset you would like: either the SynthDerm dataset or the DDI dataset. **DO NOT RUN ALL OF THE BELOW CODE.** Only run section (a) or (b), depending on the desired dataset.

Firstly, just set up the path for the dataset zip file below.

In [4]:
# !!!! CHANGE THIS PATH TO UNZIP THE CORRECT FILE !!!!
ds_path = None

## **(a) Loading SynthDerm**

Firstly, we just want to download and unzip our dataset so that we can start setting it up. Since you have mounted your drive, the downloading is basically already done, so we can just unzip it.

Also, make sure that you are using the SynthDermPrepared dataset and not the SynthDerm dataset. The only difference is that I pre-combined all of the metadata and added the Fitzpatrick skin type labels in the prepared dataset. The only thing that we will need to add to this dataset is the image path. So, we are going to read in our metadata into a dataframe and then just add an image path for each image.

In [None]:
# Unzip and delete unused folder
!unzip "$ds_path" -d /content
!rm -r /content/__MACOSX

Here, we are just going to set some dataset specific variables that we are going to use now and later during testing.

In [None]:
# AUTO SET DATASET VARIABLES
fitz_col = 'fitzpatrick_type'
classes = ['Benign', 'Malignant']
fitz_keys = [1, 2, 3, 4, 5, 6]

Now we can read in the metadata from the csv file and add a file path to each image in the metadata.

In [None]:
# Read in the metadata from the csv file
df = pd.read_csv('/content/SynthDermPrepared/SynthDerm_metadata.csv')

# Get the paths for all images
base_img_dir = '/content/SynthDermPrepared/images'
df['path'] = df['image_id'].map(lambda img_id: base_img_dir + '/' + img_id)

df.head()

Now our dataframe is all set up, so we can continue and use the dataframe to create our dataset object.

## **(b) Loading DDI**

The following code will load the dataframe for the DDI dataset. Since we are switching between a few different versions of this dataset, adjust the values as needed.

All we are going to do below is:

1. Unzip the dataset zip file
2. Setup some variables that we will use to reference the specific dataset
3. Read in the metadata and add image paths, labels, and label indexes.

To start, let's unzip the dataset file.

In [None]:
# Unzip and delete unused folder
!unzip "$ds_path" -d /content
!rm -r /content/__MACOSX

Now, let's set up some dataset-specific variables that we will use now and later in testing.

In [6]:
# Set the values for various dataset variables.
fitz_col = 'skin_tone'
classes = ['Benign', 'Malignant']
fitz_keys = [12, 34, 56]

Now, we can read in the metadata into a dataframe and add our image paths, labels, and label indexes.

In [None]:
# Read in the csv file
metadata_filepath = '/content/ddi/updated_ddi_metadata.csv'
df = pd.read_csv(metadata_filepath)

# Setup the paths in the csv file
base_img_dir = '/content/ddi/images'
df['path'] = df['DDI_file'].map(lambda img_file: base_img_dir + '/' + img_file)

# Setup the label and label_index
df['label_ind'] = df['malignant'].map(lambda x: 1 if x else 0)
df['label'] = df['label_ind'].map(lambda x: classes[x])

# Drop the unused column
df.pop('Unnamed: 0')

df.head()

Now our dataframe is all set up, so we can continue and use the dataframe to create our dataset object.
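
Whichever dataset you loaded, it can help to confirm that every path in the dataframe points at a real file before building the dataset object. The following is a minimal sketch, not part of the original pipeline: it uses a temporary folder and made-up file names (`present.png`, `missing.png`) purely for illustration.

```python
import os
import tempfile

import pandas as pd

# Build a throwaway image folder with one real file and one missing file.
with tempfile.TemporaryDirectory() as base_img_dir:
    open(os.path.join(base_img_dir, 'present.png'), 'wb').close()

    # Hypothetical metadata, shaped like the DDI dataframe above.
    toy_df = pd.DataFrame({'DDI_file': ['present.png', 'missing.png']})
    toy_df['path'] = toy_df['DDI_file'].map(lambda f: os.path.join(base_img_dir, f))

    # Flag rows whose image file is not on disk.
    toy_df['exists'] = toy_df['path'].map(os.path.exists)
    missing_files = toy_df.loc[~toy_df['exists'], 'DDI_file'].tolist()

print(missing_files)
```

On the real dataframe, the same `map(os.path.exists)` check on `df['path']` would list any images that failed to unzip.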

# **Loading the Images Into A Dataset Object**

Now, one of the first things that we are going to do is get rid of any images that don't have a fitzpatrick skin type assigned.

In [10]:
# Drop all data that doesn't have a fitzpatrick label
df = df[df[fitz_col].notna()]

Now that we have the dataframe all set up, we can go ahead and get all of the paths and set up a Dataset object.

Note, we don't need to add the labels to the dataset object directly as the model won't see them. We are just using the dataset object to feed in the images to get predictions that we can then line up with the labels in our dataframe. Though, it is really important that we **don't shuffle our dataset** at all in this process. Otherwise, when we add the model's predictions for each image back into the dataframe for analysis, the predictions would be invalid.

In [11]:
new_img_size = (224, 224)  # matches the input size of the model
AUTOTUNE = tf.data.AUTOTUNE

# Reads in an image given its file path and resizes it to new_img_size
@tf.function
def prepare_image(file_path):
  image = tf.io.read_file(file_path)
  image = tf.image.decode_image(image, channels=3, expand_animations=False)
  image = tf.image.resize(image, new_img_size)
  image = tf.expand_dims(image, axis=0)
  return image

# Extract the paths (the labels stay in the dataframe) and make a dataset object
paths = df['path'].to_numpy()
dataset = tf.data.Dataset.from_tensor_slices(paths)
dataset = dataset.map(prepare_image, num_parallel_calls=AUTOTUNE)
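
To make the warning about shuffling concrete, here is a small sketch with toy data. The `fake_predict` function and the file names are stand-ins invented for this example (they are not part of the notebook); the only assumption is that predictions come back in the same order the inputs were fed in.

```python
import numpy as np
import pandas as pd

# Toy metadata: three images with known labels (hypothetical file names).
toy_df = pd.DataFrame({
    'path': ['img_0.png', 'img_1.png', 'img_2.png'],
    'label_ind': [0, 1, 1],
})

# Stand-in for model.predict: scores come back in the order the inputs were fed.
def fake_predict(paths):
    scores = {'img_0.png': 0.1, 'img_1.png': 0.9, 'img_2.png': 0.8}
    return np.array([scores[p] for p in paths])

# Unshuffled: row i of the predictions belongs to row i of the dataframe.
ordered_pred = fake_predict(toy_df['path'].to_numpy())
ordered_acc = ((ordered_pred > 0.5).astype(int) == toy_df['label_ind']).mean()

# Shuffled input order: the same scores are now attached to the wrong rows.
shuffled_paths = toy_df['path'].to_numpy()[[1, 0, 2]]
shuffled_pred = fake_predict(shuffled_paths)
shuffled_acc = ((shuffled_pred > 0.5).astype(int) == toy_df['label_ind']).mean()

print(ordered_acc, shuffled_acc)
```

The model is equally good in both cases, but the measured accuracy drops once the order changes, which is why the dataset object must keep the dataframe's row order.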

# **Making and Processing Model Predictions**

Now that our model and dataset are set up, we can go ahead and use our model to get a prediction for each image.

Firstly, if you would like to use a GPU to speed things up, here is the code to set it up.

In [12]:
# Check if GPU is accessible to TF
tf.config.list_physical_devices('GPU')
tf.debugging.set_log_device_placement(True)

# Get the GPU memory fraction to allocate
gpu_memory_fraction = 0.65

# Create GPUOptions with the fraction of GPU memory to allocate
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction)

# Create a session with the GPUOptions
session = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))

Now, we can go ahead and use the model to make predictions on the data.

In [None]:
# Make predictions of the test data
test_pred = model.predict(dataset)

# Save the continuous predictions in the dataframe
df['cont_prediction'] = test_pred
df.head()

Next, we are just going to interpret the predictions, depending on the model (binary classifier vs multi classifier), the number of classes in the dataset, and a specified threshold.

The code below will allow us to interpret our multi-class models against our binary dataset. When we do this, we take any predictions that are either melanoma (1) or carcinoma (2) and treat them as malignant. Any benign prediction (0) is kept as benign.

In [None]:
# SET THE THRESHOLD FOR TESTING
threshold = 0.5

# Make a custom error to catch a mismatch between model and data classes
class DsModelMismatchError(Exception):
  "CANNOT CONVERT BINARY CLASSIFICATION TO MULTI-CLASS CLASSIFICATION"
  pass

# Determine the number of classes in the dataset and if the model is a binary classifier or not
ds_classes = len(classes)
binary_classifier_model = model.layers[-1].output_shape[1] <= 1

# Convert the model predictions to match the dataset
if ds_classes > 2 and not binary_classifier_model:
  # DS and Model both multi-class
  print("Both Dataset and Model are multi-class -- Computing argmax of output tensor")
  df['prediction'] = np.argmax(test_pred, axis=-1)
elif ds_classes == 2 and binary_classifier_model:
  # DS and Model both Binary
  print("Both Dataset and Model are Binary Classification -- ", end='')
  # Apply sigmoid activation if it has not been done and round with threshold = 0.5
  if (max(test_pred) <= 1) and (min(test_pred) >= 0):
    #test_pred = [np.array(x).round().astype(int).item() for x in test_pred]
    df['prediction'] = df['cont_prediction'].map(lambda x: 1 if x > threshold else 0)
    print("No Activation Done - All Values in [0, 1] - Rounding with Threshold = " + str(threshold))
  else:
    df['prediction'] = df['cont_prediction'].map(lambda x: tf.keras.activations.sigmoid(x))
    df['prediction'] = df['prediction'].map(lambda x: 1 if x > threshold else 0)
    print("Sigmoid Activation Done - Rounding with Threshold = " + str(threshold))
elif ds_classes == 2 and not binary_classifier_model:
  # Convert multi-class predictions to binary classification
  print("Converting Multi-Class Classifications into Binary Classifications")
  test_pred = np.argmax(test_pred, axis=-1)
  df['prediction'] = np.array([0 if x == 0 else 1 for x in test_pred])
else:
  raise DsModelMismatchError


df.head()
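
The two conversions used above can be seen in isolation on a few made-up numbers. This is only a sketch with hypothetical scores; the class order (0 = benign, 1 = melanoma, 2 = carcinoma) follows the description above.

```python
import numpy as np

threshold = 0.5

# Binary model emitting raw logits (values outside [0, 1]): sigmoid, then threshold.
logits = np.array([-2.0, 0.0, 3.0])
probs = 1.0 / (1.0 + np.exp(-logits))
binary_from_logits = (probs > threshold).astype(int)

# Three-class model (0 = benign, 1 = melanoma, 2 = carcinoma): argmax, then collapse.
multi_probs = np.array([
    [0.7, 0.2, 0.1],   # most likely benign
    [0.1, 0.6, 0.3],   # most likely melanoma
    [0.2, 0.3, 0.5],   # most likely carcinoma
])
multi_class = np.argmax(multi_probs, axis=-1)
binary_from_multi = (multi_class != 0).astype(int)

print(binary_from_logits)
print(binary_from_multi)
```

Note that the comparison is strict, so a score of exactly 0.5 is labelled benign.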

Now we have each image paired with its metadata, its continuous prediction, and its discrete prediction. So, we can start analyzing how the model did.

# **Analyzing The Predictions**

Now that we have the predictions lined up with the metadata, we can start analyzing how the model did over the dataset.

Firstly though, if you want to save the dataframe with the predictions included so that you can come back to this without having to run all of the above code, run the below cell. Just make sure to give it a good filename, and download it to your local storage afterwards.

In [None]:
# Set the output filename for the csv file
csv_output_filename = None

# SAVE THE METADATA WITH PREDICTIONS TO A CSV FILE
df.to_csv(csv_output_filename)

Likewise, use the code below to load a dataframe from a csv file, if you saved one with the code above. Just make sure to go back and run the installs and imports, as well as any cells in the dataset section that create variables specific to that dataset (e.g. "fitz_col").

In [None]:
df = pd.read_csv(None)
df.head()
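
One thing to watch for when saving and reloading: by default `to_csv` also writes the index, which comes back from `read_csv` as an extra column named `Unnamed: 0`. The sketch below shows the round trip on a toy dataframe, using an in-memory buffer instead of a real file; passing `index=False` is one possible way to avoid the extra column.

```python
import io

import pandas as pd

toy_df = pd.DataFrame({'skin_tone': [12, 34], 'prediction': [0, 1]})

# Default to_csv writes the index as an unnamed first column.
buf_with_index = io.StringIO()
toy_df.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# index=False keeps the column set identical after the round trip.
buf_no_index = io.StringIO()
toy_df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_no_index.columns))
```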

Now, if you only want to look at certain parts of the dataset in our testing (for example, only the images in the testing subset, or only the images that have "common" diagnoses), run the cells below.

**If you wish to keep all images in the testing, DO NOT RUN THE BELOW TWO CELLS**

In [None]:
# Remove all data that are not a part of the test set
df = df[df['Set'] == 'Test']
print("WARNING -- REMOVING ALL TRAINING SET IMAGES FROM DATA -- WARNING")
df.head()

In [None]:
# Remove all data that are not considered to have a common diagnosis
df = df[df['common'] == True]
print("WARNING -- REMOVING ALL UNCOMMON IMAGES FROM DATA -- WARNING")
df.head()

## **Analyzing The Model With The Entire Dataset**

Before we start to analyze how the model did against each aggregated skin type, let's first see how well it did against the dataset as a whole.

Firstly, let's create a classification report.

In [None]:
labels = df['label_ind'].to_numpy()
test_pred = df['prediction'].to_numpy()
print(classification_report(labels, test_pred))

Next, let's analyze the ROC-AUC for the entire dataset.

In [None]:
# Set the title for the graph
ds = 'All DDI Images'
title = 'ROC Curve - ' + ds

# Get the ROC-AUC data from the test results
# (named roc_thresholds so the decision threshold set earlier is not overwritten)
fpr, tpr, roc_thresholds = roc_curve(labels, df['cont_prediction'].to_numpy())
auc_keras = auc(fpr, tpr)

# Plot the ROC-AUC curve
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model (area = {:.3f})'.format(auc_keras))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title(title)
plt.legend(loc='best')
plt.show()

Lastly, let's create a confusion matrix to analyze how the model did on the overall dataset.

In [None]:
# Create the confusion matrix
labels = df['label_ind'].to_numpy()
preds = df['prediction'].to_numpy()
cm = sklearn.metrics.confusion_matrix(labels, preds)

# Plot the confusion matrix
disp = sklearn.metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
disp.plot()
plt.title('Confusion Matrix - Entire Dataset')
plt.show()

Lastly, while we have found this data before, let's go ahead and print the distribution of labels in the dataset we are using.

In [None]:
category_to_chart = 'label'

# Plot the distribution with its percentage.
fig, ax1 = plt.subplots(1, 1, figsize= (10, 5))
ax = df[category_to_chart].value_counts().plot(kind='bar', ax=ax1, title='Test Dataset Label Distribution',
                                               xlabel='Class', ylabel='Count')
ax.bar_label(ax.containers[0])

counts = df[category_to_chart].value_counts()
percents = counts.map(lambda x: (x / counts.sum()) * 100.0)
print("Percentages: ")
print(percents)

## **Analyzing Across Fitzpatrick Skin Type**

Now that we have looked at how the model did overall, let's look at how the model did against each Fitzpatrick skin type. Here, we will be computing and comparing the accuracy, the ROC-AUC, and the confusion matrices.

### Plotting the Fitzpatrick Skin Type Distribution of the Dataset

Now, we can see how the model did for certain skin types.

First though, let's just see the distribution of the dataset.

In [None]:
# CHANGE THIS TO PLOT DIFFERENT CATEGORIES
category_to_chart = fitz_col

# Plot the distribution with its percentage.
fig, ax1 = plt.subplots(1, 1, figsize= (10, 5))
ax = df[category_to_chart].value_counts().sort_index().plot(kind='bar', ax=ax1,
                                                            title='Test Dataset Fitzpatrick Skin Type Distribution',
                                                            xlabel='Fitzpatrick Skin Type',
                                                            ylabel='Number of Images')
ax.bar_label(ax.containers[0])

counts = df[category_to_chart].value_counts().sort_index()
percents = counts.map(lambda x: (x / counts.sum()) * 100.0)
print("Percentage of Dataset: ")
print(percents)

Now, let's see the distribution of skin type with information about the classes as well.

Firstly, we're going to define a helper function that will fill in any missing values from our value counts.

Then, we can go ahead and get the aggregated counts and plot them.

In [26]:
# Fills in the missing indexes in the Series 'srs' (must have all values specified in fitz_keys)
# with the value val
def fill_missing(srs, val=0):
  srs_dict = srs.to_dict()
  for key in fitz_keys:
    if key not in srs_dict:
      srs_dict[key] = val
  return pd.Series(srs_dict).sort_index()

# Get counts for the distribution for each skin type and class
agg_dist_counts = []
for cat in classes:
  # Extract all info for a given class
  sub_df = df.loc[df['label'] == cat]
  # Get counts for each skin type and fill in the missing counts
  skin_type_count = sub_df[fitz_col].value_counts().sort_index()
  skin_type_count = fill_missing(skin_type_count)
  # Add the counts and class (cat) to the list
  agg_dist_counts.append((cat, np.array(skin_type_count)))
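
As a side note, pandas can do the same fill in one call. The sketch below is an alternative to `fill_missing`, not what the notebook uses; it relies on `Series.reindex` with `fill_value`, and the counts are made up.

```python
import pandas as pd

fitz_keys = [12, 34, 56]

# Hypothetical counts for a class that has no images in the 56 group.
counts = pd.Series({12: 5, 34: 3})

# reindex adds the missing keys, in fitz_keys order, filled with 0.
filled = counts.reindex(fitz_keys, fill_value=0)
print(filled.to_dict())
```

Unlike the helper above, `reindex` would also drop any index value that is not in `fitz_keys`, so the two only agree when the Series contains no unexpected skin type values.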

Now, we can plot the distribution that we calculated above.

In [None]:
# Plot the distribution across each individual class
x = np.arange(1, len(fitz_keys) + 1)
width = 0.8 / len(classes)  # the width of the bars
multiplier = 0

fig, ax = plt.subplots(figsize=(18, 10), layout='constrained')

for class_name, values in agg_dist_counts:
  offset = width * multiplier
  rects = ax.bar(x + offset, values, width, label=class_name)
  ax.bar_label(rects, padding=3, fontsize='large')
  multiplier += 1

ax.set_xlabel('Fitzpatrick Skin Type', fontsize='large')
ax.set_ylabel('Count', fontsize='large')
ax.set_title('Fitzpatrick Skin Type Class Distribution')
xtick_offset = width if len(classes) > 2 else (0.5 * width)
ax.set_xticks(x + xtick_offset, fitz_keys)
ax.legend(loc='upper right', fontsize='xx-large')

plt.show()

### Analyzing The Model's Accuracy For Each Skin Type Across All Classes

Now that we have seen the distribution in the test set, we can extract the information that we want from the DataFrame. For each skin type, we pull out all of its rows, and then from those keep only the rows where the image was classified correctly. Comparing the number of samples in the two sets gives us an accuracy for that skin type.

In [None]:
agg_acc = []
class_agg_acc = [[], [], []]


# Get a list of the different values for fitz_col
f_types = np.unique(df[fitz_col].to_numpy())

# Get the count of correct predictions for each skin type
for f_type in f_types:
  # Extract the label and prediction for all data of current f_type, then repeat for only correct predictions
  sub_df = df.loc[df[fitz_col] == f_type][['label_ind', 'prediction', 'label']]
  sub_correct_df = sub_df.loc[sub_df['label_ind'] == sub_df['prediction']]

  # Calculate the overall accuracy
  num_correct = len(sub_correct_df)
  num_total = len(sub_df)
  agg_acc.append((num_correct, num_total))

  # Calculate the accuracy for the curr skin type and each label
  for i, cat in enumerate(classes):
    class_sub_df = sub_df.loc[sub_df['label'] == cat]
    class_corr_df = sub_correct_df.loc[sub_correct_df['label'] == cat]
    num_correct = len(class_corr_df)
    num_total = len(class_sub_df)
    class_agg_acc[i].append((num_correct, num_total))

# If the last list of class_agg_acc is empty, delete it
if len(class_agg_acc[-1]) == 0:
    class_agg_acc.pop(-1)
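
For reference, the same (num_correct, num_total) counts can also be produced with a pandas `groupby`. This is a sketch on a toy dataframe with invented values, assuming the DDI column name `skin_tone`; it is an alternative view of the loop above, not a replacement for it.

```python
import pandas as pd

# Hypothetical labels and predictions for six images.
toy_df = pd.DataFrame({
    'skin_tone':  [12, 12, 12, 34, 34, 56],
    'label_ind':  [0, 1, 1, 0, 1, 1],
    'prediction': [0, 1, 0, 0, 1, 0],
})

# Mark each row as correct or not, then count per skin type.
toy_df['correct'] = toy_df['label_ind'] == toy_df['prediction']
per_type = toy_df.groupby('skin_tone')['correct'].agg(Num_Correct='sum', Num_Total='count')
per_type['Accuracy'] = per_type['Num_Correct'] / per_type['Num_Total'] * 100

print(per_type)
```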


Now that we have a list of all of the counts of number of correct predictions with the total number of images for that type and class, we can process them to see how accurate the model was on our dataset.

Firstly, we will just define a function that will analyze one of the lists containing either 3 or 6 tuples of (num_correct, num_total), depending on whether the skin types are already aggregated or not. Then, we can use it to analyze the results.

In [30]:
# Processes a list of tuples containing (num_correct, num_total), corresponding
# to the list of skin types (fitz_types) which might or might not be aggregated.
def process_accuracies(acc_list, fitz_types, classification=None, precision=2):
  # Create a DataFrame for the agg_acc
  agg_acc_df = pd.DataFrame(acc_list, columns=['Num_Correct', 'Num_Total'])
  agg_acc_df['Accuracy'] = (agg_acc_df['Num_Correct'] / agg_acc_df['Num_Total']) * 100
  agg_acc_df['Fitzpatrick'] = fitz_types
  agg_acc_df = agg_acc_df[['Fitzpatrick', 'Accuracy', 'Num_Correct', 'Num_Total']]

  # Aggregate the data to combine Fitzpatrick skin types (1&2, 3&4, 5&6) - if not already done
  if len(acc_list) >= 6:       # if data not already aggregated
    # Mark existing data as not aggregated
    agg_acc_df['Aggregated'] = False
    # Aggregate the values
    comb = []
    for i in range(0, 6, 2):
      num_corr = acc_list[i][0] + acc_list[i+1][0]
      num_tot = acc_list[i][1] + acc_list[i+1][1]
      comb.append((num_corr, num_tot))
    comb_df = pd.DataFrame(comb, columns=['Num_Correct', 'Num_Total'])
    comb_df['Accuracy'] = (comb_df['Num_Correct'] / comb_df['Num_Total']) * 100
    comb_df['Fitzpatrick'] = ['1&2', '3&4', '5&6']
    comb_df = comb_df[['Fitzpatrick', 'Accuracy', 'Num_Correct', 'Num_Total']]
    comb_df['Aggregated'] = True
    # Combine the non-aggregated df and aggregated df
    agg_acc_df = pd.concat([agg_acc_df, comb_df])
  else:
    agg_acc_df['Aggregated'] = True

  # If the class is specified, add it to the dataframe
  if classification is not None:
    agg_acc_df['Class'] = classification
    agg_acc_df = agg_acc_df[['Fitzpatrick', 'Class', 'Accuracy', 'Num_Correct', 'Num_Total', 'Aggregated']]

  # Round the values of accuracy to the specified precision
  agg_acc_df['Accuracy'] = agg_acc_df['Accuracy'].map(lambda acc: round(acc, precision))

  return agg_acc_df
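
One detail worth spelling out: when two skin types are combined, `process_accuracies` adds the counts first and divides afterwards. That is not the same as averaging the two accuracies when the groups differ in size. A quick worked example with made-up counts:

```python
# Hypothetical (num_correct, num_total) counts for two skin types of very different size.
type_1 = (9, 10)
type_2 = (40, 90)

# Pooled accuracy, as in process_accuracies: add the counts, then divide.
pooled = (type_1[0] + type_2[0]) / (type_1[1] + type_2[1]) * 100

# Averaging the two per-type accuracies instead weights both types equally.
averaged = (type_1[0] / type_1[1] + type_2[0] / type_2[1]) / 2 * 100

print(round(pooled, 2), round(averaged, 2))
```

The pooled number is dominated by the larger group, so a small skin type group can be masked once it is aggregated with a bigger one.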

Firstly, let's just analyze how the model did in terms of accuracy for each skin type, across all classes. What is being measured here is the classification accuracy for each skin type.


$type\_n\_accuracy = \dfrac{num\_type\_n\_correct}{num\_type\_n\_total} * 100$


In [None]:
# Prepare the data into a dataframe
agg_acc_df = process_accuracies(agg_acc, f_types)

# Print a report for accuracy within each skin type
print("------ Accuracy Report on All Classes Per Fitzpatrick Skin Type ------")
agg_acc_df

### Analyzing the Model's Accuracy For Each Skin Type Across Each Class

Now, we can start analyzing the data in terms of both skin type and the class itself. This code will give you an accuracy for each skin type and class. If you see "NaN" for an accuracy term, that means that there were no images for that specific type and class.

Just to be clear, here accuracy is defined as:

$type\_n\_class\_c\_accuracy = \dfrac{num\_type\_n\_class\_c\_correct}{num\_type\_n\_class\_c\_total} * 100$

In [None]:
# Prepare the Dataframe
class_agg_acc_df_list = [process_accuracies(class_agg_acc[i], f_types, cat) for i, cat in enumerate(classes)]
class_agg_acc_df = pd.concat(class_agg_acc_df_list)

# Print a report for accuracy within each skin type for each classification
print("------ Accuracy Report on Per Class and Per Fitzpatrick Skin Type ------")
class_agg_acc_df
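
It may help to connect this per-class accuracy to more familiar names. For a binary problem, accuracy restricted to the Malignant class is the recall (sensitivity), and accuracy restricted to the Benign class is the specificity. The counts below are invented for illustration.

```python
# Hypothetical counts for one skin type.
tp, fn = 30, 10   # malignant images: predicted malignant / predicted benign
tn, fp = 54, 6    # benign images: predicted benign / predicted malignant

# Accuracy restricted to the malignant class is recall (sensitivity).
malignant_accuracy = tp / (tp + fn) * 100

# Accuracy restricted to the benign class is specificity.
benign_accuracy = tn / (tn + fp) * 100

# Overall accuracy mixes both, weighted by how many images each class has.
overall_accuracy = (tp + tn) / (tp + fn + tn + fp) * 100

print(malignant_accuracy, benign_accuracy, overall_accuracy)
```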

Now, we can visualize this data in a bar chart. Firstly, we will just process the data into a more usable form.

In [33]:
# Extract the accuracy information from the dataframe
acc_list = []        # Store accuracies of type (1, 2, 3, 4, 5, 6)
acc_agg_list = []    # Store accuracies of type (1&2, 3&4, 5&6)
for class_name in classes:
  sub_df = class_agg_acc_df.loc[class_agg_acc_df['Class'] == class_name][['Aggregated', 'Accuracy']]
  sub1_df = sub_df.loc[sub_df['Aggregated'] == False]
  sub2_df = sub_df.loc[sub_df['Aggregated'] == True]

  acc_list.append((class_name, sub1_df['Accuracy'].to_numpy()))
  acc_agg_list.append((class_name, sub2_df['Accuracy'].to_numpy()))

Now, let's plot the accuracies for the non-aggregated data. Note, if you are using the DDI dataset, this data will not exist, and so the cell will print a warning. The below cell will only plot for the SynthDerm dataset.

Note: If a bar in the chart says 0, then it had 0% accuracy. However, if no value is there, there were no images in the dataset for that specific skin type and class, so no accuracy value exists there.

In [None]:
if len(f_types) < 6:
  print('WARNING -- CANNOT PLOT ACCURACIES FOR NON-AGGREGATED DATA -- DATA DOES NOT EXIST')
else:
  # Plot the accuracy across each individual class and skin type
  x = np.arange(1,7)
  width = 0.8 / 6  # the width of the bars
  multiplier = 0

  fig, ax = plt.subplots(figsize=(15, 10), layout='constrained')

  for class_name, values in acc_list:
    offset = width * multiplier
    rects = ax.bar(x + offset, values, width, label=class_name)
    ax.bar_label(rects, padding=3, fontsize='large')
    multiplier += 1

  ax.set_xlabel('Fitzpatrick Skin Type', fontsize='large')
  ax.set_ylabel('Model Accuracy', fontsize='large')
  ax.set_title('Fitzpatrick Skin Type Class Accuracies')
  xtick_offset = width if len(classes) > 2 else (0.5 * width)
  ax.set_xticks(x + xtick_offset, (1, 2, 3, 4, 5, 6))
  ax.legend(loc='upper left', fontsize='xx-large')

  plt.show()

Now, let's plot the aggregated data.

In [None]:
# Plot the accuracy across each individual class and skin type
x = np.arange(1,4)
width = 0.8 / 6  # the width of the bars
multiplier = 0

fig, ax = plt.subplots(figsize=(15, 10), layout='constrained')

for class_name, values in acc_agg_list:
  offset = width * multiplier
  rects = ax.bar(x + offset, values, width, label=class_name)
  ax.bar_label(rects, padding=3, fontsize='large')
  multiplier += 1

ax.set_xlabel('Fitzpatrick Skin Type', fontsize='large')
ax.set_ylabel('Model Accuracy', fontsize='large')
ax.set_title('Fitzpatrick Skin Type Class Accuracies')
xtick_offset = width if len(classes) > 2 else (0.5 * width)
ax.set_xticks(x + xtick_offset, ('1&2', '3&4', '5&6'))
ax.legend(loc='upper right', fontsize='small')

plt.show()

### ROC-AUC for Binary Classification

Since ROC-AUC is mainly for binary classification, we are only going to continue here for our binary classification models (which we are mainly using for our final models anyways).

The code below will just split the data up into the different skin types, aggregating them if specified, and calculate all of the ROC-AUC data. If you want to aggregate the skin types (go from {1, 2, ..., 6} to {1&2, 3&4, 5&6}), set aggregate_f_type to True.

In [38]:
# !!!! SET THIS TO AGGREGATE OR NOT AGGREGATE THE RESULTS !!!!
aggregate_f_type = False
f_types = np.unique(df[fitz_col].to_numpy())

# Extract labels and predictions from df and calculate fpr, tpr, and auc for
# each skin type, with/without aggregating skin types.
def get_roc_data(df, aggregate, f_types):
  temp_fitz_col = fitz_col

  # Aggregate the skin types if needed
  if aggregate and len(f_types) == 6:
    fitz_agg_dict = {
        1: '1&2',
        2: '1&2',
        3: '3&4',
        4: '3&4',
        5: '5&6',
        6: '5&6'
    }
    temp_fitz_col = 'agg_fitz'
    df[temp_fitz_col] = df[fitz_col].map(fitz_agg_dict.get)
    f_types = np.unique(df[temp_fitz_col].to_numpy())

  # For each skin type, get the fpr, tpr, and auc
  roc_data = []
  f_types = np.unique(df[temp_fitz_col].to_numpy())
  for f_type in f_types:
    sub_df = df[df[temp_fitz_col] == f_type]
    labels = sub_df['label_ind'].to_numpy()
    preds = sub_df['cont_prediction'].to_numpy()
    try:
      fpr, tpr, threshold = roc_curve(labels, preds)
      auc_val = auc(fpr, tpr)
      roc_data.append((fpr, tpr, auc_val, f_type))
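    except Exception as err:
      # Skip skin types where the ROC curve cannot be computed, but say so
      print(f'Skipping skin type {f_type}: {err}')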

  return roc_data

roc_data = get_roc_data(df, aggregate_f_type, f_types)

Now that we have calculated all of our ROC-AUC data, we can go ahead and graph it.

In [None]:
# Set the name of the dataset used for the graph title
ds = 'All DDI Images'
title = 'ROC curve - ' + ds + ' - Agg. Fitzpatrick Skin Type'

# Plot the ROC-AUC graphs
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
for fpr, tpr, auc_val, fitz_type in roc_data:
  plt.plot(fpr, tpr, label=f'Fitz Type: {fitz_type} (area = {auc_val:.3f})')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title(title)
plt.legend(loc='best')
plt.show()
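
When comparing the areas in the legend, it can help to remember what the number means: the ROC-AUC is the probability that a randomly chosen malignant image gets a higher score than a randomly chosen benign image (ties count as half). The sketch below computes it that way on a handful of invented scores; it is an illustration of the definition, not the method `roc_curve` and `auc` use internally.

```python
import numpy as np

# Hypothetical labels and continuous scores for one skin type.
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.2, 0.6, 0.5, 0.7, 0.9])

pos = scores[labels == 1]
neg = scores[labels == 0]

# Compare every (positive, negative) pair; ties count as half a win.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(pairwise_auc)
```

This also shows why a skin type needs both benign and malignant images: with only one class present there are no pairs to compare, and the AUC is undefined.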

### Analyzing the Confusion Matrices, F1 Scores, and Classification Reports For Each Skin Type

Lastly, we are going to generate a confusion matrix for each skin type. This way, we can compare things like precision and recall for different skin types.

The cell below is just a function which will generate the confusion matrix itself. Below that, we will actually generate each of the plots.

In [41]:
# Returns a confusion matrix
def get_agg_confusion_matrix(df, fitz_type=None):
  '''
  Returns a sklearn confusion matrix object for a specific skin type.

  Parameters:
    df - DataFrame - The dataframe with all metadata, labels, and predictions

    fitz_type - int - The fitzpatrick skin type to return the aggregated information of.
                      If None, returns all skin type information

  Returns:
    A sklearn confusion matrix object.
  '''
  # Filter the data based on the skin type
  sub_df = df
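  if fitz_type is not None:
    sub_df = sub_df[sub_df[fitz_col] == fitz_type]

  # Create the confusion matrix (fixing the label order keeps the matrix
  # len(classes) x len(classes) even if a class is missing for this skin type)
  labels = sub_df['label_ind'].to_numpy()
  preds = sub_df['prediction'].to_numpy()
  cm = sklearn.metrics.confusion_matrix(labels, preds, labels=list(range(len(classes))))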
  return cm

Now, we can use the above function to create and plot each confusion matrix.

In [None]:
# Get a list of all available fitzpatrick skin types
fitz_types = list(np.unique(df[fitz_col].to_numpy()))

# For each skin type, generate and plot the confusion matrix
for i, fitz_type in enumerate(fitz_types):
  cm = get_agg_confusion_matrix(df, fitz_type)
  disp = sklearn.metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
  disp.plot()
  plt.title('Confusion Matrix - FST (Orig): ' + str(fitz_type))
  plt.show()

Now, we can do basically the same thing as we did above, but just calculate the F1 scores instead.

The below function will simply calculate the F1 score for a specified skin type.

In [44]:
def get_agg_F1_score(df, fitz_type=None):
  '''
  Returns the F1 score for a skin type.

  Parameters:
    df - DataFrame - The dataframe with all metadata, labels, and predictions

    fitz_type - int - The fitzpatrick skin type to return the aggregated information of.
                      If None, returns all skin type information

  Returns:
    The F1 score.
  '''
  # Filter the data based on the skin type
  sub_df = df
  if fitz_type is not None:
    sub_df = sub_df[sub_df[fitz_col] == fitz_type]

  # Compute the F1 score
  labels = sub_df['label_ind'].to_numpy()
  preds = sub_df['prediction'].to_numpy()
  f1_score = sklearn.metrics.f1_score(labels, preds)
  return f1_score

Now we can calculate and print out each F1 score

In [None]:
# Specify the precision to round F1 scores to
prec = 2

# Calculate and print each F1 score
for fitz_type in fitz_types:
  f1 = get_agg_F1_score(df, fitz_type)
  f1 = round(f1, prec)
  print('F1 Score - (Orig) FST ' + str(fitz_type) + ': ' + str(f1))
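
To see how the F1 score relates to the confusion matrices plotted earlier, here is a worked example on an invented 2x2 matrix. It assumes scikit-learn's layout (rows are true labels, columns are predictions) and the class order [Benign, Malignant], with Malignant as the positive class.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[50, 10],
               [ 5, 35]])
tn, fp, fn, tp = cm.ravel()

# Precision: of the images predicted malignant, how many really were.
precision = tp / (tp + fp)
# Recall: of the malignant images, how many were caught.
recall = tp / (tp + fn)
# F1 is the harmonic mean of the two.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```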

Lastly, we will repeat this process for the classification reports.

In [46]:
def get_agg_classification_report(df, fitz_type=None):
  '''
  Returns the classification report for a skin type.

  Parameters:
    df - DataFrame - The dataframe with all metadata, labels, and predictions

    fitz_type - int - The fitzpatrick skin type to return the aggregated information of.
                      If None, returns all skin type information

  Returns:
    The classification report.
  '''
  # Filter the data based on the skin type
  sub_df = df
  if fitz_type is not None:
    sub_df = sub_df[sub_df[fitz_col] == fitz_type]

  # Build the classification report
  labels = sub_df['label_ind'].to_numpy()
  preds = sub_df['prediction'].to_numpy()
  report = sklearn.metrics.classification_report(labels, preds)
  return report

Here, we will generate and print the classification report for each skin type.

In [None]:
# Generate and print the classification reports for each skin type
for fitz_type in fitz_types:
  cf = get_agg_classification_report(df, fitz_type)
  print('FST' + str(fitz_type) + ': ')
  print(cf, end='\n\n')
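
Since the goal of the notebook is to spot differences across skin types, one simple way to summarize any of the metrics above is the gap between the best and worst skin type. The sketch below uses invented scores and the DDI skin tone keys; the dictionary `metric_by_type` is a hypothetical container, not something the notebook builds.

```python
# Hypothetical per-skin-type scores for one metric (for example, F1).
metric_by_type = {12: 0.82, 34: 0.79, 56: 0.61}

# Find the skin types with the highest and lowest score, and the gap between them.
best_type = max(metric_by_type, key=metric_by_type.get)
worst_type = min(metric_by_type, key=metric_by_type.get)
gap = metric_by_type[best_type] - metric_by_type[worst_type]

print(f'best: {best_type}, worst: {worst_type}, gap: {gap:.2f}')
```

Keep in mind that a skin type with only a few images gives a noisy estimate, so a gap is a signal to investigate rather than proof of bias on its own.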

# **Visually Examining Model Accuracy**

As we discussed in our meeting, one of the last things that we are going to do in analyzing our model is to visually analyze the pictures to see which ones are classified correctly and which are not.

To do this, we are going to print out different subsets of images and see if we can find a pattern among them (ex: all images with hair are incorrectly classified).

First, we will add some shortened label names to the dataframe, which will help when printing this data.

In [None]:
shorten_dict = {
    'Benign': 'Ben',
    'Malignant': 'Mal'
}

pred_dict = {
    0: 'Ben',
    1: 'Mal'
}

df['short_label'] = df['label'].map(shorten_dict.get)
df['short_pred_label'] = df['prediction'].map(pred_dict.get)

df.head()

The function below returns a filtered version of the provided dataframe. We can use it to analyze different subsets of our dataset.

In [48]:
# Create a DataFrame subset given the specified parameters
def get_df_subset(_df, label_inds=None, fitz=None, correct=None, fp=False, fn=False):
  '''
  Returns a new filtered dataframe, given the specified parameters.

  Parameters:
    _df - DataFrame - The dataframe containing the image, and all associated metadata

    label_inds - List[int] - A list of integer labels to return. If None, all labels are returned.

    fitz - List[] - A list of strings or integers which match the Fitzpatrick labels. If None, all
                    Fitzpatrick types are returned.

    correct - bool - If True, only return images predicted correctly. If False, only return images
                     predicted incorrectly (both false positives and false negatives, unless fp or
                     fn is also set). If None, no filtering occurs.
                     Only set to True if fp and fn are not set.

    fp - bool - If True, only return images that had false positive classifications. If False, no
                filtering occurs.
                Only set to True if correct and fn are not set.

    fn - bool - If True, only return images that had false negative classifications. If False, no
                filtering occurs.
                Only set to True if correct and fp are not set.

  Returns:
    df - DataFrame - A filtered version of the original Dataframe
  '''
  assert len(_df) > 0, "ERROR - DataFrame does not have any tuples"
  df = _df

  # Filter by label index
  if label_inds:
    df = df[df['label_ind'].isin(label_inds)]

  # Filter by Fitzpatrick type
  if fitz:
    df = df[df[fitz_col].isin(fitz)]

  # Filter by correct or incorrect classifications
  if correct is True:
    assert (not fp) and (not fn), "ERROR - Cannot filter false positives, false negatives, and true positives/negatives at the same time"
    df = df[df['label_ind'] == df['prediction']]
  elif correct is False:
    df = df[df['label_ind'] != df['prediction']]

  # Filter false positives (label 0, predicted 1)
  if fp:
    assert (not correct) and (not fn), "ERROR - Cannot filter false positives, false negatives, and true positives/negatives at the same time"
    df = df[(df['label_ind'] == 0) & (df['prediction'] == 1)]

  # Filter false negatives (label 1, predicted 0)
  if fn:
    assert (not correct) and (not fp), "ERROR - Cannot filter false positives, false negatives, and true positives/negatives at the same time"
    df = df[(df['label_ind'] == 1) & (df['prediction'] == 0)]

  print(f"Original DataFrame Size: {len(_df)}\tNew DataFrame Size: {len(df)}")
  return df
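
To make the relationship between the `correct`, `fp`, and `fn` filters concrete, here is a small sketch on made-up data. It assumes the same binary encoding used above (0 for benign, 1 for malignant). The toy dataframe and the mask names `is_fp`, `is_fn`, and `is_wrong` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy labels and predictions (0 = benign, 1 = malignant)
toy_df = pd.DataFrame({
    'label_ind':  [0, 0, 1, 1, 1],
    'prediction': [0, 1, 1, 0, 0],
})

# False positive: a benign image predicted as malignant
is_fp = (toy_df['label_ind'] == 0) & (toy_df['prediction'] == 1)

# False negative: a malignant image predicted as benign
is_fn = (toy_df['label_ind'] == 1) & (toy_df['prediction'] == 0)

# Any incorrect prediction, which is what correct=False selects
is_wrong = toy_df['label_ind'] != toy_df['prediction']

# In the binary case, every error is either a false positive or a false negative
print(int(is_fp.sum()), int(is_fn.sum()), int(is_wrong.sum()))
```

This is why filtering with `correct=False` alone returns both kinds of error, while `fp=True` or `fn=True` narrows the result to one of them.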

Next, we define a function that displays a series of images. We will use it to show the results from the function above.

In [50]:
# Print a list of images with their label, prediction, id, and fitzpatrick info
def print_images(paths, labels, predictions, ids, fitz):
  # Lay the images out 5 per row, rounding the number of rows up
  num_rows = max(1, int(np.ceil(len(paths) / 5)))
  fig_len = num_rows * 5
  fig = plt.figure(figsize=(15, fig_len))

  new_img_size = (224, 224)
  print("LEGEND: LABEL PREDICTION")

  for i in range(len(paths)):
    ax = fig.add_subplot(num_rows, 5, i + 1)

    # Load the image and resize it for display
    img = tf.io.read_file(paths[i])
    img = tf.image.decode_image(img, channels=3, expand_animations=False)
    img = tf.image.resize(img, new_img_size)
    img = tf.cast(img, 'uint8')
    ax.imshow(img)
    ax.axis('off')

    # Title: label and prediction. Annotations: id (left) and skin type (right)
    ax.set_title(str(labels[i]) + " " + str(predictions[i]))
    ax.annotate(str(ids[i]), (0.01, 0))
    ax.annotate(str(fitz[i]), (200, 0.01))

  # Adjust the spacing once all subplots have been added
  fig.tight_layout()

Now we can finally print out a set of images to visually examine how our model is doing. All that you need to do is specify which kind of images you would like to see. Read the documentation for the get_df_subset function for more information on how to select the images that you want. Once that function has returned a new subset of images, the print_images function displays them.


Each image is shown with the following pieces of information.

The largest text is the label and the prediction, in this order:

  LABEL PREDICTION

  Ex: Mal Mal

The small numbers on the top left of the images are the ids, so that you can pick them out if you want to analyze them in more depth.

The small numbers on the top right are the Fitzpatrick skin type information for that image.

To use this section, change the parameters to the get_df_subset function in order to analyze different parts of the dataset.

As a warning, because we are printing out so many images, a large subset of images might take a while to print out.

In [None]:
# Get a subset of images from the original dataset
subset_df = get_df_subset(df, fitz=[56])

# !!!! DO NOT CHANGE THE BELOW CODE !!!! #
# Extract the information necessary for printing
paths = subset_df['path'].to_numpy()
labels = subset_df['short_label'].to_numpy()
preds = subset_df['short_pred_label'].to_numpy()
ids = subset_df['DDI_ID'].to_numpy()
skin_tones = subset_df['skin_tone'].to_numpy()

# Print the subset of images
print_images(paths, labels, preds, ids, skin_tones)
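
Because a large subset can take a while to print, one option is to cap the number of images before extracting the columns above. The block below is a minimal sketch of that idea. The `cap_subset` helper, its `max_images` and `seed` parameters, and the toy dataframe are hypothetical and are not part of this notebook.

```python
import pandas as pd

def cap_subset(df, max_images=25, seed=0):
    '''Returns at most max_images rows, sampled reproducibly from df.'''
    if len(df) <= max_images:
        return df
    # A fixed random_state returns the same sample on every run
    return df.sample(n=max_images, random_state=seed)

# Hypothetical toy dataframe standing in for a large subset
toy_df = pd.DataFrame({'path': [f'img_{i}.png' for i in range(100)]})

capped_df = cap_subset(toy_df, max_images=25)
print(len(capped_df))
```

Sampling at random, instead of taking the first rows, avoids always looking at the same images when the dataframe is ordered by id or label.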