Project Group : GD1 <br>
Project Name : Melanoma Skin Detection using DCNN

### First, import all libraries used in this project

This code block contains necessary imports for the project, including Python libraries such as os, cv2, matplotlib, numpy, and pandas, as well as PyTorch and sklearn libraries. It also sets the random seeds for reproducibility and prints the list of directories in the input folder.

In [None]:
%matplotlib inline
# python libraries
import os, cv2,itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm
from glob import glob
from PIL import Image

# pytorch libraries
import torch
from torch import optim,nn
from torch.autograd import Variable
from torch.utils.data import DataLoader,Dataset
from torchvision import models,transforms

# sklearn libraries
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# make the results reproducible
np.random.seed(10)
torch.manual_seed(10)
torch.cuda.manual_seed(10)

print(os.listdir("../input"))

['hmnist_8_8_RGB.csv', 'hmnist_28_28_RGB.csv', 'HAM10000_images_part_1', 'ham10000_images_part_1', 'hmnist_8_8_L.csv', 'HAM10000_images_part_2', 'ham10000_images_part_2', 'hmnist_28_28_L.csv', 'HAM10000_metadata.csv']


## Step 1. Data analysis and preprocessing

Get all image paths and match each row in HAM10000_metadata.csv with its corresponding image.

This code block defines the path to the dataset directory and creates a dictionary `imageid_path_dict` where the keys are the image ids and the values are the paths to the corresponding images. The `lesion_type_dict` is another dictionary that maps the abbreviations for different types of skin lesions to their full names. The purpose of creating these dictionaries is to simplify the process of accessing and organizing the dataset during training and evaluation. The `glob` function is used to search for all `.jpg` files within subdirectories of the `data_dir`.

In [None]:
data_dir = '../input'
all_image_path = glob(os.path.join(data_dir, '*', '*.jpg'))
imageid_path_dict = {os.path.splitext(os.path.basename(x))[0]: x for x in all_image_path}
lesion_type_dict = {
    'nv': 'Melanocytic nevi',
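    'mel': 'dermatofibroma',  # NOTE: 'mel' denotes melanoma in HAM10000; the label is left as-is because the recorded outputs and class indices below were produced with it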
    'bkl': 'Benign keratosis-like lesions ',
    'bcc': 'Basal cell carcinoma',
    'akiec': 'Actinic keratoses',
    'vasc': 'Vascular lesions',
    'df': 'Dermatofibroma'
}
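
As a small illustration of the dictionary comprehension above, the sketch below builds the same id-to-path mapping from a hand-written list of paths. The paths are hypothetical stand-ins for the `glob` result and no files are read. Because dictionary keys are unique, an image that is found in more than one folder collapses to a single entry (the last path seen wins).

```python
import os

# Hypothetical paths standing in for the result of glob(...); no files are read.
example_paths = [
    "../input/ham10000_images_part_1/ISIC_0027419.jpg",
    "../input/HAM10000_images_part_1/ISIC_0027419.jpg",  # same image id, other folder
    "../input/ham10000_images_part_2/ISIC_0031633.jpg",
]

# Same comprehension as above: drop the directory and the extension to get the image id.
example_id_to_path = {
    os.path.splitext(os.path.basename(p))[0]: p for p in example_paths
}

print(len(example_id_to_path))
print(example_id_to_path["ISIC_0031633"])
```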

This function computes the mean and standard deviation over the whole dataset; the results are used to normalize the inputs.

This function `compute_img_mean_std()` takes a list of image paths as input and computes the mean and standard deviation of each channel (RGB) of the images in the list. It does this by resizing each image to 224x224 pixels, stacking them in a numpy array, normalizing the pixel values to be between 0 and 1, and then computing the mean and standard deviation for each channel. The function then prints the mean and standard deviation for each channel and returns them as a tuple. This function can be used to normalize the images before training a neural network.

In [None]:
def compute_img_mean_std(image_paths):
    """Computes the mean and std of the three channels on the whole dataset."""
    img_h, img_w = 224, 224
    # cv2.resize expects the target size as (width, height)
    imgs = [cv2.resize(cv2.imread(img_path), (img_w, img_h)) for img_path in tqdm(image_paths)]
    # stack to shape (H, W, 3, N) and scale pixel values to [0, 1]
    imgs = np.stack(imgs, axis=3).astype(np.float32) / 255.
    means, stdevs = [], []
    for i in range(3):
        # all pixels of channel i across every image, flattened to one row
        pixels = imgs[:, :, i, :].ravel()
        means.append(np.mean(pixels))
        stdevs.append(np.std(pixels))
    means, stdevs = means[::-1], stdevs[::-1]  # cv2 loads BGR; reverse to RGB order
    print("normMean = {}".format(means))
    print("normStd = {}".format(stdevs))
    return means, stdevs
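
To make the per-channel reduction concrete, here is a minimal NumPy sketch on synthetic data. The array layout (N x H x W x 3) and the random pixel values are assumptions chosen for the example; they are not the layout used inside `compute_img_mean_std`, which stacks images along the last axis.

```python
import numpy as np

# Synthetic stand-in for a stack of images: N x H x W x 3, values in [0, 1).
rng = np.random.default_rng(0)
fake_imgs = rng.random((5, 8, 8, 3)).astype(np.float32)

# Reduce over every axis except the channel axis to get one value per channel.
channel_mean = fake_imgs.mean(axis=(0, 1, 2))
channel_std = fake_imgs.std(axis=(0, 1, 2))

print(channel_mean.shape, channel_std.shape)
```

Each result has three entries, one per channel, which is the shape `transforms.Normalize` expects for its `mean` and `std` arguments.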

Return the mean and std of RGB channels

In [None]:
#norm_mean,norm_std = compute_img_mean_std(all_image_path)

Add three columns to the original DataFrame: path (image path), cell_type (the full lesion name), and cell_type_idx (the integer code of the cell type, used as the image label).

This code reads the metadata CSV file containing information about the images and adds three columns to it: 'path', 'cell_type', and 'cell_type_idx'. 

'path' column stores the path of the image file for each image id.

'cell_type' column maps the abbreviation of each type of skin lesion to its full name using the 'lesion_type_dict' dictionary.

'cell_type_idx' column stores the categorical codes for the skin lesion types obtained from the 'cell_type' column using the 'pd.Categorical' method.

The resulting pandas DataFrame 'df_original' contains the metadata of all images in the dataset.

In [None]:
df_original = pd.read_csv(os.path.join(data_dir, 'HAM10000_metadata.csv'))
df_original['path'] = df_original['image_id'].map(imageid_path_dict.get)
df_original['cell_type'] = df_original['dx'].map(lesion_type_dict.get)
df_original['cell_type_idx'] = pd.Categorical(df_original['cell_type']).codes
df_original.head()

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization,path,cell_type,cell_type_idx
0,HAM_0000118,ISIC_0027419,bkl,histo,80.0,male,scalp,../input/ham10000_images_part_1/ISIC_0027419.jpg,Benign keratosis-like lesions,2
1,HAM_0000118,ISIC_0025030,bkl,histo,80.0,male,scalp,../input/ham10000_images_part_1/ISIC_0025030.jpg,Benign keratosis-like lesions,2
2,HAM_0002730,ISIC_0026769,bkl,histo,80.0,male,scalp,../input/ham10000_images_part_1/ISIC_0026769.jpg,Benign keratosis-like lesions,2
3,HAM_0002730,ISIC_0025661,bkl,histo,80.0,male,scalp,../input/ham10000_images_part_1/ISIC_0025661.jpg,Benign keratosis-like lesions,2
4,HAM_0001466,ISIC_0031633,bkl,histo,75.0,male,ear,../input/ham10000_images_part_2/ISIC_0031633.jpg,Benign keratosis-like lesions,2
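
The integer codes come from `pd.Categorical`, which numbers the categories in sorted order of their names rather than in the order of `lesion_type_dict`. The sketch below uses a handful of label strings taken from the dictionary to show the effect; the short series is made up for illustration. Uppercase letters sort before lowercase ones, so the lowercase `'dermatofibroma'` label receives the highest code.

```python
import pandas as pd

# A few labels from lesion_type_dict, in arbitrary order (toy data).
labels = pd.Series(
    ["Melanocytic nevi", "dermatofibroma", "Dermatofibroma", "Melanocytic nevi"]
)

cat = pd.Categorical(labels)
print(list(cat.categories))  # categories in sorted order
print(list(cat.codes))       # integer code of each row
```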


This code block identifies the lesion_id's that have exactly one image. Several images in the dataset show the same lesion, so grouping by `lesion_id` and counting reveals which lesions were photographed only once. The resulting dataframe `df_undup` has one row per such lesion_id; after `groupby(...).count()` the remaining columns hold counts rather than metadata, which is why every value in the output below is 1. The `reset_index` method turns `lesion_id` back into a regular column after the grouping.

In [None]:
# this will tell us how many images are associated with each lesion_id
df_undup = df_original.groupby('lesion_id').count()
# now we keep only the lesion_id's that have exactly one image associated with them
df_undup = df_undup[df_undup['image_id'] == 1]
df_undup.reset_index(inplace=True)
df_undup.head()

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization,path,cell_type,cell_type_idx
0,HAM_0000001,1,1,1,1,1,1,1,1,1
1,HAM_0000003,1,1,1,1,1,1,1,1,1
2,HAM_0000004,1,1,1,1,1,1,1,1,1
3,HAM_0000007,1,1,1,1,1,1,1,1,1
4,HAM_0000008,1,1,1,1,1,1,1,1,1
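
The per-lesion count can also be attached to every image row directly, which avoids a separate lookup table. This is an alternative sketch on toy data, not the approach used in this notebook; the lesion and image ids are invented for the example.

```python
import pandas as pd

# Toy metadata: lesion HAM_A has two images, the others have one each.
toy = pd.DataFrame({
    "lesion_id": ["HAM_A", "HAM_A", "HAM_B", "HAM_C"],
    "image_id": ["img_1", "img_2", "img_3", "img_4"],
})

# transform("count") returns one value per row: the size of that row's group.
images_per_lesion = toy.groupby("lesion_id")["image_id"].transform("count")
toy["duplicates"] = images_per_lesion.map(
    lambda n: "unduplicated" if n == 1 else "duplicated"
)
print(toy)
```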


This code defines a function `get_duplicates` that takes a lesion ID and checks whether it is present in the list of `lesion_id` values that have only one associated image in the dataset. If it is present, the function returns the string `'unduplicated'`, otherwise it returns `'duplicated'`. 

The function is then applied to a new column `duplicates` in the `df_original` dataframe, which contains the lesion ID, image ID, and other metadata for each image in the dataset. This allows us to determine whether each image is duplicated or not based on its lesion ID. 

Note that `df_undup` is a dataframe that contains only the `lesion_id` values that have a single associated image in the dataset. The function `get_duplicates` checks whether a given lesion ID is present in this list, and thus whether the image is duplicated or not.

In [None]:
# here we identify lesion_id's that have duplicate images and those that have only one image.
# build the lookup once; a set makes each membership test fast
unique_ids = set(df_undup['lesion_id'])

def get_duplicates(x):
    if x in unique_ids:
        return 'unduplicated'
    else:
        return 'duplicated'

# create a new column that is a copy of the lesion_id column
df_original['duplicates'] = df_original['lesion_id']
# apply the function to this new column
df_original['duplicates'] = df_original['duplicates'].apply(get_duplicates)
df_original.head()

Unnamed: 0,lesion_id,image_id,dx,dx_type,age,sex,localization,path,cell_type,cell_type_idx,duplicates
0,HAM_0000118,ISIC_0027419,bkl,histo,80.0,male,scalp,../input/ham10000_images_part_1/ISIC_0027419.jpg,Benign keratosis-like lesions,2,duplicated
1,HAM_0000118,ISIC_0025030,bkl,histo,80.0,male,scalp,../input/ham10000_images_part_1/ISIC_0025030.jpg,Benign keratosis-like lesions,2,duplicated
2,HAM_0002730,ISIC_0026769,bkl,histo,80.0,male,scalp,../input/ham10000_images_part_1/ISIC_0026769.jpg,Benign keratosis-like lesions,2,duplicated
3,HAM_0002730,ISIC_0025661,bkl,histo,80.0,male,scalp,../input/ham10000_images_part_1/ISIC_0025661.jpg,Benign keratosis-like lesions,2,duplicated
4,HAM_0001466,ISIC_0031633,bkl,histo,75.0,male,ear,../input/ham10000_images_part_2/ISIC_0031633.jpg,Benign keratosis-like lesions,2,duplicated


In [None]:
df_original['duplicates'].value_counts()

unduplicated    5514
duplicated      4501
Name: duplicates, dtype: int64

In [None]:
# now we keep only the images that don't have duplicates
df_undup = df_original[df_original['duplicates'] == 'unduplicated']
df_undup.shape

(5514, 11)

In [None]:
# now we create a val set from df_undup, because we are sure that none of these images have augmented duplicates in the train set
y = df_undup['cell_type_idx']
_, df_val = train_test_split(df_undup, test_size=0.2, random_state=101, stratify=y)
df_val.shape

(1103, 11)

In [None]:
df_val['cell_type_idx'].value_counts()

4    883
2     88
6     46
1     35
0     30
5     13
3      8
Name: cell_type_idx, dtype: int64
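
These counts also set a reference point for reading the validation accuracy later. The sketch below computes the accuracy of a trivial classifier that always predicts the most frequent class; the counts are copied from the output above, and the variable names are only for this example.

```python
# Validation class counts copied from the value_counts() output above.
val_counts = {4: 883, 2: 88, 6: 46, 1: 35, 0: 30, 5: 13, 3: 8}

total = sum(val_counts.values())
majority_class = max(val_counts, key=val_counts.get)
baseline_acc = val_counts[majority_class] / total

print(total, majority_class, round(baseline_acc, 4))  # 1103 4 0.8005
```

A model therefore needs to exceed roughly 80% validation accuracy before it does better than always predicting class 4, which is why per-class metrics matter for this dataset.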

This code defines a function `get_val_rows` that takes an image ID and returns whether it is in the validation or training set. It then applies this function to the `image_id` column of `df_original` and stores the result in a new column called `train_or_val`.

It then selects the training rows of `df_original` by creating a new dataframe `df_train` that only includes the rows where `train_or_val` is equal to `'train'`. It also prints out the number of rows in `df_train` and `df_val`.

In [None]:
# This set will be df_original excluding all rows that are in the val set
# Build the lookup of image_id's in the val set once instead of on every call
val_ids = set(df_val['image_id'])

# This function identifies if an image is part of the train or val set.
def get_val_rows(x):
    if str(x) in val_ids:
        return 'val'
    else:
        return 'train'

# identify train and val rows
# create a new column that is a copy of the image_id column
df_original['train_or_val'] = df_original['image_id']
# apply the function to this new column
df_original['train_or_val'] = df_original['train_or_val'].apply(get_val_rows)
# keep only the train rows
df_train = df_original[df_original['train_or_val'] == 'train']
print(len(df_train))
print(len(df_val))

8912
1103
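
The point of drawing the validation set from single-image lesions is that no lesion appears on both sides of the split. A quick way to check this property is to intersect the lesion ids of the two sets. The sketch below does so on toy data with an invented split; on the real data the same check would use `df_train` and `df_val`.

```python
import pandas as pd

# Toy metadata: lesion HAM_A has two images, the others have one each.
toy = pd.DataFrame({
    "lesion_id": ["HAM_A", "HAM_A", "HAM_B", "HAM_C", "HAM_D"],
    "image_id": ["img_1", "img_2", "img_3", "img_4", "img_5"],
})

# Hypothetical split: the validation image comes from a single-image lesion.
toy_val = toy[toy["image_id"].isin(["img_4"])]
toy_train = toy[~toy["image_id"].isin(toy_val["image_id"])]

# Lesions present in both sets would indicate leakage.
shared_lesions = set(toy_train["lesion_id"]) & set(toy_val["lesion_id"])
print(len(toy_train), len(toy_val), shared_lesions)
```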


In [None]:
df_train['cell_type_idx'].value_counts()

4    5822
6    1067
2    1011
1     479
0     297
5     129
3     107
Name: cell_type_idx, dtype: int64

In [None]:
df_val['cell_type'].value_counts()

Melanocytic nevi                  883
Benign keratosis-like lesions      88
dermatofibroma                     46
Basal cell carcinoma               35
Actinic keratoses                  30
Vascular lesions                   13
Dermatofibroma                      8
Name: cell_type, dtype: int64

**From the above statistics of each category, we can see that there is a serious class imbalance in the training data. I think two approaches can address this: balanced (equalization) sampling, and a loss function that mitigates class imbalance during training, such as focal loss.**
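
Focal loss is only mentioned here and is not used in the rest of this notebook. For reference, a minimal sketch of a multi-class focal loss in PyTorch could look like the following. The function name `focal_loss` and the default `gamma=2.0` are assumptions for this example, and the per-class weighting term that some variants include is left out.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Mean of (1 - p_t)^gamma * CE, where p_t is the predicted probability of the true class."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample -log(p_t)
    p_t = torch.exp(-ce)                                     # recover p_t from the loss
    return ((1.0 - p_t) ** gamma * ce).mean()

# Two toy samples: the first is classified confidently, the second is not.
logits = torch.tensor([[4.0, 0.0, 0.0], [0.5, 0.4, 0.3]])
targets = torch.tensor([0, 0])

print(F.cross_entropy(logits, targets).item())
print(focal_loss(logits, targets, gamma=0.0).item())  # gamma = 0 reduces to cross entropy
print(focal_loss(logits, targets, gamma=2.0).item())  # easy samples are down-weighted
```

With `gamma=0` the factor `(1 - p_t)^gamma` equals 1 and the loss matches plain cross entropy; larger values shrink the contribution of samples the model already classifies well.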

This code snippet oversamples the smaller classes by copying their existing rows in the training set multiple times. No new images are created at this point; the random transforms applied at load time make the copies differ. The list `data_aug_rate` has 7 elements, one per `cell_type_idx`, and gives the factor by which each class is multiplied.

For example, `data_aug_rate = [15,10,5,50,0,40,5]` means that class 0 (Actinic keratoses) ends up with 15 times its original rows (14 extra copies are appended), class 1 (Basal cell carcinoma) with 10 times, class 2 (Benign keratosis-like lesions) with 5 times, and so on. A rate of 0 leaves the class unchanged, which is the case for class 4 (Melanocytic nevi), the majority class.

`pd.concat` is used to append the copies to the training set (`DataFrame.append` was removed in pandas 2.0). The `ignore_index=True` argument resets the index values of the result to avoid duplicate index values.

Finally, the `value_counts` method is used to verify that the classes are now roughly balanced.

In [None]:
# Copy the rows of the smaller classes to balance the 7 classes
data_aug_rate = [15,10,5,50,0,40,5]
for i in range(7):
    if data_aug_rate[i]:
        df_train = pd.concat([df_train] + [df_train.loc[df_train['cell_type_idx'] == i,:]]*(data_aug_rate[i]-1), ignore_index=True)
df_train['cell_type'].value_counts()

Melanocytic nevi                  5822
Dermatofibroma                    5350
dermatofibroma                    5335
Vascular lesions                  5160
Benign keratosis-like lesions     5055
Basal cell carcinoma              4790
Actinic keratoses                 4455
Name: cell_type, dtype: int64
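
The counts above can be reproduced with plain arithmetic from the training counts shown earlier and the rates in `data_aug_rate`. The short check below copies both from this notebook; only the variable names `train_counts` and `balanced_counts` are introduced for the example.

```python
# Training counts per cell_type_idx, copied from the value_counts() output earlier.
train_counts = {0: 297, 1: 479, 2: 1011, 3: 107, 4: 5822, 5: 129, 6: 1067}
data_aug_rate = [15, 10, 5, 50, 0, 40, 5]

# A rate r > 0 appends r - 1 extra copies, so the class ends up r times its size;
# a rate of 0 leaves the class untouched.
balanced_counts = {
    idx: n * data_aug_rate[idx] if data_aug_rate[idx] else n
    for idx, n in train_counts.items()
}
print(balanced_counts)
# {0: 4455, 1: 4790, 2: 5055, 3: 5350, 4: 5822, 5: 5160, 6: 5335}
```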

At the beginning, I divided the data into three parts, training set, validation set and test set. Considering the small amount of data, I did not further divide the validation set data in practice.

In [None]:
# # We can split the test set again in a validation set and a true test set:
# df_val, df_test = train_test_split(df_val, test_size=0.5)
df_train = df_train.reset_index()
df_val = df_val.reset_index()
# df_test = df_test.reset_index()

## Step 2. Model building

This function takes a PyTorch model and a boolean value indicating whether to perform feature extraction or fine-tuning. If `feature_extracting` is set to `True`, all parameters in the model are set to not require gradients (`param.requires_grad = False`) so that only the newly added layers require gradients during training. If `feature_extracting` is set to `False`, all parameters in the model require gradients and can be fine-tuned during training.

In [None]:
# feature_extract is a boolean that defines if we are finetuning or feature extracting. 
# If feature_extract = False, the model is finetuned and all model parameters are updated. 
# If feature_extract = True, only the last layer parameters are updated, the others remain fixed.
def set_parameter_requires_grad(model, feature_extracting):
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False
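
To see the effect of freezing on a concrete case, the sketch below applies the same loop to a tiny stand-in network and then replaces its last layer. The network is an assumption made up for this example (it is neither DenseNet nor Inception); it only serves to show which parameters remain trainable.

```python
import torch
from torch import nn

# Tiny stand-in network, used only to inspect the requires_grad flags.
toy_model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# Freeze everything, as set_parameter_requires_grad(model, True) does.
for param in toy_model.parameters():
    param.requires_grad = False

# A freshly created layer has requires_grad=True by default, so only it will train.
toy_model[2] = nn.Linear(4, 7)

trainable = [name for name, p in toy_model.named_parameters() if p.requires_grad]
print(trainable)  # ['2.weight', '2.bias']
```

This is why the function is called before the classifier is replaced: the new layer is created afterwards and stays trainable.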

This function initializes a CNN model for transfer learning using a pre-trained model. The function takes four arguments:

- `model_name`: A string that specifies which pre-trained model to use. The code below implements "densenet" (DenseNet-121) and "inception" (Inception v3).
- `num_classes`: An integer that specifies the number of output classes of the final classification layer.
- `feature_extract`: A boolean that specifies whether to fine-tune the entire model or only the final layer.
- `use_pretrained`: A boolean that specifies whether to use the pre-trained weights for the model.

The function initializes the specified pre-trained model, freezes its pre-trained layers when `feature_extract` is `True`, replaces the final classification layer with a new fully connected layer with the number of output classes specified by the `num_classes` argument, and returns the modified model and the input size of the model (224 for DenseNet-121, 299 for Inception v3).

In [None]:
def initialize_model(model_name, num_classes, feature_extract, use_pretrained=True):
    # Initialize these variables which will be set in this if statement. Each of these
    #   variables is model specific.
    model_ft = None
    input_size = 0

    if model_name == "densenet":
        """ Densenet121
        """
        model_ft = models.densenet121(pretrained=use_pretrained)
        '''model_ft.fc = nn.Sequential(
        nn.Linear(num_classes, 1024),
        nn.BatchNorm1d(1024),
        nn.ReLU(inplace=True),
        nn.Linear(1024, 512),
        nn.BatchNorm1d(512),
        nn.ReLU(inplace=True),
        nn.Linear(512, 256),
        nn.BatchNorm1d(256),
        nn.ReLU(inplace=True),
        nn.Linear(256, 128),
        nn.BatchNorm1d(128),
        nn.ReLU(inplace=True),
        nn.Linear(128, num_classes))'''
        set_parameter_requires_grad(model_ft, feature_extract)
        num_ftrs = model_ft.classifier.in_features
        model_ft.classifier = nn.Linear(num_ftrs, num_classes)
        input_size = 224

    elif model_name == "inception":
        """ Inception v3
        Be careful, expects (299,299) sized images and has auxiliary output
        """
        model_ft = models.inception_v3(pretrained=use_pretrained)
        set_parameter_requires_grad(model_ft, feature_extract)
        # Handle the auxilary net
        num_ftrs = model_ft.AuxLogits.fc.in_features
        model_ft.AuxLogits.fc = nn.Linear(num_ftrs, num_classes)
        # Handle the primary net
        num_ftrs = model_ft.fc.in_features
        model_ft.fc = nn.Linear(num_ftrs,num_classes)
        input_size = 299

    else:
        # fail with a clear error instead of shutting down the interpreter
        raise ValueError(f"Invalid model name: {model_name!r}")
    return model_ft, input_size

You can change the backbone network. `initialize_model` above implements two (densenet and inception); other torchvision backbones such as resnet or vgg would need their own branch. Each architecture also has several versions. Considering the limited training data, we used ImageNet pre-trained models for fine-tuning. This can speed up the convergence of the model and improve the accuracy.

One thing to pay attention to: the input size of Inception (299x299) differs from the others, so the resize setting in the compute_img_mean_std() function needs to change accordingly.

The next code cell uses the following names:

- `model_name`: a string variable that specifies the name of the model to be used for transfer learning. With the `initialize_model` function above it can be 'densenet' or 'inception'.
- `num_classes`: an integer variable that specifies the number of classes in the dataset.
- `feature_extract`: a boolean variable that specifies whether to finetune all layers of the model or only the top layer.
- `model_ft`: a variable that holds the pretrained model.
- `input_size`: an integer variable that specifies the input size of the model.
- `initialize_model`: a function that initializes the specified model for transfer learning by changing its classifier layer to output the desired number of classes.
- `device`: a variable that specifies the device (CPU or GPU) to use for training the model.
- `model`: a variable that holds the model to be used for training. It is moved to the specified device.

In [None]:
# implemented in initialize_model: densenet, inception
model_name = 'inception'
num_classes = 7
feature_extract = False
# Initialize the model for this run
model_ft, input_size = initialize_model(model_name, num_classes, feature_extract, use_pretrained=True)
# Define the device:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available
# Put the model on the device:
model = model_ft.to(device)

Downloading: "https://download.pytorch.org/models/inception_v3_google-1a9a5a14.pth" to /root/.torch/models/inception_v3_google-1a9a5a14.pth
108857766it [00:00, 111340717.65it/s]


Documentation for the code below:

- `norm_mean` and `norm_std` are the per-channel (RGB) mean and standard deviation used to normalize the input images during training and validation. The values hard-coded below are not produced by the commented-out `compute_img_mean_std` call above; they coincide with the commonly quoted CIFAR-10 statistics, so recomputing them on HAM10000 is advisable.

- `train_transform` is a composition of several image transformation techniques that will be applied to the training set of images. These techniques include resizing the image to `input_size`, random horizontal and vertical flips, random rotations up to 20 degrees, color jittering, and normalization using the `norm_mean` and `norm_std` values.

- `val_transform` is similar to `train_transform`, but it only applies resizing and normalization to the validation set of images.

In [None]:
norm_mean = (0.49139968, 0.48215827, 0.44653124)
norm_std = (0.24703233, 0.24348505, 0.26158768)
# define the transformation of the train images.
train_transform = transforms.Compose([transforms.Resize((input_size,input_size)),transforms.RandomHorizontalFlip(),
                                      transforms.RandomVerticalFlip(),transforms.RandomRotation(20),
                                      transforms.ColorJitter(brightness=0.1, contrast=0.1, hue=0.1),
                                        transforms.ToTensor(), transforms.Normalize(norm_mean, norm_std)])
# define the transformation of the val images.
val_transform = transforms.Compose([transforms.Resize((input_size,input_size)), transforms.ToTensor(),
                                    transforms.Normalize(norm_mean, norm_std)])
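
The normalization step is simple arithmetic: each channel has its mean subtracted and is divided by its standard deviation. The NumPy sketch below reproduces that arithmetic on a tiny synthetic image, assuming the channel-first layout (3 x H x W) that `ToTensor` produces. The image contents are made up so that the result is easy to predict.

```python
import numpy as np

norm_mean = np.array([0.49139968, 0.48215827, 0.44653124], dtype=np.float32)
norm_std = np.array([0.24703233, 0.24348505, 0.26158768], dtype=np.float32)

# Channel-first image (3 x 2 x 2) whose every pixel equals its channel mean.
img_at_mean = np.ones((3, 2, 2), dtype=np.float32) * norm_mean[:, None, None]
# Second image: every pixel sits one standard deviation above the mean.
img_one_std = img_at_mean + norm_std[:, None, None]

def normalize(img):
    # Same arithmetic as a per-channel normalization: (x - mean) / std.
    return (img - norm_mean[:, None, None]) / norm_std[:, None, None]

print(normalize(img_at_mean).max(), normalize(img_one_std).max())
```

Pixels at the channel mean map to 0 and pixels one standard deviation above it map to 1, which is the scale the network sees at its input.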

This is a custom PyTorch dataset class called HAM10000 that is defined to load the data for the HAM10000 dataset. It takes a dataframe and an optional transformation as input. 

- `__init__(self, df, transform=None)` : Initializes the class with the dataframe and the transformation.

- `__len__(self)` : Returns the length of the dataset.

- `__getitem__(self, index)` : Returns a tuple of the image data and its corresponding label. It takes an index as input and returns the transformed image and its corresponding label as a PyTorch tensor.

In [None]:
# Define a pytorch Dataset for the HAM10000 images
class HAM10000(Dataset):
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        # Load data and get label
        X = Image.open(self.df['path'][index])
        y = torch.tensor(int(self.df['cell_type_idx'][index]))

        if self.transform:
            X = self.transform(X)

        return X, y

Here we define the training and validation sets using the tables `df_train` and `df_val` and our defined transformations (`train_transform` and `val_transform`). We use the PyTorch DataLoader to create batches of images to feed to the model during training and validation. We set the batch size to 32, shuffle the training set, and do not shuffle the validation set. We also set the number of workers to 4 to parallelize the data loading process.

In [None]:
# Define the training set using the table df_train and our defined transformations (train_transform)
training_set = HAM10000(df_train, transform=train_transform)
train_loader = DataLoader(training_set, batch_size=32, shuffle=True, num_workers=4)
# Same for the validation set, which uses val_transform (no random augmentation):
validation_set = HAM10000(df_val, transform=val_transform)
val_loader = DataLoader(validation_set, batch_size=32, shuffle=False, num_workers=4)

This code initializes an Adam optimizer with a learning rate of 0.001 and sets the cross entropy loss as the loss function. The model.parameters() method is used to retrieve the parameters of the model that need to be optimized. The optimizer will be used to update these parameters during training to minimize the loss function. The to(device) method is used to move the loss function to the same device as the model, which is typically a GPU.

In [None]:
# we use Adam optimizer, use cross entropy loss as our loss function
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss().to(device)

## Step 3. Model training

The `AverageMeter` class is used during training process to calculate the loss and accuracy. It has four attributes: `val`, `avg`, `sum`, and `count`. `val` is the most recent value that has been calculated, `sum` is the sum of all the values calculated so far, `count` is the number of values calculated so far, and `avg` is the average of all the values calculated so far. It has three methods: `reset`, `update`, and `__init__`. The `reset` method resets all attributes to their initial values. The `update` method updates the values of `val`, `sum`, `count`, and `avg` based on a new input value `val`. The `__init__` method initializes all attributes to their initial values by calling the `reset` method.

In [None]:
# this class is used during the training process to track the running average of the loss and accuracy
class AverageMeter(object):
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
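
The `n` argument of `update` decides whether each batch or each sample carries equal weight. The sketch below repeats the class so that it runs on its own and feeds it two hypothetical batches of different sizes; the batch accuracies and sizes are invented for illustration.

```python
class AverageMeter:
    """Copy of the class above, repeated so this example runs on its own."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = self.avg = self.sum = self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

# Two hypothetical batches: 32 samples at 50% accuracy, then 8 samples at 100%.
unweighted, weighted = AverageMeter(), AverageMeter()
for batch_acc, batch_size in [(0.5, 32), (1.0, 8)]:
    unweighted.update(batch_acc)              # every batch counts equally
    weighted.update(batch_acc, n=batch_size)  # every sample counts equally

print(unweighted.avg, weighted.avg)  # 0.75 0.6
```

The two averages differ only when batch sizes differ, which in practice means the last, smaller batch of an epoch.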

The `train` function is used during the training process to update the weights of the model using the optimizer and the loss function. It takes in the training data loader, the model, the loss function, the optimizer, and the current epoch number as inputs. Inside the function, for each batch of data, it loads the images and labels and performs forward propagation through the model to get the outputs. Then, it calculates the loss based on the predicted outputs and the actual labels, and backpropagates the loss through the model to calculate the gradients. Finally, it updates the model parameters using the optimizer based on the calculated gradients. The function also calculates and returns the average loss and accuracy for the current epoch.

In [None]:
total_loss_train, total_acc_train = [],[]
def train(train_loader, model, criterion, optimizer, epoch):
    model.train()
    train_loss = AverageMeter()
    train_acc = AverageMeter()
    curr_iter = (epoch - 1) * len(train_loader)
    for i, data in enumerate(train_loader):
        images, labels = data
        N = images.size(0)
        # print('image shape:',images.size(0), 'label shape',labels.size(0))
        images = Variable(images).to(device)
        labels = Variable(labels).to(device)

        optimizer.zero_grad()
        outputs = model(images)

        if isinstance(outputs, tuple):
            # Inception v3 in train mode returns (logits, aux_logits);
            # the 0.4 weight on the auxiliary loss is an assumed, commonly used value
            outputs, aux_outputs = outputs
            loss = criterion(outputs, labels) + 0.4 * criterion(aux_outputs, labels)
        else:
            loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        prediction = outputs.max(1, keepdim=True)[1]
        train_acc.update(prediction.eq(labels.view_as(prediction)).sum().item()/N)
        train_loss.update(loss.item())
        curr_iter += 1
        if (i + 1) % 100 == 0:
            print(f"[epoch {epoch}], [iter {i + 1}/{len(train_loader)}], "f"[train loss {train_loss.avg:.5f}], "f"[train acc {train_acc.avg:.5f}]")
            # running averages are recorded every 100 iterations, not once per epoch
            total_loss_train.append(train_loss.avg)
            total_acc_train.append(train_acc.avg)
    return train_loss.avg, train_acc.avg

The `validate` function is used to evaluate the model on the validation set after each training epoch. It takes in the validation dataloader, the model, the loss criterion, the optimizer, and the current epoch as input. 

Within the function, the model is set to evaluation mode using `model.eval()`, which switches layers such as dropout and batch normalization to their inference behavior. Gradient computation is turned off separately by the `torch.no_grad()` context. Then, we iterate through the validation dataloader and compute the validation loss and accuracy for each batch. We update `val_loss` and `val_acc`, which are instances of the `AverageMeter` class.

Finally, we print the validation loss and accuracy, and return these values.

In [None]:
def validate(val_loader, model, criterion, optimizer, epoch):
    model.eval()
    val_loss = AverageMeter()
    val_acc = AverageMeter()
    with torch.no_grad():
        for i, data in enumerate(val_loader):
            images, labels = data
            N = images.size(0)
            images = Variable(images).to(device)
            labels = Variable(labels).to(device)

            outputs = model(images)
            prediction = outputs.max(1, keepdim=True)[1]

            val_acc.update(prediction.eq(labels.view_as(prediction)).sum().item()/N)

            val_loss.update(criterion(outputs, labels).item())

    print('------------------------------------------------------------')
    print('[epoch %d], [val loss %.5f], [val acc %.5f]' % (epoch, val_loss.avg, val_acc.avg))
    print('------------------------------------------------------------')
    return val_loss.avg, val_acc.avg

This code segment trains a PyTorch model on a given dataset for a specified number of epochs. 

`epoch_num` specifies the number of epochs to train for. `best_val_acc` is a variable that is used to keep track of the highest validation accuracy achieved during training. `total_loss_val` and `total_acc_val` are empty lists that will be used to store the validation loss and accuracy for each epoch. 

`model` is moved to the device (e.g. GPU) outside of the loop. This is done to prevent repeatedly moving the model to the device for each epoch, which can be time-consuming. 

The loop iterates over the number of epochs specified by `epoch_num`. For each epoch, the `train` function is called with the training data, model, criterion, optimizer, and epoch number as arguments. The `train` function returns the average training loss and accuracy for that epoch. 

After the training is complete for an epoch, the `validate` function is called with the validation data, model, criterion, optimizer, and epoch number as arguments. The `validate` function calculates the validation loss and accuracy for the current epoch. `torch.no_grad()` is used during validation to disable gradient computation, which can save time and memory. 

The validation loss and accuracy for the current epoch are then appended to `total_loss_val` and `total_acc_val` lists, respectively. If the current validation accuracy is higher than the previous best accuracy (`best_val_acc`), the `best_val_acc` variable is updated, the model's `state_dict` and the full model are saved to disk, and a message is printed indicating that a new best accuracy has been achieved.

Finally, `torch.cuda.empty_cache()` is called to release GPU memory after each epoch, which can help prevent out-of-memory errors.

In [None]:
epoch_num = 50
best_val_acc = 0
total_loss_val, total_acc_val = [],[]
model = model.to(device) # move the model to the device outside the loops
for epoch in range(1, epoch_num+1):
    train_loss, train_acc = train(train_loader, model, criterion, optimizer, epoch)
    with torch.no_grad(): # use torch.no_grad() during validation
        val_loss, val_acc = validate(val_loader, model, criterion, optimizer, epoch)
    total_loss_val.append(val_loss)
    total_acc_val.append(val_acc)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        filename_state = f'model_{epoch}ch__ValAcc_{val_acc}_state.pth'
        torch.save(model.state_dict(), filename_state)
        filename = f'model_{epoch}ch__ValAcc_{val_acc}.pth'
        torch.save(model, filename)
        print('*****************************************************')
        print('best record: [epoch %d], [val loss %.5f], [val acc %.5f]' % (epoch, val_loss, val_acc))
        print('*****************************************************')
    torch.cuda.empty_cache() # release GPU memory after each epoch

## Step 4. Model evaluation

This code segment plots the training and validation loss/accuracy using the Matplotlib library.

- `ax.plot(total_loss_train, label='training loss')` plots the training loss on the first y-axis (`ax`). `total_loss_train` holds the running average loss that `train` records every 100 iterations. The label `training loss` is used to identify this line in the legend.
- `ax.plot(total_acc_train, label='training accuracy')` plots the training accuracy on the first y-axis (`ax`). `total_acc_train` holds the running average accuracy, recorded at the same points. The label `training accuracy` is used to identify this line in the legend.
- `ax.set_ylabel('training loss/accuracy')` sets the y-axis label for `ax` to `'training loss/accuracy'`.
- `ax2 = ax.twinx()` creates a second y-axis (`ax2`) that shares the same x-axis as `ax`.
- `ax2.plot(total_loss_val, label='validation loss', color='orange')` plots the validation loss on the second y-axis (`ax2`). `total_loss_val` is a list with one validation loss per epoch. The line color is set to orange using `color='orange'`.
- `ax2.plot(total_acc_val, label='validation accuracy', color='green')` plots the validation accuracy on the second y-axis (`ax2`). `total_acc_val` is a list with one validation accuracy per epoch. The line color is set to green using `color='green'`.
- `ax2.set_ylabel('validation loss/accuracy')` sets the y-axis label for `ax2` to `'validation loss/accuracy'`.
- `ax.set_xlabel('epoch')` sets the x-axis label to `'epoch'`. Only the validation lists have one entry per epoch; the training lists have one entry per 100 iterations, so the two sets of curves do not share the same x scale.
- `ax.set_title('Training and Validation Loss/Accuracy')` sets the plot title to `'Training and Validation Loss/Accuracy'`.
- `plt.legend()` displays the legend. It acts on the current axes, which is `ax2` after `twinx()`, so only the validation curves are listed.
- `plt.tight_layout()` adjusts the layout so that labels and titles do not overlap.
- `plt.show()` displays the plot.

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(total_loss_train, label='training loss')
ax.plot(total_acc_train, label='training accuracy')
ax.set_ylabel('training loss/accuracy')
ax2 = ax.twinx()
ax2.plot(total_loss_val, label='validation loss', color='orange')
ax2.plot(total_acc_val, label='validation accuracy', color='green')
ax2.set_ylabel('validation loss/accuracy')
ax.set_xlabel('epoch')
ax.set_title('Training and Validation Loss/Accuracy')
plt.legend()
plt.tight_layout()
plt.show()

This is a Python function that takes a confusion matrix and some additional parameters as input, and generates a plot of the confusion matrix using the Matplotlib library. The parameters are:

- `cm`: the confusion matrix, represented as a 2D NumPy array.
- `classes`: a list of class names. The names should correspond to the rows and columns of the confusion matrix.
- `normalize`: a boolean indicating whether to normalize the confusion matrix. If set to `True`, each row of the confusion matrix will be divided by the total number of samples in that class. This can be useful when the classes are imbalanced.
- `title`: a string that sets the title of the plot.
- `cmap`: a colormap used to display the matrix.

If `normalize` is `True`, the function first divides each row by its total using `cm.sum(axis=1)[:, np.newaxis]`, so that each cell holds the fraction of that true class assigned to each predicted class. Normalizing before drawing keeps the heatmap colors and the printed numbers consistent.

The function then displays the matrix using `plt.imshow()`, which generates a heatmap with colors indicating the values in the matrix. The `title` parameter sets the title of the plot, and `plt.colorbar()` adds a colorbar that maps colors to values. The tick positions on the x and y axes are labeled with the `classes` list using `plt.xticks()` and `plt.yticks()`.

The `thresh` variable is half the maximum value in the matrix. Cells above it are annotated in white text and the others in black, so the numbers stay readable on both dark and light cells. Counts are printed as integers and normalized values with two decimals.

Finally, the function sets the x and y axis labels using `plt.xlabel()` and `plt.ylabel()`, and calls `plt.tight_layout()` to adjust the padding so that the tick labels are not clipped.

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # normalize before drawing so the heatmap and the cell text show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # two decimals for fractions, plain integers for counts
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
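
The row normalization is easier to follow on a small example. The sketch below uses a made-up 3-class matrix (the numbers are illustrative, not results from this project) and applies the same expression as the function above:

```python
import numpy as np

# Toy 3-class confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 5, 5]])

# Samples per true class, reshaped to a column of shape (3, 1).
row_totals = cm.sum(axis=1)[:, np.newaxis]
# Broadcasting divides every row by its own total.
cm_norm = cm.astype(float) / row_totals

print(cm_norm)
```

Each row of `cm_norm` sums to 1, and the diagonal entry of a row is the fraction of that class that was classified correctly.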

This code segment evaluates the trained model on the validation set and plots a confusion matrix using scikit-learn's `confusion_matrix` (imported at the top of the notebook) and the `plot_confusion_matrix` function defined above.

- `model.eval()` sets the model to evaluation mode, so dropout is disabled and batch normalization uses its running statistics.
- `y_label = []` initializes an empty list to store the true labels.
- `y_predict = []` initializes an empty list to store the predicted labels.
- `with torch.no_grad():` disables gradient calculation during evaluation for efficiency.
- `for images, labels in val_loader:` iterates over the validation data loader and unpacks the inputs and labels of each batch.
- `images = images.to(device)` moves the inputs to the specified device (CPU or GPU). The `Variable` wrapper is not needed: since PyTorch 0.4, tensors and variables are the same type.
- `outputs = model(images)` feeds the inputs through the model to obtain one score per class for each sample.
- `prediction = outputs.argmax(dim=1)` obtains the predicted labels by selecting the class with the highest score. The result is a 1-D tensor even when a batch holds a single sample.
- `y_label.extend(labels.cpu().numpy())` appends the true labels to the `y_label` list.
- `y_predict.extend(prediction.cpu().numpy())` appends the predicted labels to the `y_predict` list.

After the loop, the confusion matrix is computed using the `confusion_matrix` function on `y_label` and `y_predict`.

`plot_labels = ['akiec', 'bcc', 'bkl', 'df', 'nv', 'vasc', 'mel']` defines the label names for the confusion matrix plot.

Finally, the `plot_confusion_matrix` function is called to generate the plot using `confusion_mtx` and `plot_labels` as inputs.

In [None]:
model.eval()
y_label = []
y_predict = []
with torch.no_grad():
    for images, labels in val_loader:
        images = images.to(device)
        outputs = model(images)
        # index of the highest-scoring class for every sample in the batch
        prediction = outputs.argmax(dim=1)
        y_label.extend(labels.cpu().numpy())
        y_predict.extend(prediction.cpu().numpy())

# compute the confusion matrix
confusion_mtx = confusion_matrix(y_label, y_predict)
# plot the confusion matrix
plot_labels = ['akiec', 'bcc', 'bkl', 'df', 'nv', 'vasc', 'mel']
plot_confusion_matrix(confusion_mtx, plot_labels)

`classification_report` is a scikit-learn function that builds a text report with the precision, recall, F1-score, and support of each class.

`y_label` and `y_predict` are the true and predicted labels, respectively.

`target_names=plot_labels` specifies the display name for each class.

By default, `classification_report` returns a formatted string containing the per-class metrics followed by the overall accuracy and the macro and weighted averages. This string is printed using the `print` function.

In [None]:
# Generate a classification report
report = classification_report(y_label, y_predict, target_names=plot_labels)
print(report)
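
To make the report's columns concrete, the sketch below derives per-class precision, recall, F1-score, and support directly from a confusion matrix with NumPy. The 3-class matrix is made up for illustration, and the code assumes every class is predicted at least once (otherwise a column sum is zero and the precision is undefined):

```python
import numpy as np

# Toy confusion matrix: rows are true labels, columns are predictions.
cm = np.array([[8, 1, 1],
               [2, 6, 2],
               [0, 5, 5]])

tp = np.diag(cm).astype(float)       # correctly classified samples per class
precision = tp / cm.sum(axis=0)      # column sums: everything predicted as the class
recall = tp / cm.sum(axis=1)         # row sums: everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)             # number of true samples per class

for k in range(len(tp)):
    print(f'class {k}: precision={precision[k]:.3f} recall={recall[k]:.3f} '
          f'f1={f1[k]:.3f} support={support[k]}')
```

The F1-score is the harmonic mean of precision and recall, so it is high only when both are high.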

This code computes the fraction of samples that are misclassified for each true label and generates a bar plot of the results.

`label_frac_error = 1 - np.diag(confusion_mtx) / np.sum(confusion_mtx, axis=1)` calculates the fraction of misclassified samples for each true label. It divides the diagonal of the confusion matrix (the correctly classified samples) by the row sums (the total number of samples with each true label) and subtracts the result from 1.

`plt.bar(np.arange(7), label_frac_error)` creates a bar plot with 7 bars (one for each label), with the `x` axis representing the label index and the `y` axis representing the fraction of samples classified incorrectly.

`plt.xlabel('True Label')` sets the label for the `x` axis to "True Label".

`plt.ylabel('Fraction classified incorrectly')` sets the label for the `y` axis to "Fraction classified incorrectly".

In [None]:
label_frac_error = 1 - np.diag(confusion_mtx) / np.sum(confusion_mtx, axis=1)
plt.bar(np.arange(7),label_frac_error)
plt.xlabel('True Label')
plt.ylabel('Fraction classified incorrectly')
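
As a quick check of the formula, here is a small worked example. The matrix and the class names (`class_a`, `class_b`, `class_c`) are hypothetical and are not taken from the notebook's results:

```python
import numpy as np

labels = ['class_a', 'class_b', 'class_c']  # hypothetical class names
cm = np.array([[18, 2, 0],
               [3, 6, 1],
               [1, 4, 5]])

# Same expression as in the cell above: 1 - correct / total, per true label.
label_frac_error = 1 - np.diag(cm) / np.sum(cm, axis=1)

# Name of the class with the highest error fraction.
worst = labels[int(np.argmax(label_frac_error))]
print(label_frac_error, worst)
```

The first class has 18 of 20 samples correct, so its error fraction is 0.1. The third class has only 5 of 10 correct and is the hardest one in this toy example.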

This code saves the state dictionary of the PyTorch model to a file named `model.pth`. The `state_dict()` method returns a dictionary containing the parameters and persistent buffers of the model. The `torch.save()` function saves the dictionary to a file in a binary format that can be loaded later using the `torch.load()` function. By saving the model, we can load it later for further training or evaluation without having to retrain it from scratch.

In [None]:
# save model to file
torch.save(model.state_dict(), 'model.pth')


In [None]:
#To load the model from the file, you can use the torch.load() function as follows:
# load the saved model
#model = MyModelClass()
#model.load_state_dict(torch.load('model.pth'))
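
The save and load round trip can be checked on a small stand-in model. This is a sketch under assumptions: `build_model` is a hypothetical helper that returns a tiny `nn.Linear` layer instead of the notebook's network, and an in-memory buffer replaces the `model.pth` file:

```python
import io

import torch
from torch import nn


def build_model():
    # Hypothetical stand-in; the real network must be rebuilt with the same
    # architecture it had when its state_dict was saved.
    return nn.Linear(4, 7)


trained = build_model()
buffer = io.BytesIO()                      # in-memory stand-in for 'model.pth'
torch.save(trained.state_dict(), buffer)

buffer.seek(0)
restored = build_model()                   # same architecture, fresh random weights
restored.load_state_dict(torch.load(buffer, map_location='cpu'))
restored.eval()                            # evaluation mode before inference

x = torch.randn(2, 4)
with torch.no_grad():
    same = torch.allclose(trained(x), restored(x))
print(same)
```

A `state_dict` stores only parameters and buffers, so the model class has to be instantiated before `load_state_dict` is called. Passing `map_location='cpu'` lets a checkpoint saved on a GPU be loaded on a machine without one.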

!pip install google-colab

n = 6
from IPython import get_ipython
ipython = get_ipython()

filename = f'model_{n}.pth'
torch.save(model.state_dict(), filename)

# download model file
from IPython.display import FileLink

remote_url = FileLink(f'model_{n}.pth')

# download the file to your local machine
from google.colab import files
files.download('/kaggle/working/model.pth')

## Conclusion

We tried training with different network structures. With DenseNet-121, the average accuracy over the 7 classes on the validation set can reach 92% in 10 epochs. We also calculated the confusion matrix for all classes and the F1-score for each class, which is a more comprehensive indicator because it takes into account both the precision and the recall of the classification model. Our model achieves more than 90% on the F1-score.

Due to limited time, we did not spend much time on model training. Training for more epochs, tuning the model hyperparameters, and trying other network architectures may further improve the model's performance.

## Next plan

The next question is how to use image data and patient case data at the same time. The plan is to use a CNN to extract features from the images, use xgboost to convert the medical records into vectors, and then concatenate those vectors with the features from the CNN's fully connected layer. The two branches would be trained jointly with a single loss function. The methods used for click-through rate (CTR) estimation in advertising can serve as a reference.
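
One possible shape for such a two-branch model is sketched below. Everything in it is an assumption made for illustration: `TwoBranchNet`, the layer sizes, and the tiny convolution stack (a stand-in for a pretrained CNN) are hypothetical, and the tabular input is assumed to be a numeric vector already derived from the patient records (the xgboost step is not shown):

```python
import torch
from torch import nn


class TwoBranchNet(nn.Module):
    """Hypothetical fusion model: image features + tabular features -> class logits."""

    def __init__(self, tab_dim, num_classes=7):
        super().__init__()
        # Small conv stack standing in for a pretrained CNN backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                  # -> (batch, 8)
        )
        # Branch for the vector derived from the patient records.
        self.tab = nn.Sequential(nn.Linear(tab_dim, 16), nn.ReLU())
        # Classifier on the concatenated features.
        self.head = nn.Linear(8 + 16, num_classes)

    def forward(self, image, tabular):
        fused = torch.cat([self.cnn(image), self.tab(tabular)], dim=1)
        return self.head(fused)


fusion_model = TwoBranchNet(tab_dim=5)
images = torch.randn(4, 3, 32, 32)         # random stand-in image batch
tabular = torch.randn(4, 5)                # random stand-in record vectors
labels = torch.randint(0, 7, (4,))

logits = fusion_model(images, tabular)
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()                            # one loss sends gradients into both branches
print(logits.shape)
```

Because the loss is computed on the fused output, a single backward pass updates the image branch and the tabular branch together, which is what training both branches with one loss function amounts to.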