# Melanoma detection with Keras and TensorFlow

In this assignment, our primary goal is to develop a reliable and accurate convolutional neural network (CNN) that predicts whether an image of a mole shows features of melanoma, enabling early detection of skin cancer. Such a model may help health specialists spot the first signs of the disease and treat it effectively.

To achieve this goal, we will use a variety of machine learning techniques that help us build the most accurate model possible given the existing data, technical resources, and time constraints.

Before we turn to the actual code, we will discuss the model itself and the main features of the method we will use to train it.

## Model:

### Efficient Net B7

Image classification is a widely studied topic, and many advanced models already achieve high scores (90+ %) on their own without any extra layers. Since we will reuse an existing model structure for our assignment, we will use **transfer learning**, a machine learning technique where a model trained on one task is re-purposed on a second related task (Brownlee, 2019).

Transfer learning is an excellent opportunity to obtain great results without reinventing the wheel and adding extra complexity to our neural network. While it is possible to create a model with the same accuracy from scratch, doing so would take a lot of time unnecessarily because good models are already available.

EfficientNet is a family of state-of-the-art convolutional neural networks first introduced by Mingxing Tan and Quoc V. Le in their work *EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks* (2019). Even though it provides excellent accuracy on image datasets (CIFAR-100 (91.7%), Flowers (98.8%), etc.), EfficientNet is lightweight, small, and fast compared to other popular image classification models. For example, the simplest EfficientNet model, B0, contains only 5,330,564 parameters, while ResNet-50 (one of the popular image classification models) has 23,534,592 parameters and at the same time underperforms the former model (Zhang, 2019).
As the graph below from Tan and Le (2019) shows, EfficientNets achieve the best accuracy among models of comparable size.

![](https://raw.githubusercontent.com/tensorflow/tpu/master/models/official/efficientnet/g3doc/params.png)
<center> <i>Figure 1.</i> <b>Model Size vs. ImageNet Accuracy.</b> All numbers are for single-crop, single-model. EfficientNets significantly outperform other ConvNets. In particular, EfficientNet-B7 achieves new state-of-the-art 84.4% top-1 accuracy but being 8.4x smaller and 6.1x faster than GPipe. EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152 (Tan & Le, 2019).</center>

"The main building block for EfficientNet is **MBConv**, an inverted bottleneck conv, originally known as MobileNetV2. By using shortcuts directly between the bottlenecks, which connects a much fewer number of channels compared to expansion layers, combined with depthwise separable convolution, which effectively reduces computation by almost a factor of k<sup>2</sup>, compared to traditional layers. Where k stands for the kernel size, specifying the height and width of the 2D convolution window" (Zhang, 2019). 

The EfficientNet B7 architecture is complex, yet it is built from a combination of the simple blocks shown in the scheme below (Agarwal, 2020).

![](https://miro.medium.com/max/2000/1*cwMpOJNhwOeosjwW-usYvA.png)
<center> <i>Figure 2.</i> <b>Building Blocks of EfficientNets.</b></center>


These 5 types of blocks are combined into sub-blocks (Agarwal, 2020).

![](https://miro.medium.com/max/1400/1*snN5M6WXlqHVFAwi17H9Mw.png)
<center> <i>Figure 3.</i> <b>Building Sub-blocks of EfficientNets.</b></center>


The overall architecture contains 813 layers, organised from sub-blocks as shown in the figure below (Agarwal, 2020).

![](https://miro.medium.com/max/2000/1*9LkWH_LUPi5QD1k-QcUA2g.png)
<center> <i>Figure 4.</i> <b>EfficientNet B7 Architecture.</b></center>


One of the best features of EfficientNets is compound scaling, which scales the network's width, depth, and input resolution together with a single coefficient instead of tuning each dimension separately. Balancing these three dimensions keeps training efficient and makes the model easy to use with high-resolution images.
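
The idea behind compound scaling can be sketched in a few lines. The bases 1.2, 1.1, and 1.15 are the values Tan and Le (2019) report for the B0 baseline; the function names are ours, and the released B1 to B7 variants may not follow these formulas exactly, so the printed numbers are illustrative only.

```python
# Compound scaling sketch (Tan & Le, 2019): one coefficient phi scales
# depth, width, and resolution together.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the paper

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi):
    """FLOPS grow roughly with depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
          f"FLOPS x{flops_multiplier(phi):.2f}")
```

Because the product of the depth base, the squared width base, and the squared resolution base is close to 2, each increment of phi roughly doubles the computational cost.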

As we have seen, EfficientNet B7 is a great choice for our image classification assignment because of its efficiency in terms of technical resources, time, and accuracy.

## Techniques:

In this section, we look at the techniques implemented to increase the overall accuracy and generalization of the CNN. We will explore how exactly we use them in the code later in this paper.


### 1. K-fold cross-validation

K-fold cross-validation is a popular technique for checking that our model performs well on unseen test data. It is widely used to detect and limit overfitting (failure to generalize to other data points (test data) because of a too tight fit to the training data). To do so, we need to set aside data points from our dataset as validation data (data we use to calculate the accuracy of our model).

Even though there are several ways to perform this split, some have severe drawbacks. For example, train_test_split from the scikit-learn library divides the dataset into two random parts (one to train the model and one to test its accuracy), so the model never learns from the information stored in the test records. Because our dataset is very imbalanced, such a split can also leave the training part with very few melanomas: a standard 80/20 split puts about 6,600 images into the test data, while the original dataset contains only about 500 melanoma images in total. In the worst case, all of them land in the test data, so our neural network sees no malignant images and has to train on benign ones only.

K-fold cross-validation helps in this situation. It splits the dataset into K folds and then runs a series of K trainings, each time training on K-1 folds and testing on the remaining one. Overall, this process can be visualized this way:
![](https://www.researchgate.net/profile/B_Aksasse/publication/326866871/figure/fig2/AS:669601385947145@1536656819574/K-fold-cross-validation-In-addition-we-outline-an-overview-of-the-different-metrics-used.jpg)
<center> <i>Figure 5.</i> <b>K-fold Cross-Validation Process.</b> (Reference 6)</center>

Because all records are used for both training and testing, we do not lose any data, and the measured accuracy of the model does not depend on one random train-test split. K-fold cross-validation therefore gives a less biased estimate of how well the model generalizes.
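
The splitting logic can be illustrated with a small hand-rolled sketch. The notebook itself relies on scikit-learn's KFold; the helper `k_fold_indices` and the file names below are hypothetical and only show that every item ends up in the validation fold exactly once.

```python
import random

def k_fold_indices(n_items, k, seed=42):
    """Yield (train_idx, val_idx) pairs; every item is validated exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # distribute the shuffled items as evenly as possible over k folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(x for j, fold in enumerate(folds) if j != i for x in fold)
        yield train_idx, val_idx

# hypothetical file names standing in for the TFRecord files
files = [f"train{i:02d}.tfrec" for i in range(10)]
for fold, (trn, val) in enumerate(k_fold_indices(len(files), k=5), start=1):
    print(f"fold {fold}: train on {len(trn)} files, validate on {[files[i] for i in val]}")
```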

### 2. Extended Dataset

If we look at the original dataset ("SIIM-ISIC Melanoma Classification", 2020), it is imbalanced: the ratio of malignant images to benign ones is about 0.018, which means that the number of disease-free pictures is tremendously higher than the number of melanoma ones.

This imbalance can bias our results a lot: even if the model predicts all images to be benign, its accuracy will be about 98%, although such a model is useless. Therefore, we will use an extended dataset (Reference 8) with more malignant and benign images, so our model can train on more images and detect more patterns. The extended dataset contains about seventy thousand images versus thirty thousand in the original one, and the ratio of malignant images to benign ones is increased.
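
A quick calculation shows why accuracy alone is misleading here. The counts below are illustrative (18 malignant images per 1,000 benign ones, matching the 0.018 ratio), not the real dataset sizes.

```python
# Illustrative counts only: 18 malignant images per 1,000 benign ones
labels = [1] * 18 + [0] * 1000
predictions = [0] * len(labels)  # a "model" that always answers benign

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.3f}, recall on melanoma = {recall:.3f}")
```

The accuracy is high even though not a single melanoma is found, which is why the notebook also tracks AUC.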

### 3. Data Augmentations

Data augmentation is an essential tool for increasing the diversity of the data without actually collecting new data. It can help the model generalize better, as the CNN learns patterns from more diverse data, which increases the range of patterns the model can extract. We use several simple data augmentations in the code: rotations, flips, and changes in hue, contrast, brightness, and saturation. These enhancements diversify our data yet keep it understandable and meaningful.

### 4. TTA (Test-Time Augmentation)

Test-Time Augmentation is another technique that can provide better accuracy on the test data. It means that we predict the labels of the test data in several passes, each with different augmentations applied to the test images, and then write the average of these predictions to the final submission file. With TTA, we reduce the bias a single pass of the model can introduce, as the model sees the same image several times under different augmentations, some of which can make it easier for the neural network to recognize the pattern.
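
As a minimal sketch of the idea, the snippet below averages the predictions of a toy model over several randomly augmented copies of one image. `toy_model` and `toy_augment` are hypothetical stand-ins for the real network and the real augmentation function.

```python
import random

def toy_model(image):
    """Hypothetical stand-in for model.predict: mean pixel value as a 'probability'."""
    return sum(image) / len(image)

def toy_augment(image, rng):
    """Hypothetical augmentation: small random brightness shift, clipped to [0, 1]."""
    shift = rng.uniform(-0.1, 0.1)
    return [min(1.0, max(0.0, px + shift)) for px in image]

def predict_with_tta(image, n_rounds, seed=0):
    """Average the predictions over n_rounds randomly augmented copies."""
    rng = random.Random(seed)
    preds = [toy_model(toy_augment(image, rng)) for _ in range(n_rounds)]
    return sum(preds) / len(preds)

image = [0.2, 0.4, 0.6, 0.8]  # a fake 4-pixel "image"
print(predict_with_tta(image, n_rounds=4))
```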

### 5. Automatic Mixed Precision

Automatic Mixed Precision speeds up neural network training by performing part of the computations in FP16 instead of FP32 (16-bit instead of 32-bit floating point). A 16-bit number takes half the memory of a 32-bit one, so memory usage is lowered. Operations on FP16 values are also much faster on hardware that supports them.
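
The memory saving and its price in precision can be seen with Python's standard `struct` module alone, which supports both formats. This is only an illustration of the two number formats, not of how TensorFlow applies mixed precision internally.

```python
import struct

value = 0.1
fp32 = struct.pack('f', value)  # 32-bit float
fp16 = struct.pack('e', value)  # 16-bit (half-precision) float

print(len(fp32), len(fp16))           # bytes used per number
print(struct.unpack('f', fp32)[0])    # value after rounding to FP32
print(struct.unpack('e', fp16)[0])    # value after rounding to FP16
```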

Now, we go to the actual code. First, we need to install *efficient net* (the transfer learning model), import all packages and libraries, and enable *mixed-precision*.

In [None]:
#import needed packages and libraries
!pip install -q efficientnet
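import math
import os
import re
#a "!export" line runs in its own subshell and does not reach this Python process,
#so the variable is set through os.environ instead
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
import numpy as np
import pandas as pd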
from kaggle_datasets import KaggleDatasets
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow.keras.layers as L
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model,Sequential
from tensorflow.keras import optimizers
import efficientnet.tfkeras as efn
from sklearn.model_selection import KFold

We will use a TPU to train our neural network, as these operations are memory-intensive and slow on a CPU. Therefore, we need to prepare our kernel for TPU usage with the following code:

In [None]:
# Detect TPU and configure the system appropriately
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)


First, we detect whether a TPU is available and connect to the TPU gRPC server running on the TPU VM using **TPUClusterResolver()**, printing its address with **tpu.master()**. Then, if a TPU is present, we create the distribution strategy that the following code uses to train the model on the TPU. Finally, we print the number of replicas stored in the strategy. It will be necessary for subsequent calculations.

If no TPU is present, we fall back to the default strategy, which trains on either a GPU or the CPU.

The next step is to load the dataset using **KaggleDatasets** package. We need to obtain the GCS path of the dataset as it is impossible to use TPU any other way with the Kaggle data.

In [None]:
#get dataset from Kaggle to use it with TPU
DATASET = '512x512-melanoma-tfrecords-70k-images'
GCS_PATH = KaggleDatasets().get_gcs_path(DATASET)

Now that we have the path to the data, we need to set some required parameters:

- SEED - a different seed produces a different K-fold split.


- SIZE - a Python list with the image height and width in pixels.


- BATCH_SIZE - the number of training examples used in one iteration (32 is the default batch size, which we multiply by the number of TPU replicas).


- EPOCHS - the number of passes of the entire training dataset the machine learning algorithm has completed.


- TTA - test-time augmentation. Each test image is randomly augmented and predicted TTA times and the average prediction is used.


- LR - learning rate. A tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.


- WARMUP - a parameter of the learning rate schedule that sets the epoch at which the learning rate stops growing and starts to decrease.


- LABEL_SMOOTHING - a regularization technique for classification problems to prevent the model from predicting the labels too confidently during training and generalizing poorly.


- AUTO - parameter that prompts the tf.data runtime to tune the value dynamically at runtime.

In [None]:
#set required parameters
SEED = 42
SIZE = [512,512]
BATCH_SIZE = 32 * strategy.num_replicas_in_sync
EPOCHS = 10
TTA = 4
LR = 0.00004
WARMUP = 5
LABEL_SMOOTHING = 0.05
AUTO = tf.data.experimental.AUTOTUNE

One important thing we must mention is the seed. We already set a parameter for it, but now we need a function, **seed_everything**, that makes the model training reproducible. After defining this function, we call it to actually set the seed.

In [None]:
#fix random seed for reproducibility
def seed_everything(SEED):
    np.random.seed(SEED)
    tf.random.set_seed(SEED)

seed_everything(SEED)

Now, we need to define one more function, for image augmentation. As we already discussed, augmentation is an important feature to increase the diversity of the dataset by applying several modifications to the original images.

Here, in the function **data_augment**, we use rotations (by 0, 90, 180, or 270 degrees); left-right and up-down flips; random hue adjustment by a small delta (0.01); and random saturation, contrast, and brightness adjustments that differ only slightly from the original image.

Unlike pixel dropout or Gaussian blur, these augmentations do not destroy relevant information, so they can be used to increase accuracy by improving generalizability.

In [None]:
#create a function to augment the image with different augmentations such as rotations,
#flips, changes in hues, saturation, contrast, and brightness to enhance the generalization of the model
def data_augment(image, label=None, seed=SEED):
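    #draw the number of 90-degree rotations with TensorFlow ops so that it is
    #re-sampled for every image (np.random would be evaluated only once inside dataset.map)
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    #rot90 with a tensor k drops the static shape, so restore it for the model
    image = tf.reshape(image, [*SIZE, 3])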
    image = tf.image.random_flip_left_right(image, seed=seed)
    image = tf.image.random_flip_up_down(image, seed=seed)
    image = tf.image.random_hue(image, 0.01)
    image = tf.image.random_saturation(image, 0.7, 1.3)
    image = tf.image.random_contrast(image, 0.8, 1.2)
    image = tf.image.random_brightness(image, 0.1)
    if label is None:
        return image
    else:
        return image, label

The data format we use in the following program is TFRecords, a simple format for storing a sequence of binary records. We already have the dataset of images with metadata in this format (Deotte, 2020), so we need to decode the TFRecords from the files to work with the data.

To decode images and labels (as well as image names for test data) from TFRecords, we need to identify the features we want to extract and parse each record (called an example) of the TFRecordDataset.

The format of the records in the data we use is stated in the description of the dataset (Deotte, 2020):

```python
feature = {
  'image': _bytes_feature,
  'image_name': _bytes_feature,
  'patient_id': _int64_feature,
  'sex': _int64_feature,
  'age_approx': _int64_feature,
  'anatom_site_general_challenge': _int64_feature,
  'source': _int64_feature,
  'target': _int64_feature
}
```

From all these features, we need only the **image** (provided in jpeg), **image_name** (for test data), and **target** (for train and validation data). 

A fair question arises: why don't we use the other metadata? The answer is that the metadata gives us worse results! It can be unclear why this happens. Most probably, however, the reason is the unequal 'weights' of the image and the metadata. For example, the image itself can be the most critical feature in determining whether a person has melanoma, while by adding other metadata (without the needed weight adjustment) we dilute the impact of the image with possibly unrelated details.

The code below can be used to check that adding the metadata lowers the model's accuracy.

```python
def read_labeled_tfrecord(example):
    LABELED_TFREC_FORMAT = {
        # tf.string means bytestring
        "image": tf.io.FixedLenFeature([], tf.string), 
        # shape [] means single element
        "target": tf.io.FixedLenFeature([], tf.int64),
        # meta features
        "age_approx": tf.io.FixedLenFeature([], tf.int64),
        "sex": tf.io.FixedLenFeature([], tf.int64),
        "anatom_site_general_challenge": tf.io.FixedLenFeature([], tf.int64)
        
    }
    example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    label = tf.cast(example['target'], tf.float32)
    # meta features
    data = {}
    data['age_approx'] = tf.cast(example['age_approx'], tf.int32)
    data['sex'] = tf.cast(example['sex'], tf.int32)
    data['anatom_site_general_challenge'] = tf.cast(tf.one_hot(example['anatom_site_general_challenge'], 7), tf.int32)
    # returns a dataset of (image, label, data)
    return image, label, data
    
# this function parses our image and also gets our image_name (id) to perform predictions
def read_unlabeled_tfrecord(example):
    UNLABELED_TFREC_FORMAT = {
        # tf.string means bytestring
        "image": tf.io.FixedLenFeature([], tf.string), 
        # shape [] means single element
        "image_name": tf.io.FixedLenFeature([], tf.string),
        # meta features
        "age_approx": tf.io.FixedLenFeature([], tf.int64),
        "sex": tf.io.FixedLenFeature([], tf.int64),
        "anatom_site_general_challenge": tf.io.FixedLenFeature([], tf.int64)
    }
    example = tf.io.parse_single_example(example, UNLABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    image_name = example['image_name']
    # meta features
    data = {}
    data['age_approx'] = tf.cast(example['age_approx'], tf.int32)
    data['sex'] = tf.cast(example['sex'], tf.int32)
    data['anatom_site_general_challenge'] = tf.cast(tf.one_hot(example['anatom_site_general_challenge'], 7), tf.int32)
    # returns a dataset of (image, key, data)
    return image, image_name, data
```

As we have seen, we need to decode only images with labels (train) and image names (test). We do this with the help of the documentation on the official TensorFlow website (Reference 9).

First, we create a function that decodes the image and reshapes it to the right size (**decode_image**). It uses the decode_jpeg method, as our images are stored in JPEG format inside the TFRecords; the cast method, followed by a division by 255, to represent the image as an array of floats between 0 and 1; and the reshape method to give the image the expected shape.

Second, we create two functions that read the image and label (for training data) or the image and image name (for test data) from TFRecords (**read_labeled_tfrecord** and **read_unlabeled_tfrecord**). Here, we define the needed format by listing the feature names (image, target, image_name) and their types, decode images with our function **decode_image**, and cast values if needed (i.e., the target is cast to tf.int32).

The last function in the following piece of code is **load_dataset**, which loads a TFRecord dataset from the names of the files we need to open. Here, we also specify whether the dataset is labeled (train, validation) or not (test), so that the appropriate function is used to read the records (**read_labeled_tfrecord** or **read_unlabeled_tfrecord**), and whether the dataset must keep its order (test, validation) or not (train).

In [None]:
#create a function  to decode image from jpeg and reshape it to the right size
def decode_image(image):
    image = tf.image.decode_jpeg(image, channels=3) 
    image = tf.cast(image, tf.float32)/255.0
    image = tf.reshape(image, [*SIZE, 3])
    return image

# create a function to read the image and target (train/valid data) from tfrecord    
def read_labeled_tfrecord(example):
    LABELED_TFREC_FORMAT = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.int64),  }
    example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    label = tf.cast(example['target'], tf.int32)
    return image, label 

# create a function to read the image and image name (test data) from tfrecord
def read_unlabeled_tfrecord(example):
    UNLABELED_TFREC_FORMAT = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "image_name": tf.io.FixedLenFeature([], tf.string), }
    example = tf.io.parse_single_example(example, UNLABELED_TFREC_FORMAT)
    image = decode_image(example['image'])
    image_name = example['image_name']
    return image, image_name

# create a function to read full dataset from tfrecords
def load_dataset(filenames, labeled=True, ordered=False):
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False
    dataset = (tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO) 
              .with_options(ignore_order)
              .map(read_labeled_tfrecord if labeled else read_unlabeled_tfrecord, num_parallel_calls=AUTO))
    return dataset

We need to create functions that use our previous basic functions to build the three required datasets (train, validation, and test). Their names are **get_training_dataset**, **get_validation_dataset**, and **get_test_dataset**. They take the filenames and two parameters (labeled and ordered).

At this point, we must understand why each type of dataset has specific parameter values:

- ### train data: 
        
     - **labeled = True**: 
     In train data, we have targets (labels) in TFRecords, so when we read these files, we will need to read them using read_labeled_tfrecord function.
     
     - **ordered = False**: 
     Training data should be shuffled. Not preserving the file order removes the bias that comes from training on data ordered in a biased way (i.e., containing only melanoma images in the beginning).

            
- ### validation data: 
        
     - **labeled = True**: 
     In validation data, we have targets (labels) in TFRecords, so when we read these files, we will need to read them using read_labeled_tfrecord function.
     
     - **ordered = True**: 
     As we use the validation dataset as a test dataset to obtain accuracy while training the model, we need to keep it ordered to measure the accuracy in the same way for all epochs (as a changing order could produce differences in the results that come from the order of records rather than from the model itself).

- ### test data: 
        
     - **labeled = False**: 
     In test data, we do not have targets (labels) in TFRecords, so when we read these files, we will need to read them using read_unlabeled_tfrecord function.
     
     - **ordered = True**: 
     In test data, we need to keep the order, as the CSV we will fill is ordered in this way and because predictions from different TTA passes can only be averaged if the records come in the same order each time.


Other methods we use are:

- **repeat**: to repeat the training dataset for multiple epochs.

- **shuffle**: to shuffle the training dataset even more (as we already ignored the order).

- **batch**: to divide the dataset into batches of BATCH_SIZE records.

- **prefetch**: to reduce the time needed to train and test the model by overlapping data loading with model execution. 'Prefetching overlaps the preprocessing and model execution of a training step. While the model executes training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training and the time it takes to extract the data.' (https://www.tensorflow.org/guide/data_performance).

Moreover, we use our **data_augment** function to augment images in train (increase diversity of the input data) and test (use of TTA) datasets.

In [None]:

#create function to load training dataset, augment images in it, shuffle, and batch it        
def get_training_dataset(filenames, labeled = True, ordered = False):
    dataset = load_dataset(filenames, labeled = labeled, ordered = ordered)
    dataset = dataset.map(data_augment, num_parallel_calls = AUTO)
    # the training dataset must repeat for several epochs
    dataset = dataset.repeat()
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE)
    # prefetch next batch while training (autotune prefetch buffer size)
    dataset = dataset.prefetch(AUTO)
    return dataset

#create function to load validation dataset and batch it (no augmentation or shuffling)
def get_validation_dataset(filenames, labeled = True, ordered = True):
    dataset = load_dataset(filenames, labeled = labeled, ordered = ordered)
    dataset = dataset.batch(BATCH_SIZE)
    # prefetch next batch while training (autotune prefetch buffer size)
    dataset = dataset.prefetch(AUTO) 
    return dataset

#create function to load test dataset and batch it
def get_test_dataset(filenames, labeled = False, ordered = True):
    dataset = load_dataset(filenames, labeled = labeled, ordered = ordered)
    dataset = dataset.map(data_augment, num_parallel_calls = AUTO)
    dataset = dataset.batch(BATCH_SIZE)
    # prefetch next batch while training (autotune prefetch buffer size)
    dataset = dataset.prefetch(AUTO) 
    return dataset

Now, we go to the actual analysis. First, we collect the lists of files that match the patterns '/train*.tfrec' and '/test*.tfrec' using **tf.io.gfile.glob**.

In [None]:
#read train and test filenames from the data directory
train_filenames = tf.io.gfile.glob(GCS_PATH + '/train*.tfrec')
test_filenames = tf.io.gfile.glob(GCS_PATH + '/test*.tfrec')

As the report should be informative, we include a function that displays a random image under seven variations of the augmentations. Its output shows how the augmentations we use change the images. To display the images, we use **matplotlib.pyplot**.

In [None]:
#display random picture with its augmentations
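def plot_transform(num_images):
    plt.figure(figsize=(30,10))
    x = load_dataset(train_filenames, labeled=False)
    #take the first (image, image_name) pair from the dataset
    image, _ = next(iter(x))
    for i in range(1, num_images+1):
        plt.subplot(1, num_images, i)
        plt.axis('off')
        #augment the original image each time instead of stacking augmentations
        augmented = data_augment(image=image)
        #brightness changes can leave [0, 1], so clip before plotting
        plt.imshow(tf.clip_by_value(augmented, 0.0, 1.0))

plot_transform(7)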

Now that we have loaded and prepared the data, we can define our model and its features. We load EfficientNetB7 with the defined image size, pre-trained weights (from ImageNet), global average pooling (each channel of the final feature map is averaged into a single number, so the network outputs one feature vector per image), and without the top level (the dense classifier, as we define it ourselves). After loading EfficientNetB7, we define the last layer of the network as a dense layer with one output and sigmoid activation, which gives a number between 0 and 1 that represents the probability of class 1 rather than class 0 (sigmoid is usually used for binary classification, while softmax is used for multiclass classification).

Then, we define the optimizer (**Adam**, a widely used adaptive optimizer), the loss function (**Binary Cross Entropy**, the standard loss for a binary classifier), and the metrics (**accuracy**, which shows how often *y_pred* matches *y_true*, and **AUC**, which approximates the area under the ROC curve via a Riemann sum).



In [None]:
#create a function to define a model that will be trained
#here, I use transfer learning with the Efficient net B7 model
#which is one of the most advanced models for image classification nowadays

def get_model():
    with strategy.scope():

        model = tf.keras.Sequential([
            efn.EfficientNetB7(input_shape=(*SIZE, 3),weights='imagenet',pooling='avg',include_top=False),
            Dense(1, activation='sigmoid')
        ])
    
        model.compile(
            optimizer='adam',
            loss = tf.keras.losses.BinaryCrossentropy(label_smoothing = LABEL_SMOOTHING),
            metrics=['accuracy',tf.keras.metrics.AUC(name='auc')])
    return model
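
To make the effect of the `label_smoothing` argument more concrete, here is a small pure-Python sketch. The smoothing formula is, to our understanding, the one Keras applies for binary cross-entropy; the helper names `smooth_label` and `binary_cross_entropy` are ours and not part of any library.

```python
import math

def smooth_label(y, label_smoothing):
    """Squeeze a hard 0/1 label towards 0.5."""
    return y * (1.0 - label_smoothing) + 0.5 * label_smoothing

def binary_cross_entropy(y, p):
    """Binary cross-entropy for one target y and predicted probability p."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

LABEL_SMOOTHING = 0.05
for y in (0.0, 1.0):
    print(y, "->", smooth_label(y, LABEL_SMOOTHING))

# an over-confident correct prediction is penalised more once labels are smoothed
p = 0.999
print(binary_cross_entropy(1.0, p),
      binary_cross_entropy(smooth_label(1.0, LABEL_SMOOTHING), p))
```

With smoothed targets the loss no longer rewards pushing the output all the way to 0 or 1, which discourages over-confident predictions.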

One more important component of high accuracy is a correctly defined learning rate. Here, we use a **cosine schedule with warmup**. The warmup parameter marks the epoch (5 in this paper, as we have a total of 10 epochs) before which the learning rate grows linearly and after which it decreases along a cosine curve. 'Compared to some widely used strategies including exponential decay and step decay, the cosine decay decreases the learning rate slowly at the beginning, and then becomes almost linear decreasing in the middle, and slows down again at the end. It potentially improves the training progress' (Zhang, 2018).


In [None]:
#create a function that defines the learning rate schedule
#using cosine schedule with warmups
def get_cosine_schedule_with_warmup(lr,num_warmup_steps, num_training_steps, num_cycles=0.5):
    def lrfn(epoch):
        if epoch < num_warmup_steps:
            return (float(epoch) / float(max(1, num_warmup_steps))) * lr
        progress = float(epoch - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))) * lr

    return tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=True)

lr_schedule= get_cosine_schedule_with_warmup(lr=LR,num_warmup_steps=WARMUP,num_training_steps=EPOCHS)
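
To see what this schedule produces, the formula can be evaluated outside of Keras. The sketch below repeats the body of lrfn as a plain function (the name `cosine_with_warmup` is ours) and prints the learning rate for each of the 10 epochs with the parameter values used in this notebook.

```python
import math

def cosine_with_warmup(epoch, lr, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Same formula as lrfn above, written as a plain function."""
    if epoch < num_warmup_steps:
        return (float(epoch) / float(max(1, num_warmup_steps))) * lr
    progress = float(epoch - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))) * lr

schedule = [cosine_with_warmup(e, lr=0.00004, num_warmup_steps=5, num_training_steps=10)
            for e in range(10)]
for epoch, rate in enumerate(schedule):
    print(f"epoch {epoch}: lr = {rate:.2e}")
```

Note that under this formula the very first epoch (index 0) runs with a learning rate of zero, so the weights only start to change from the second epoch.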

For the model to work correctly, we need the number of images of each type as well as the steps per epoch (how many steps are needed to pass all the batches of images in one epoch). We define the function **count_data_items**, which sums the record counts encoded in the names of the files in the filenames list.

In [None]:
#create function to count data items in the filenames directory
def count_data_items(filenames):
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)

#count the number of train, valid, and test files as well as steps per epochs
NUM_TRAINING_IMAGES = int(count_data_items(train_filenames) * 0.8)
NUM_VALIDATION_IMAGES = int(count_data_items(train_filenames) * 0.2)
NUM_TEST_IMAGES = count_data_items(test_filenames)
STEPS_PER_EPOCH = NUM_TRAINING_IMAGES // BATCH_SIZE
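
The counting relies on the number of records being written into each file name. The sketch below applies the same regular expression to two hypothetical file names (the bucket path and the counts are made up) and assumes a batch size of 256, i.e. 32 images on each of 8 replicas.

```python
import re

def count_items(filenames):
    """Sum the record counts embedded in names such as 'train00-2071.tfrec'."""
    pattern = re.compile(r"-([0-9]*)\.")
    return sum(int(pattern.search(name).group(1)) for name in filenames)

# hypothetical file names; the real ones come from tf.io.gfile.glob
example_files = ["gs://bucket/train00-2071.tfrec", "gs://bucket/train01-2046.tfrec"]
total = count_items(example_files)
print(total, total // 256)  # images, and full batches of 256
```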

The primary step in this assignment is to write the function (called **train**) that trains our model. First, we create an empty list in which the five trained models will be stored (five because we use five folds) and split the train files into five folds with the **KFold** class. Then we run a loop that creates five models, each using a different combination of four training folds and one validation fold.

The important thing here is clearing the session for each model: it frees the memory held by the previous model before we start training a new one. This way, we use memory more efficiently and avoid running out of it.

Another feature we implement is early stopping, which stops training if the validation AUC does not improve by at least 0.0001 (min_delta) for 5 (patience) epochs in a row. Even though this callback is unlikely to trigger when we have just ten epochs, it is good practice if we want to increase the number of epochs to 20 or more.

After training five models with slightly different accuracies, we must combine them to get the most from our k-fold cross-validation. We do this together with TTA. Hence, we predict the target for every row in the submission data frame 4 times with the five models (as TTA is 4). To elaborate, in each of the four rounds we take the average of the predictions of the 5 models for each record, divide it by four, and then sum these parts to get the average target over our TTA rounds and folds.

Once we have the final targets, we write them to a CSV file together with the image names (the submission data frame) and return the data frame as the result of our function.

In [None]:
#define the function that will perform training with k-fold cross-validation
def train(folds = 5):
    
    models = []
    
    # seed everything
    seed_everything(SEED)

    kfold = KFold(folds, shuffle = True, random_state = SEED)
    for fold, (trn_ind, val_ind) in enumerate(kfold.split(train_filenames)):
        print('\n')
        print('-'*50)
        print(f'Training fold {fold + 1}')
        train_dataset = get_training_dataset([train_filenames[x] for x in trn_ind], labeled = True, ordered = False)
        val_dataset = get_validation_dataset([train_filenames[x] for x in val_ind], labeled = True, ordered = True)
        K.clear_session()
        model = get_model()
        # early stopping on the validation AUC (higher is better)
        early_stopping = tf.keras.callbacks.EarlyStopping(monitor = 'val_auc', mode = 'max', patience = 5, 
                                                      verbose = 1, min_delta = 0.0001, restore_best_weights = True)
        history = model.fit(train_dataset, 
                            steps_per_epoch = STEPS_PER_EPOCH,
                            epochs = EPOCHS,
                            callbacks = [early_stopping, lr_schedule],
                            validation_data = val_dataset,
                            verbose = 1)
        models.append(model)
    
    print('\n')
    print('-'*50)
    submission_df = pd.read_csv('/kaggle/input/siim-isic-melanoma-classification/sample_submission.csv')
    print('Computing predictions...')
    #using TTA to predict the test dataset predictions
    for i in range(TTA):
        test_ds = get_test_dataset(test_filenames, labeled = False, ordered = True)
        test_images_ds = test_ds.map(lambda image, image_name: image)
        probabilities = np.average([np.concatenate(model.predict(test_images_ds)) for model in models], axis = 0)
        test_ids_ds = test_ds.map(lambda image, image_name: image_name).unbatch()
        # all in one batch
        test_ids = next(iter(test_ids_ds.batch(NUM_TEST_IMAGES))).numpy().astype('U')
        pred_df = pd.DataFrame({'image_name': test_ids, 'target': probabilities})
        temp = submission_df.copy()   
        del temp['target']  
        submission_df['target'] += temp.merge(pred_df,on="image_name")['target']/TTA
    print('Generating submission.csv file...')
    submission_df.to_csv('submission.csv', index=False)
    return submission_df

Now it is time to train the model with our predefined function **train**.

In [None]:
submission_df = train(5)

After predicting the test data labels, it is essential to check how many images were classified as melanoma. This confirms that our model is not useless (as we mentioned earlier when discussing the 'fake' accuracy obtained by predicting that all images are benign).

In [None]:
pd.Series(np.round(submission_df['target'].values)).value_counts()

# References:

1. Brownlee, J. (2019, September 16). A Gentle Introduction to Transfer Learning for Deep Learning. Retrieved July 20, 2020, from https://machinelearningmastery.com/transfer-learning-for-deep-learning/


2. SIIM-ISIC Melanoma Classification. (2020). Retrieved July 20, 2020, from https://www.kaggle.com/c/siim-isic-melanoma-classification/data


3. Tan, M., & Le, Q.V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ArXiv, abs/1905.11946.


4. Zhang, C. (2019). How to do Transfer learning with Efficientnet. Retrieved July 22, 2020, from https://www.dlology.com/blog/transfer-learning-with-efficientnet


5. Agarwal, V. (2020, May 31). Complete Architectural Details of all EfficientNet Models. Retrieved July 22, 2020, from https://towardsdatascience.com/complete-architectural-details-of-all-efficientnet-models-5fd5b736142


6. Classification algorithms in Data Mining - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/K-fold-cross-validation-In-addition-we-outline-an-overview-of-the-different-metrics-used_fig2_326866871 [accessed 23 Jul, 2020]


7. 3.1. Cross-validation: Evaluating estimator performance. (n.d.). Retrieved July 23, 2020, from https://scikit-learn.org/stable/modules/cross_validation.html

 
8. Deotte, C. (2020, June 06). 512x512 Melanoma TFRecords 70k Images. Retrieved July 23, 2020, from https://www.kaggle.com/cdeotte/512x512-melanoma-tfrecords-70k-images


9. TFRecord and tf.Example: TensorFlow Core. (n.d.). Retrieved July 28, 2020, from https://www.tensorflow.org/tutorials/load_data/tfrecord?hl=en


10. Zhang, C. (2018). Bag of Tricks for Image Classification with Convolutional Neural Networks in Keras. Retrieved August 02, 2020, from https://www.dlology.com/blog/bag-of-tricks-for-image-classification-with-convolutional-neural-networks-in-keras/