## Classification of skin cancer

In [8]:
import wandb

# Only execute when training

wandb.init(project="skin-cancer-detection")

wandb: Currently logged in as: csonto-benjamin (corgi-vision). Use `wandb login --relogin` to force relogin


## Loading the dataset

In its processed form, the dataset is too big to fit into memory (each image is a 133 × 133 × 3 `float32` array), even if we used a lower-precision data type. Our solution is to use `tf.keras.utils.PyDataset` as the base class for our dataset and let it load the data dynamically. The `create_dataset()` utility function uses this class to create a dataset object from the metadata that it receives.
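As a rough check of the memory claim, the sketch below estimates the footprint of the training split alone. It takes the step count (8817) from the training log further down and the batch size (32) from the loading cell. The last batch may be partial, so the image count is an upper bound rather than an exact figure.

```python
# Back-of-the-envelope memory estimate for the training split.
height, width, channels = 133, 133, 3
bytes_per_value = 4  # float32

bytes_per_image = height * width * channels * bytes_per_value

# Taken from the notebook: 8817 steps per epoch at batch size 32.
# The final batch may be smaller, so this is an upper bound.
steps_per_epoch = 8817
batch_size = 32
max_train_images = steps_per_epoch * batch_size

train_bytes = max_train_images * bytes_per_image
print(f"{bytes_per_image} bytes per image")
print(f"up to {max_train_images} training images")
print(f"about {train_bytes / 1e9:.1f} GB as float32")
print(f"about {train_bytes / 2 / 1e9:.1f} GB as float16")
```

Even at half precision the training split would need tens of gigabytes, which is why the batches are loaded on demand.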

First, however, we train an autoencoder to create an embedding for our data, to which we can append the metadata. The `SkinCancerReconstructionDataset` object generates batches in which the target is the same as the input. It has its own utility function as well: `create_reconstruction_dataset()`.
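The batching idea can be sketched without any dependencies. The class below is a hypothetical stand-in, not the actual `SkinCancerReconstructionDataset`: it only mimics the `__len__` / `__getitem__` protocol that a `PyDataset` subclass implements, and `load_image` is an assumed callable that would read and preprocess one file. A real implementation would subclass `tf.keras.utils.PyDataset` and return arrays instead of lists.

```python
import math


class ReconstructionBatches:
    """Hypothetical stand-in for a PyDataset-style reconstruction dataset."""

    def __init__(self, paths, batch_size, load_image):
        self.paths = list(paths)
        self.batch_size = batch_size
        self.load_image = load_image  # assumed callable: path -> image

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, index):
        # Load only the images that belong to the requested batch.
        start = index * self.batch_size
        batch_paths = self.paths[start:start + self.batch_size]
        images = [self.load_image(path) for path in batch_paths]
        # For an autoencoder the target is the input itself.
        return images, images


def fake_loader(path):
    """Placeholder loader that returns a 1x1 'image' instead of reading a file."""
    return [[0.0]]


ds = ReconstructionBatches(
    [f"img_{i}.jpg" for i in range(70)], batch_size=32, load_image=fake_loader
)
inputs, targets = ds[2]
print(len(ds), len(inputs), inputs is targets)
```

Only one batch is held in memory at a time, so the memory cost depends on the batch size rather than on the size of the dataset.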

In [1]:
from preprocessing import create_reconstruction_dataset, load_metadata, upsample_metadata
from sklearn.model_selection import train_test_split
import pandas as pd


pd.options.mode.copy_on_write = True

# Load the metadata and upsample the positive samples
metadata = load_metadata()
metadata = upsample_metadata(metadata, upsample_factor=5)
# Split 70% train / 30% held out, then split the held-out part 60/40 into test and validation
metadata_train, metadata_test = train_test_split(metadata, test_size=0.3)
metadata_test, metadata_valid = train_test_split(metadata_test, test_size=0.4)

# Load the dataset generators
batch_size = 32
ds_train = create_reconstruction_dataset(metadata_train, batch_size)
ds_test = create_reconstruction_dataset(metadata_test, batch_size)
ds_valid = create_reconstruction_dataset(metadata_valid, batch_size)



In [2]:
# Construct the input shape from the size of the images
# and the number of channels (RGB)

input_shape = (*ds_train[0][0].shape[1:3], 3)
input_shape

(133, 133, 3)

In [3]:
from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose, Cropping2D
from tensorflow.keras.models import Sequential, Model


class Autoencoder(Model):
    """Autoencoder to create an embedding for the images"""

    def __init__(self):
        super().__init__()
        self.encoder = Sequential([
            Input(input_shape),
            Conv2D(64, 5, activation="relu", padding="same", strides=2),
            Conv2D(32, 3, activation="relu", padding="same", strides=2),
            Conv2D(1, 3, activation="relu", padding="same", strides=2),
        ])
        self.decoder = Sequential([
            Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"),
            Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
            Conv2DTranspose(64, 5, strides=2, padding="same", activation="relu"),
            Conv2D(3, 3, activation="sigmoid", padding="same"),  # 3 output channels to match the RGB target
            Cropping2D(((2,1), (2,1)))
        ])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

model = Autoencoder()
model.compile(optimizer="adam", loss="mse")
model.encoder.summary()
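The `Cropping2D(((2,1), (2,1)))` layer is easier to follow once the spatial sizes are traced through the network. The sketch below relies on two Keras shape rules: a convolution with `padding="same"` produces `ceil(size / stride)` outputs per dimension, and a transposed convolution with `padding="same"` produces `size * stride`. The helper names are made up for this illustration.

```python
import math


def conv_same(size, stride):
    """Output size of a strided convolution with padding='same'."""
    return math.ceil(size / stride)


def conv_transpose_same(size, stride):
    """Output size of a strided transposed convolution with padding='same'."""
    return size * stride


# Encoder: three stride-2 convolutions.
size = 133
encoder_sizes = [size]
for _ in range(3):
    size = conv_same(size, 2)
    encoder_sizes.append(size)

# Decoder: three stride-2 transposed convolutions.
decoder_sizes = [size]
for _ in range(3):
    size = conv_transpose_same(size, 2)
    decoder_sizes.append(size)

# Cropping2D(((2, 1), (2, 1))) removes 2 + 1 pixels per dimension.
cropped = size - (2 + 1)

embedding_values = encoder_sizes[-1] ** 2 * 1  # one channel in the bottleneck
print(encoder_sizes)   # [133, 67, 34, 17]
print(decoder_sizes)   # [17, 34, 68, 136]
print(cropped, embedding_values)
```

Because 133 is odd, the rounding in the encoder makes the decoder overshoot to 136, and the crop brings the output back to 133. The bottleneck holds 17 × 17 × 1 = 289 values per image.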

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    EarlyStopping(patience=20, start_from_epoch=20, restore_best_weights=True),
    ModelCheckpoint("autoencoder.keras", save_best_only=True),
    # wandb.keras.WandbMetricsLogger()
]
# The dataset objects already yield batches, so batch_size is not passed to fit()
model.fit(ds_train, epochs=300, validation_data=ds_valid, callbacks=callbacks)

Epoch 1/300
 124/8817 ━━━━━━━━━━━━━━━━━━━━ 35:47 247ms/step - loss: 0.0309

### Class weights

Positive samples are heavily under-represented, so the classes need to be balanced. We use the following techniques to compensate:
* **Upsampling**<br>
    Data points that belong to the positive class are added to the dataset multiple times. This is controlled by the `upsample_factor` <br>
    parameter when calling the `upsample_metadata()` function.
* **Data augmentation**<br>
    To make the upsampled images more distinct, some image augmentation techniques are applied, in particular horizontal and vertical mirroring, <br>
    and cropping followed by rescaling. Either one or two of these methods are applied at random.
* **Sample weights**<br>
    For each sample, the loss function is evaluated with a corresponding weight, <br>
    which is higher for the positive samples. We use the following formula: $c_d / (2 \cdot c_s)$, <br>
    where $c_d$ is the count of all samples and $c_s$ is the count of samples in the given class.
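The interaction between upsampling and the weight formula can be illustrated with a small sketch. The label counts (990 negatives, 10 positives) are invented for the example, and `upsample()` assumes that a factor of 5 means each positive sample appears five times in total, which may differ from what `upsample_metadata()` actually does. For two classes the formula has the same form as scikit-learn's "balanced" heuristic, `n_samples / (n_classes * count)`.

```python
from collections import Counter


def upsample(labels, positive_label=1, factor=5):
    """Repeat each positive label `factor` times; negatives are kept once."""
    result = []
    for label in labels:
        repeats = factor if label == positive_label else 1
        result.extend([label] * repeats)
    return result


def class_weights(labels):
    """Weight per class: c_d / (2 * c_s), for a two-class problem."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (2 * count) for cls, count in sorted(counts.items())}


# Hypothetical, heavily imbalanced label list.
labels = [0] * 990 + [1] * 10

weights_raw = class_weights(labels)
weights_upsampled = class_weights(upsample(labels, factor=5))

print(weights_raw)
print(weights_upsampled)
```

Upsampling narrows the gap between the two weights but does not close it, so the remaining imbalance is handled by the loss weighting.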

In [10]:
ds_train.class_weights

{0: 0.5029020849376801, 1: 86.64496314496314}
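As a sanity check, the formula can be inverted to recover the class shares implied by the printed weights: a weight $w = c_d / (2 \cdot c_s)$ corresponds to a share $c_s / c_d = 1 / (2w)$. The sketch below only uses the two numbers shown above.

```python
# Weights copied from the cell output above.
printed_weights = {0: 0.5029020849376801, 1: 86.64496314496314}

# Invert w = c_d / (2 * c_s) to get the share c_s / c_d of each class.
shares = {cls: 1 / (2 * weight) for cls, weight in printed_weights.items()}
total_share = sum(shares.values())

for cls, share in shares.items():
    print(f"class {cls}: {share:.4%} of the samples")
print(f"sum of shares: {total_share:.6f}")
```

The shares add up to one, as expected for two classes, and the positive class makes up well under 1% of the samples.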