# Requirements

## Data Format

Images are also provided in **`JPEG`** format, resized to a uniform **`512x512`**.

Metadata is also provided outside of the DICOM format, in CSV files. See the Columns section for a description.

## What to predict
We have to predict a binary target for each image. The model should predict the probability (a floating-point value between 0.0 and 1.0) that the lesion in the image is malignant (the target). In the training data, `train.csv`, the **value 0 denotes benign, and 1 indicates malignant.**
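
As a small illustration of the expected output, the sketch below checks that predictions are floating-point probabilities in the range 0.0 to 1.0. The helper name `is_valid_probability` and the sample values are hypothetical and not part of the notebook.

```python
def is_valid_probability(value):
    """Return True if value is a float between 0.0 and 1.0 (inclusive)."""
    return isinstance(value, float) and 0.0 <= value <= 1.0

# hypothetical predictions for three images
predictions = [0.03, 0.5, 0.97]
all_valid = all(is_valid_probability(p) for p in predictions)
print(all_valid)
```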

## Data Set Files
The dataset consists of images in:
* JPEG format in the JPEG directory

Additionally, the metadata comprises train, test, and sample submission files in CSV format.
The whole dataset looks like the following:
* **train_color(dir)**
    * train_color --> all the jpg images in the training set
* **test_color(dir)**
    * test_color --> all the jpg images in the test set
* **train.csv** --> the training set metadata
* **test.csv** --> the test set metadata
* sample_submission.csv --> a sample submission file in the correct format
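
Before loading anything, it can help to confirm that this layout is actually present on disk. The following is a minimal sketch using only `pathlib`; the helper `missing_entries` and the constant `EXPECTED_ENTRIES` are hypothetical names, and the entry list simply mirrors the layout above.

```python
from pathlib import Path

# entries listed in the layout above; adjust if your local layout differs
EXPECTED_ENTRIES = [
    "train_color",
    "test_color",
    "train.csv",
    "test.csv",
    "sample_submission.csv",
]

def missing_entries(base_dir, expected=EXPECTED_ENTRIES):
    """Return the expected files/directories that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).exists()]

print("missing:", missing_entries("./"))
```

An empty list means every expected file and directory was found.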

# 1. Loading Libraries

In [1]:
import pandas as pd
import numpy as np 
import os
from pathlib import Path
import random
import tensorflow as tf
from datetime import datetime, date

from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.layers import Conv2D, MaxPooling2D,GlobalAveragePooling2D, Flatten, Dense, Dropout
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split


import warnings
warnings.filterwarnings('ignore')

2023-04-28 19:12:07.850443: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-28 19:12:07.947677: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-04-28 19:12:07.947695: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-04-28 19:12:07.970997: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-04-28 19:12:08.553332: W tensorflow/stream_executor/platform/de

In [2]:
SEED = 1
EPOCHS = 300
BATCH_SIZE = 32
NUM_CLASSES = 2
VERBOSE_LEVEL = 1
SAVE_OUTPUT = True
IMG_SIZE = (224, 224)
INPUT_SHAPE = (224, 224, 3)

CWD = os.getcwd()
warnings.filterwarnings('ignore')

In [3]:
BASE_PATH = './'
PATH_TO_IMAGES = './train_color/train_color/' 
IMAGE_TYPE = ".jpg"

In [4]:
def check_image(file_path):
    """Helper function to validate an image path.

    Parameters:
        file_path (string): Path to the image

    Returns:
        The file path if the file exists, otherwise False
    """
    img_file = Path(file_path)
    if img_file.is_file():
        return file_path
    return False

In [5]:
def get_train_data():
    """Helper function to load the training metadata and keep only rows whose image exists."""
    # read the data from the train.csv file
    train = pd.read_csv(os.path.join(BASE_PATH, 'train.csv'))
    # add the image_path to the train set
    train['image_path'] = train['image_name'].apply(lambda x: PATH_TO_IMAGES + x + IMAGE_TYPE)
    # check whether the image file exists (check_image returns False if it does not)
    train['image_path'] = train['image_path'].apply(check_image)
    # rows without an image are excluded
    train = train[train['image_path'] != False]
    print("valid rows in train", train.shape[0])
    return train

In [6]:
train_df = get_train_data()

valid rows in train 33126


In [7]:
train_df.head()

Unnamed: 0,image_name,patient_id,sex,age_approx,anatom_site_general_challenge,diagnosis,benign_malignant,target,tfrecord,width,height,image_path
0,ISIC_2637011,IP_7279968,male,45.0,head/neck,unknown,benign,0,0,6000,4000,./train_color/train_color/ISIC_2637011.jpg
1,ISIC_0015719,IP_3075186,female,45.0,upper extremity,unknown,benign,0,0,6000,4000,./train_color/train_color/ISIC_0015719.jpg
2,ISIC_0052212,IP_2842074,female,50.0,lower extremity,nevus,benign,0,6,1872,1053,./train_color/train_color/ISIC_0052212.jpg
3,ISIC_0068279,IP_6890425,female,45.0,head/neck,unknown,benign,0,0,1872,1053,./train_color/train_color/ISIC_0068279.jpg
4,ISIC_0074268,IP_8723313,female,55.0,upper extremity,unknown,benign,0,11,6000,4000,./train_color/train_color/ISIC_0074268.jpg


# Metadata Description
### Columns of `train.csv`
* image_name - unique identifier, points to filename of related DICOM image
* patient_id - unique patient identifier
* sex - the sex of the patient (when unknown, will be blank)
* age_approx - approximate patient age at time of imaging
* anatom_site_general_challenge - location of imaged site
* diagnosis - detailed diagnosis information (train only)
* benign_malignant - indicator of malignancy of imaged lesion
* target - binarized version of the target variable

In [8]:
def check_for_missing_and_null(df):
    """Helper function to check a dataframe for missing and zero values.

    Parameters:
        df (dataframe): The dataframe to check

    Returns:
        A dataframe with the percentage of missing values, the percentage of
        zero values, and their sum (total_zero) for each column
    """
    percent_null = df.isnull().sum() * 100 / len(df)
    percent_zero = df.isin([0]).sum() * 100 / len(df)
    null_df = pd.DataFrame({'columns': df.columns,
                            'percent_null': percent_null,
                            'percent_zero': percent_zero,
                            'total_zero': percent_null + percent_zero,
                           })
    return null_df

check_for_missing_and_null(train_df)

Unnamed: 0,columns,percent_null,percent_zero,total_zero
image_name,image_name,0.0,0.0,0.0
patient_id,patient_id,0.0,0.0,0.0
sex,sex,0.19622,0.0,0.19622
age_approx,age_approx,0.205277,0.006038,0.211314
anatom_site_general_challenge,anatom_site_general_challenge,1.590895,0.0,1.590895
diagnosis,diagnosis,0.0,0.0,0.0
benign_malignant,benign_malignant,0.0,0.0,0.0
target,target,0.0,98.237034,98.237034
tfrecord,tfrecord,0.0,6.586971,6.586971
width,width,0.0,0.0,0.0


In [9]:
train = train_df.dropna()

In [10]:
# getting dummy variables for gender
sex_dummies = pd.get_dummies(train['sex'], prefix='sex', dtype="int")
train = pd.concat([train, sex_dummies], axis=1)

# getting dummy variables for anatom_site_general_challenge
anatom_dummies = pd.get_dummies(train['anatom_site_general_challenge'], prefix='anatom', dtype="int")
train = pd.concat([train, anatom_dummies], axis=1)

# getting dummy variables for target column
#target_dummies = pd.get_dummies(train['target'], prefix='target', dtype="int")
#train = pd.concat([train, target_dummies], axis=1)

# dropping not useful columns
train.drop(['sex','diagnosis','benign_malignant','anatom_site_general_challenge'], axis=1, inplace=True)

# replace missing age values with the mean age
# (a no-op here, because dropna() above already removed rows with missing values)
train['age_approx'] = train['age_approx'].fillna(int(np.mean(train['age_approx'])))

# convert age to int
train['age_approx'] = train['age_approx'].astype('int')

print("rows in train", train.shape[0])

rows in train 32531


In [11]:
# balance = number of negative samples drawn per positive sample
# 1 => 50% positives (equal numbers of positive and negative cases in training)
# 4 => 20% positives; 8 => ~11%; 12 => ~8%
balance = 1
p_inds = train[train.target == 1].index.tolist()
np_inds = train[train.target == 0].index.tolist()

np_sample = random.sample(np_inds, balance * len(p_inds))
train = train.loc[p_inds + np_sample]
print("Samples in train", train['target'].sum()/len(train))
print("Remaining rows in train set", len(train))

Samples in train 0.5
Remaining rows in train set 1150
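
The percentages in the comments of this cell follow from simple arithmetic: drawing `balance` negatives per positive gives a positive share of 1 / (1 + balance). The sketch below reproduces those numbers; the helper names are hypothetical, and the count of 575 positives is inferred from the output above (1150 rows at a 0.5 positive rate).

```python
def positive_fraction(balance):
    """Share of positives when `balance` negatives are sampled per positive."""
    return 1 / (1 + balance)

def balanced_size(n_positive, balance):
    """Total rows kept: all positives plus balance * n_positive negatives."""
    return n_positive * (1 + balance)

for balance in (1, 4, 8, 12):
    print(balance, round(positive_fraction(balance), 3))
# prints 0.5, 0.2, 0.111 and 0.077 for balance 1, 4, 8 and 12

print(balanced_size(575, 1))  # 1150
```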


In [12]:
def create_splits(df, test_size, classToPredict):
    """Helper function to create train, validation, and test datasets.

    Parameters:
        df (dataframe): The dataframe to split
        test_size (float): Fraction of df used for the validation set
        classToPredict (string): The target column used for stratification

    Returns:
        train_data (dataframe)
        val_data (dataframe)
        test_data (dataframe)
    """
    # NOTE: both splits are drawn from the full df, so val_data can share rows
    # with the returned train_data (and with test_data); only train_data and
    # test_data are guaranteed to be disjoint
    _, val_data = train_test_split(df, test_size=test_size, random_state=1, stratify=df[classToPredict])
    train_data, test_data = train_test_split(df, test_size=0.16, random_state=1, stratify=df[classToPredict])
    return train_data, val_data, test_data
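
Because `create_splits` calls `train_test_split` twice on the same dataframe, the validation rows are not excluded from the returned training set. For comparison, here is a minimal sketch of a stratified three-way split in which every row lands in exactly one subset. It is written in plain Python on row positions so the logic is visible; the function name `disjoint_split` and its defaults are assumptions, and it is not the method used in this notebook.

```python
import random

def disjoint_split(labels, val_frac=0.2, test_frac=0.16, seed=1):
    """Split row positions into disjoint train/val/test lists, stratified by label.

    labels: list of class labels, one per row.
    Both fractions are relative to the full dataset (an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_class = {}
    for pos, label in enumerate(labels):
        by_class.setdefault(label, []).append(pos)

    train, val, test = [], [], []
    for positions in by_class.values():
        rng.shuffle(positions)
        n_val = round(len(positions) * val_frac)
        n_test = round(len(positions) * test_frac)
        # consecutive, non-overlapping slices of the shuffled positions
        val.extend(positions[:n_val])
        test.extend(positions[n_val:n_val + n_test])
        train.extend(positions[n_val + n_test:])
    return train, val, test

# toy labels: 575 positives and 575 negatives, matching the balanced set size
labels = [1] * 575 + [0] * 575
train_idx, val_idx, test_idx = disjoint_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 736 230 184
```

With a disjoint split the three subset sizes add up to the 1150 available rows, which is a quick sanity check worth running on any split.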

In [13]:
import matplotlib.pyplot as plt

def save_history(history, timestamp):
    """Helper function to plot the history of a tensorflow model.

    Parameters:
        history (dict): The `history` dict of the History object returned by model.fit
        timestamp (string): The timestamp of the function execution (currently unused)

    Returns:
        None
    """
    f = plt.figure()
    f.set_figwidth(15)

    f.add_subplot(1, 2, 1)
    plt.plot(history['val_loss'], label='val loss')
    plt.plot(history['loss'], label='train loss')
    plt.legend()
    plt.title("Model Loss")

    f.add_subplot(1, 2, 2)
    plt.plot(history['val_accuracy'], label='val accuracy')
    plt.plot(history['accuracy'], label='train accuracy')
    plt.legend()
    plt.title("Model Accuracy")

## Transfer Learning with Resnet

In [14]:
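def extract_features(df):
    # load_img and img_to_array are imported from tensorflow.keras.preprocessing.image above
    features = []
    labels = []
    # iterate over paths and labels together instead of re-querying the dataframe per image
    for img_path, label in zip(df['image_path'], df['target']):
        # target_size expects (height, width), so use IMG_SIZE rather than INPUT_SHAPE
        img = load_img(img_path, target_size=IMG_SIZE)
        # raw pixel values (0-255) are returned; preprocess_input is not applied here
        features.append(img_to_array(img))
        labels.append(label)

    feature_list_np = np.array(features)
    labels_list_np = np.array(labels)

    return feature_list_np, labels_list_np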

In [15]:
# create training, validation, and test datasets from the train df
train_df, val_df, test_df = create_splits(train, 0.2, 'target')

print("rows in train_df", train_df.shape[0])
print("rows in val_df", val_df.shape[0])
print("rows in test_df", test_df.shape[0])

rows in train_df 966
rows in val_df 230
rows in test_df 184


In [16]:
train_features, train_labels = extract_features(train_df)
val_features, val_labels = extract_features(val_df)
test_features, test_labels = extract_features(test_df)

In [17]:
X_train, y_train = train_features, train_labels
X_val, y_val = val_features, val_labels
X_test, y_test = test_features, test_labels

In [18]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

base_model = ResNet50(include_top=False, pooling='avg', weights='imagenet', input_shape=INPUT_SHAPE)
for layer in base_model.layers[:-4]:
    layer.trainable = False

2023-04-28 19:13:58.702042: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-28 19:13:58.702287: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-04-28 19:13:58.702347: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2023-04-28 19:13:58.702404: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2023-04-28 19:13:58.702461: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Co

In [19]:
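# derive the step counts from the data instead of hard-coding the row counts
STEPS_PER_EPOCH = len(X_train) // BATCH_SIZE  # 966 // 32
VALID_STEPS = len(X_val) // BATCH_SIZE        # 230 // 32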
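
Integer division drops the final partial batch, so a few samples are not covered by the fixed step count. The sketch below works through the arithmetic for the row counts printed above; `steps_for` is a hypothetical helper, and rounding up is shown only as an alternative, not as what the notebook does.

```python
import math

def steps_for(n_samples, batch_size, drop_remainder=True):
    """Batches per pass: floor drops the last partial batch, ceil keeps it."""
    if drop_remainder:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

print(steps_for(966, 32), steps_for(230, 32))                # 30 7
print(steps_for(966, 32, False), steps_for(230, 32, False))  # 31 8
print(966 - 30 * 32, 230 - 7 * 32)                           # leftover samples: 6 6
```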

In [31]:
def train_model(model_name='melanoma_model.h5', force_train=False):
    if not force_train and os.path.isfile(model_name):
        print("Loading model from file:", model_name)
        my_model = tf.keras.models.load_model(model_name)
        return my_model
    
    my_model = Sequential([base_model])
    my_model.add(Dense(512, activation='relu'))
    my_model.add(Dropout(0.5))
    my_model.add(Dense(1, activation='sigmoid'))

    my_model.compile(optimizer=Adam(learning_rate=0.0001),
                 loss='binary_crossentropy',
                 metrics=['accuracy', tf.keras.metrics.Recall()])

    checkpoint = ModelCheckpoint(model_name,
                                monitor="val_loss",
                                mode="min",
                                save_best_only=True,
                                verbose=1)

    earlystopping = EarlyStopping(monitor='val_loss',min_delta=0, patience=5, verbose=1, restore_best_weights=True)

    try:
        history = my_model.fit(X_train,y_train,
                               epochs=15,
                               steps_per_epoch=STEPS_PER_EPOCH,
                               validation_data=(X_val,y_val),
                               validation_steps=VALID_STEPS,
                               callbacks=[checkpoint, earlystopping]
                              )
        print("Model training completed successfully.")
        return my_model
    except KeyboardInterrupt:
        print("\nTraining Stopped")
        return False
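
The `EarlyStopping` callback above uses `patience=5` on `val_loss`. As a rough mental model, training stops once the monitored loss has failed to improve for `patience` consecutive epochs. The sketch below is a simplified re-implementation of that rule in plain Python, not the Keras source, and the loss values are hypothetical rather than taken from this run.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if training would run through every epoch in val_losses.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # improvement: remember it and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# hypothetical validation losses, not taken from the run
losses = [0.60, 0.55, 0.56, 0.57, 0.58, 0.59, 0.61, 0.50]
print(early_stopping_epoch(losses))  # (7, 2)
```

In this toy sequence training would stop after epoch 7, and with `restore_best_weights=True` the weights from epoch 2 would be restored, so the later improvement at epoch 8 is never reached.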

In [32]:
my_model = train_model('melanoma_model.h5',force_train=True)

Epoch 1/15


2023-04-28 19:16:23.771860: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 581640192 exceeds 10% of free system memory.




2023-04-28 19:17:32.535084: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 138485760 exceeds 10% of free system memory.



Epoch 1: val_loss improved from inf to 0.57203, saving model to melanoma_model.h5
Epoch 2/15
Epoch 2: val_loss improved from 0.57203 to 0.55216, saving model to melanoma_model.h5
Epoch 3/15
Epoch 3: val_loss improved from 0.55216 to 0.52197, saving model to melanoma_model.h5
Epoch 4/15
Epoch 4: val_loss improved from 0.52197 to 0.51727, saving model to melanoma_model.h5
Epoch 5/15
Epoch 5: val_loss improved from 0.51727 to 0.49271, saving model to melanoma_model.h5
Epoch 6/15
Epoch 6: val_loss did not improve from 0.49271
Epoch 7/15
Epoch 7: val_loss improved from 0.49271 to 0.47370, saving model to melanoma_model.h5
Epoch 8/15
Epoch 8: val_loss improved from 0.47370 to 0.46237, saving model to melanoma_model.h5
Epoch 9/15
Epoch 9: val_loss improved from 0.46237 to 0.45768, saving model to melanoma_model.h5
Epoch 10/15
Epoch 10: val_loss improved from 0.45768 to 0.45590, saving model to melanoma_model.h5
Epoch 11/15
Epoch 11: val_loss did not improve from 0.45590
Epoch 12/15
Epoch 12:

In [36]:
probabilities = my_model.predict(X_test)



In [37]:
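def pred_to_binary(pred):
    # threshold a single predicted probability at 0.5
    if pred < 0.5:
        return 0
    else:
        return 1

# predict() returns an array of shape (n, 1); flatten it so each x is a scalar
y_pred_CNN = [pred_to_binary(x) for x in probabilities.flatten()]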

In [39]:
from sklearn import metrics

print('Accuracy:', np.round(metrics.accuracy_score(y_test, y_pred_CNN),4))

Accuracy: 0.7717
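
Accuracy here is the share of test images whose thresholded prediction matches the label. The following sketch shows the two steps (threshold, then compare) in plain Python; the helper names and the toy values are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels; values at or above the threshold become 1."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# hypothetical labels and probabilities
y_true = [0, 1, 1, 0]
probs = [0.2, 0.7, 0.4, 0.5]
y_hat = to_labels(probs)
print(y_hat, accuracy(y_true, y_hat))  # [0, 1, 0, 1] 0.5
```

For the 184 test rows above, an accuracy of 0.7717 is consistent with 142 correct predictions, since 142 / 184 rounds to 0.7717.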


In [23]:
test_images_id = test_df['image_name']
print('Computing predictions...')
probabilities = my_model.predict(X_test)

Computing predictions...


KeyboardInterrupt: 

In [None]:
# Create DataFrame
results_df = pd.DataFrame({
    'image_name': test_images_id,
    'target': probabilities.flatten()
})
results_df.head()

In [None]:
sub = test_df[['image_name','target']]
sub.head()

In [None]:
del sub['target']
sub = sub.merge(results_df, on='image_name')
sub.to_csv('submission_image.csv', index=False)
sub.head()

## Training using Tabular Data

In [None]:
# tabular feature columns shared by the train, validation, and test sets
FEATURE_COLS = ['sex_male', 'anatom_head/neck',
       'anatom_lower extremity', 'anatom_oral/genital', 'anatom_palms/soles',
       'anatom_torso', 'anatom_upper extremity', 'age_approx']

X_train, y_train = train_df[FEATURE_COLS], train_df['target']
X_val, y_val = val_df[FEATURE_COLS], val_df['target']
X_test, y_test = test_df[FEATURE_COLS], test_df['target']

In [None]:
import xgboost as xgb

classifier_xgb = xgb.XGBClassifier(n_estimators = 300)
classifier_xgb.fit(X_train, y_train)

In [None]:
y_pred_xgb = classifier_xgb.predict_proba(X_test)
y_pred_xgb = y_pred_xgb[:, 1]

In [None]:
# Create DataFrame
results_df = pd.DataFrame({
    'image_name': test_images_id,
    'target': y_pred_xgb
})
results_df.head()

In [None]:
sub = test_df[['image_name','target']]
sub.head()

In [None]:
del sub['target']
sub = sub.merge(results_df, on='image_name')
sub.to_csv('submission_tabular.csv', index=False)
sub.head()

## Use Both Image and Tabular Data

Kaggle's Melanoma Classification competition provides both image data and tabular data about each sample. Our task is to use both types of data to predict the probability that a sample is malignant. How can we build a model that uses both images and tabular data?

Three ideas come to mind.

1. Build a CNN image model and find a way to feed the tabular data into it
2. Build a tabular data model and find a way to extract image embeddings and feed them into it
3. Build two separate models and ensemble their predictions



In [None]:
image_sub = pd.read_csv('./submission_image.csv')
tabular_sub = pd.read_csv('./submission_tabular.csv')
image_sub.head()

We ensemble the two submissions with a weighted average: 0.9 for the image model and 0.1 for the tabular model.

In [None]:
sub = image_sub.copy()
sub.target = 0.9 * image_sub.target.values + 0.1 * tabular_sub.target.values
sub.to_csv('submission.csv',index=False)
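
Blending with `.values` assumes that both submission files list the same images in the same row order. A more defensive variant matches predictions by `image_name` first. The sketch below shows that idea with plain dictionaries; `blend_by_key`, the image names, and the probabilities are all hypothetical.

```python
def blend_by_key(image_preds, tabular_preds, image_weight=0.9):
    """Weighted average of two {image_name: probability} dicts.

    Only image names present in both inputs are blended, so a difference in
    row order or a missing row cannot silently pair up the wrong predictions.
    """
    tabular_weight = 1.0 - image_weight
    return {
        name: image_weight * image_preds[name] + tabular_weight * tabular_preds[name]
        for name in image_preds
        if name in tabular_preds
    }

# hypothetical image names and probabilities
image_preds = {"ISIC_A": 0.80, "ISIC_B": 0.10}
tabular_preds = {"ISIC_B": 0.30, "ISIC_A": 0.40}  # different order on purpose
blended = blend_by_key(image_preds, tabular_preds)
print({k: round(v, 2) for k, v in blended.items()})  # {'ISIC_A': 0.76, 'ISIC_B': 0.12}
```

Comparing the number of blended entries with the number of rows in each submission is a cheap way to detect images that appear in only one file.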