<a href="https://colab.research.google.com/github/faisalnazir1213/BPAI/blob/main/Datathon_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BioPhotonics and Artificial Intelligence School**   
## **Florence, 17-21 February 2025**   
## **Datathon**


# Libraries and data loading. Functions and CNN definition.

In [None]:
### Connecting gdrive into the google colab ###
from google.colab import drive
ROOT_PATH = '/content/gdrive'
drive.mount(ROOT_PATH, force_remount=True)
ROOT_PATH += '/Shared drives/Scuola_BPAI/Scuola_BPAI_2025/Datathon_2025/BPAI2025-Datathon/' # insert here your_path

Mounted at /content/gdrive


In [None]:
### Import libraries ###
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dropout, Flatten, Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
from sklearn.model_selection import GroupShuffleSplit
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import tensorflow as tf
import random
random.seed(12345)

In [None]:
### Functions definition ###

# Function that creates the dataset for the binary classification problem, with n instances for each class
def Binary_DataFrame_creation(df, cell_type, n):
  # Keep only lesions that appear once (keep=False drops every row whose lesion_id is duplicated).
  # A new DataFrame is returned, so the caller's DataFrame is not modified.
  df = df.drop_duplicates(subset=["lesion_id"], keep=False)

  # Select only the first n instances of each class and build the new DataFrame
  df1 = df.loc[df['dx'] == cell_type[0], :].head(n)
  df2 = df.loc[df['dx'] == cell_type[1], :].head(n)
  df = pd.concat([df1, df2])
  df['dx_cat'] = df['dx'].astype('category').cat.codes  # integer label for each class
  return df

# Function that splits the data into training and test sets according to the validation scheme and standardizes the images
def validation_scheme(df, seed = 1234):
  features=df.drop(columns=['dx_cat'],axis=1) # all the columns except the 'dx_cat' one
  target=df['dx_cat']

  x_train_o, x_test_o, y_train_o, y_test_o = train_test_split(features, target, test_size=0.20,random_state=seed, shuffle = True) # Hold-out validation scheme

  # The next lines of code perform the image standardization
  x_train = np.array(x_train_o['image'].tolist())
  x_test = np.array(x_test_o['image'].tolist())
  x_train_mean = np.mean(x_train)
  x_train_std = np.std(x_train)
  x_train = (x_train - x_train_mean)/x_train_std
  x_test = (x_test - x_train_mean)/x_train_std

  # Perform one-hot encoding on the labels
  y_train = to_categorical(y_train_o, num_classes = 2)
  y_test = to_categorical(y_test_o, num_classes = 2)

  return x_train, y_train, x_test, y_test
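
The two preprocessing steps used in `validation_scheme` can be illustrated on toy data. This is only a minimal sketch: the array sizes, the random pixel values, and the `toy_*` names are assumptions made for illustration and are not part of the datathon data. It shows that the mean and standard deviation are computed on the training images only and then reused for the test images, and that one-hot encoding turns integer labels into two-column vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images" (hypothetical sizes): 8 training and 2 test images of 4x4 pixels, 3 channels
toy_train = rng.integers(0, 256, size=(8, 4, 4, 3)).astype(float)
toy_test = rng.integers(0, 256, size=(2, 4, 4, 3)).astype(float)

# Statistics are computed on the training set only...
train_mean = toy_train.mean()
train_std = toy_train.std()

# ...and applied to both sets, so no information from the test set is used
toy_train_s = (toy_train - train_mean) / train_std
toy_test_s = (toy_test - train_mean) / train_std

# One-hot encoding of integer labels (NumPy equivalent of to_categorical for 2 classes)
toy_labels = np.array([0, 1, 1, 0])
toy_one_hot = np.eye(2)[toy_labels]

print("train mean/std after standardization:", toy_train_s.mean(), toy_train_s.std())
print(toy_one_hot)
```

After standardization the training set has mean 0 and standard deviation 1 by construction; the test set is only approximately standardized, because it is scaled with the training statistics.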

**CNN**   

To implement the CNN architecture, we followed the strategy proposed in https://www.kaggle.com/code/sid321axn/step-wise-approach-cnn-model-77-0344-accuracy.    
We used the Keras Sequential API, where you just add one layer at a time, starting from the input.

The first is the convolutional (Conv2D) layer, which acts as a set of learnable filters. We chose 32 filters for the first two Conv2D layers and 64 filters for the last two. Each filter transforms a patch of the image (whose size is defined by the kernel size) using its kernel matrix, and the same kernel is slid across the whole image. Each filter can therefore be seen as a transformation of the image.

The CNN can isolate features that are useful everywhere from these transformed images (feature maps).
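
To make the idea of a kernel sliding over the image concrete, here is a minimal NumPy sketch. It is a simplification under stated assumptions: `conv2d_same` is a hypothetical helper, the image has a single channel, and the vertical-edge kernel is written by hand, whereas a Conv2D layer learns its kernels, works on all channels, and adds a bias.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Slide a 2D kernel over a 2D image with zero padding, so the output keeps the input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Element-wise product of the kernel with the patch centred on pixel (i, j)
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 4x4 image: dark left half, bright right half
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)

# Hand-written vertical-edge kernel (illustrative only)
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

feature_map = conv2d_same(image, kernel)
print(feature_map)
```

The feature map responds most strongly in the columns next to the dark-to-bright transition, which is the sense in which a filter "detects" a feature wherever it occurs in the image.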

The second important layer in a CNN is the pooling (MaxPool2D) layer, which acts as a downsampling filter. With a pool size of (2, 2), it looks at each 2×2 block of neighboring pixels and keeps the maximal value. Pooling reduces the computational cost and, to some extent, overfitting. The pooling size (i.e., the area pooled each time) has to be chosen: the larger it is, the stronger the downsampling.
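
A 2×2 max pooling step can be sketched in a few lines of NumPy. The helper name `max_pool_2x2` and the toy feature map are assumptions for illustration; the sketch handles a single 2D feature map with non-overlapping blocks and drops an odd trailing row or column.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block of a 2D feature map."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Crop an odd trailing row/column, then group the pixels into 2x2 blocks
    blocks = feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]])

pooled = max_pool_2x2(fm)
print(pooled)  # each entry is the maximum of one 2x2 block: 4, 2, 2, 8
```

Each pooling step halves the height and the width, so the number of values passed to the next layer drops by a factor of four.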

By combining convolutional and pooling layers, CNNs are able to combine local features and learn more global features of the image.

Dropout is a regularization method in which a proportion of the nodes in a layer is randomly ignored (their outputs are set to zero) for each training sample. Randomly dropping part of the network forces it to learn features in a distributed way, which improves generalization and reduces overfitting.
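
The mechanism can be sketched with NumPy as follows. This is a simplified illustration, not the Keras implementation: the function name `dropout` and the toy activations are assumptions. Like the Keras Dropout layer, the sketch rescales the surviving activations by 1 / (1 - rate) during training and does nothing at inference time.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero a random fraction `rate` of the activations and rescale the rest."""
    if not training:
        return activations  # dropout is inactive at inference time
    keep_mask = rng.random(activations.shape) >= rate
    # Rescaling keeps the expected sum of the activations unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(1234)
acts = np.ones(1000)

dropped = dropout(acts, rate=0.25, rng=rng)
print("fraction set to zero:", np.mean(dropped == 0))
print("value of kept units:", dropped.max())
```

Because a different random mask is drawn for every call, no single unit can be relied on during training, which is what pushes the network toward distributed features.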

'relu' is the rectifier activation function, max(0, x). It is used to add non-linearity to the network.
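
As a small illustration, the rectifier can be written directly with NumPy (the function name `relu` and the input values are chosen here for the example):

```python
import numpy as np

def relu(x):
    """Rectifier: negative inputs become 0, positive inputs pass through unchanged."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))
```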

The Flatten layer is used to convert the final feature maps into a single 1D vector. This flattening step is needed so that fully connected layers can be used after the convolutional/max-pooling layers. It combines all the local features found by the previous convolutional layers.
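
The length of the flattened vector can be worked out by hand for the architecture used in this notebook: 75×100 input images, 'same' padding in the convolutions (which keeps height and width unchanged), two 2×2 max pooling layers, and 64 filters in the last convolution. The helper `flattened_size` below is a hypothetical name introduced only for this calculation.

```python
def flattened_size(height, width, n_pools, channels, pool=2):
    """Spatial size after n_pools pooling layers and the resulting flattened length."""
    for _ in range(n_pools):
        # Each pooling layer divides height and width by the pool size (rounding down)
        height //= pool
        width //= pool
    return height, width, height * width * channels

h, w, n_features = flattened_size(75, 100, n_pools=2, channels=64)
print(h, w, n_features)  # 18 25 28800
```

The first Dense layer therefore receives 28,800 inputs, so with 128 units it holds most of the trainable parameters of the network.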

In the end, we fed the features into two fully connected (Dense) layers, which act as an artificial neural network (ANN) classifier. The last layer (Dense(2, activation="softmax")) outputs a probability distribution over the two classes.
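
How softmax turns the raw outputs of the last layer into class probabilities can be sketched with NumPy. The function name `softmax` and the example scores are assumptions for illustration; subtracting the maximum is a standard trick for numerical stability and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Two samples, two classes (hypothetical raw scores)
logits = np.array([[2.0, 0.0],
                   [-1.0, 1.0]])

probs = softmax(logits)
print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

The predicted class is the one with the highest probability, while the AUC metric used in this notebook is computed from the probabilities themselves.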

In [None]:
# Function that designs the CNN architecture, trains it, and evaluates it on the training and test data

def CNN(x_train, y_train, x_test, y_test, n_epochs=3, batch_size=10, seed=1234):
    # Set seeds for reproducibility
    tf.random.set_seed(seed)
    np.random.seed(seed)
    random.seed(seed)

    input_shape = (75, 100, 3)
    num_classes = 2

    # Model definition
    model = Sequential([
        Input(shape=input_shape),
        Conv2D(32, kernel_size=(3, 3), activation='relu', padding='same',
               kernel_initializer=tf.keras.initializers.GlorotUniform(seed=seed)),
        Conv2D(32, kernel_size=(3, 3), activation='relu', padding='same',
               kernel_initializer=tf.keras.initializers.GlorotUniform(seed=seed)),
        MaxPool2D(pool_size=(2, 2)),
        Dropout(0.25, seed=seed),  # Added seed to dropout

        Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same',
               kernel_initializer=tf.keras.initializers.GlorotUniform(seed=seed)),
        Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same',
               kernel_initializer=tf.keras.initializers.GlorotUniform(seed=seed)),
        MaxPool2D(pool_size=(2, 2)),
        Dropout(0.40, seed=seed),  # Added seed to dropout

        Flatten(),
        Dense(128, activation='relu',
              kernel_initializer=tf.keras.initializers.GlorotUniform(seed=seed)),
        Dropout(0.5, seed=seed),  # Added seed to dropout
        Dense(num_classes, activation='softmax',
              kernel_initializer=tf.keras.initializers.GlorotUniform(seed=seed))
    ])

    # Display model summary
    #model.summary()

    # Define the optimizer
    optimizer = Adam(learning_rate=0.001)

    # Compile the model
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["AUC"])

    # Train the model
    history = model.fit(
        x_train,
        y_train,
        batch_size=batch_size,
        epochs=n_epochs,
        validation_data=(x_test, y_test),  # The held-out test set is monitored as validation data
        verbose=1
    )

    # Evaluate the model
    loss_test, score_test = model.evaluate(x_test, y_test, verbose=0, batch_size=batch_size)
    loss_train, score_train = model.evaluate(x_train, y_train, verbose=0, batch_size=batch_size)

    return score_train, score_test

In [None]:
# Data loading
skin_df = pd.read_pickle(os.path.join(ROOT_PATH, 'data.pkl'))

# High-performance example: NV vs. MEL classification

In [None]:
cell_type = ['nv', 'mel'] # select all the images of type 'nv' or 'mel'
n = 100 # select n instances for each class (the resulting dataset will contain 2n images)

In [None]:
# Creation of the dataset of size 2n, containing images belonging to the two classes selected
skin_sel_df = Binary_DataFrame_creation(skin_df, cell_type, n)
print("Skin DataFrame shape:", skin_sel_df.shape)

Skin DataFrame shape: (200, 9)


In [None]:
# Creation of the training and test set, using the validation scheme defined in the function
x_train, y_train, x_test, y_test = validation_scheme(skin_sel_df, seed = 1234)

print("x_train shape:", None if x_train is None else x_train.shape)
print("y_train shape:", None if y_train is None else y_train.shape)
print("x_test shape:", None if x_test is None else x_test.shape)
print("y_test shape:", None if y_test is None else y_test.shape)

x_train shape: (160, 75, 100, 3)
y_train shape: (160, 2)
x_test shape: (40, 75, 100, 3)
y_test shape: (40, 2)


In [None]:
# Model training and testing
score_train, score_test = CNN(x_train, y_train, x_test, y_test, n_epochs=3, batch_size=10, seed=12345)
print("The AUC-ROC in training set is", score_train)
print("The AUC-ROC in test set is", score_test)

Epoch 1/3
16/16 ━━━━━━━━━━━━━━━━━━━━ 11s 460ms/step - AUC: 0.7195 - loss: 0.7507 - val_AUC: 0.9162 - val_loss: 0.3644
Epoch 2/3
16/16 ━━━━━━━━━━━━━━━━━━━━ 11s 518ms/step - AUC: 0.9376 - loss: 0.3140 - val_AUC: 0.9931 - val_loss: 0.1502
Epoch 3/3
16/16 ━━━━━━━━━━━━━━━━━━━━ 6s 394ms/step - AUC: 0.9782 - loss: 0.1854 - val_AUC: 0.9819 - val_loss: 0.1606
The AUC-ROC in training set is 0.9807030558586121
The AUC-ROC in test set is 0.9818750619888306


# Low-performance example: BKL vs. MEL classification

In [None]:
cell_type = ['bkl', 'mel'] # select all the images of type 'bkl' or 'mel'
n = 100 # select n instances for each class (the resulting dataset will contain 2n images)

In [None]:
# Creation of the dataset of size 2n, containing images belonging to the two classes selected
skin_sel_df = Binary_DataFrame_creation(skin_df, cell_type, n)
print("Skin DataFrame shape:", skin_sel_df.shape)

Skin DataFrame shape: (200, 9)


In [None]:
# Creation of the training and test set, using the validation scheme defined in the function
x_train, y_train, x_test, y_test = validation_scheme(skin_sel_df, seed = 1234)

print("x_train shape:", None if x_train is None else x_train.shape)
print("y_train shape:", None if y_train is None else y_train.shape)
print("x_test shape:", None if x_test is None else x_test.shape)
print("y_test shape:", None if y_test is None else y_test.shape)

x_train shape: (160, 75, 100, 3)
y_train shape: (160, 2)
x_test shape: (40, 75, 100, 3)
y_test shape: (40, 2)


In [None]:
# Model training and testing
score_train, score_test = CNN(x_train, y_train, x_test, y_test, n_epochs=3, batch_size=10, seed=12345)
print("The AUC-ROC in training set is", score_train)
print("The AUC-ROC in test set is", score_test)

Epoch 1/3
16/16 ━━━━━━━━━━━━━━━━━━━━ 11s 546ms/step - AUC: 0.4983 - loss: 1.0415 - val_AUC: 0.6550 - val_loss: 0.6817
Epoch 2/3
16/16 ━━━━━━━━━━━━━━━━━━━━ 8s 382ms/step - AUC: 0.6312 - loss: 0.6793 - val_AUC: 0.7609 - val_loss: 0.6630
Epoch 3/3
16/16 ━━━━━━━━━━━━━━━━━━━━ 10s 391ms/step - AUC: 0.6334 - loss: 0.6675 - val_AUC: 0.7434 - val_loss: 0.6202
The AUC-ROC in training set is 0.7586132884025574
The AUC-ROC in test set is 0.7434375286102295
