<center>
    <h1>Performance comparison of different models</h1>
</center>

In this notebook, we train several models on a small subset of the dataset in order to compare them and identify which one best suits our problem.

<table align="left">
    <td>
        <a href="https://colab.research.google.com/github/dailoht/Epitech_Zoidberg2.0/blob/main/notebooks/models/pab-model_comparison.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
    </td>
</table>

# 1. Setup

In [None]:
# Import base libraries
import sys
import os
from pathlib import Path
import time

# Import scientific libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import Tensorflow and Keras
import tensorflow as tf
from tensorflow import keras

# Import scikit-learn
from sklearn.utils import class_weight

# Check running environment
try:
    import google.colab  # only importable inside Google Colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

# Add project directory to kernel paths
if IN_COLAB:
    print("We're running on Colab")
    !git clone https://github.com/dailoht/Epitech_Zoidberg2.0.git
    sys.path.append('./Epitech_Zoidberg2.0')
else:
    print("We're running locally")
    sys.path.append('../..')

2023-04-18 09:56:42.017747: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


We're running locally


In [None]:
# Import custom functions
from src.visualization.plot_lib import default_viz
from src.data.file_manager import FileManager
from src.data.tf_utils import load_image_dataset_from_tfrecord, define_distribute_strategy
from src.data.evaluation import Evaluation

zoidbergManager = FileManager()
strategy = define_distribute_strategy()
evaluation = Evaluation()

# Set default graphics visualization
%matplotlib inline
default_viz()

Selected distribution strategy:                     _DefaultDistributionStrategy


2023-04-18 09:56:55.727884: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [None]:
# set random seed for keras, numpy, tensorflow, and the 'random' module
SEED = 42
tf.keras.utils.set_random_seed(SEED)
os.environ['PYTHONHASHSEED'] = str(SEED)

# 2. Loading dataset

In [None]:
BATCH_SIZE = 32
SMALL_TRAIN_SPLIT = 0.1
SMALL_VAL_SPLIT = 0.15
class_names = ['batceria', 'normal', 'virus']

First, let's load the training and validation datasets:

In [None]:
processed_dir_path = zoidbergManager.data_dir / 'processed'

train_path = str(processed_dir_path / 'train_512x512_rgb_ds.tfrecord')
val_path = str(processed_dir_path / 'val_512x512_rgb_ds.tfrecord')

train_ds = load_image_dataset_from_tfrecord(train_path)
val_ds = load_image_dataset_from_tfrecord(val_path)

Then, we extract a small sample from each dataset, so that every model is trained and validated on the same small subset.
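
The idea of shuffling with a fixed seed and then keeping only the first fraction can be illustrated without TensorFlow. The sketch below is only an analogy in plain Python: `take_seeded_sample` is a hypothetical helper, not a function from this project, and it works on a list rather than on a `tf.data.Dataset`.

```python
import random


def take_seeded_sample(items, fraction, seed=42):
    """Shuffle a copy of `items` with a fixed seed and keep the first `fraction` of it."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    sample_size = int(len(shuffled) * fraction)  # same rounding as int(n * split)
    return shuffled[:sample_size]


# Hypothetical stand-in for 100 image identifiers
image_ids = list(range(100))
sample = take_seeded_sample(image_ids, fraction=0.1)
print(len(sample))
```

Because the seed is fixed, calling the helper twice returns the same sample, which is what makes a comparison between models fair.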

In [None]:
num_train_img = train_ds.reduce(0, lambda x, _: x + 1).numpy()
num_val_img = val_ds.reduce(0, lambda x, _: x + 1).numpy()

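# Shuffle data once. reshuffle_each_iteration=False keeps the order (and thus
# the sample taken below) identical every time the dataset is iterated.
train_ds = train_ds.shuffle(buffer_size=num_train_img, seed=SEED, reshuffle_each_iteration=False)
val_ds = val_ds.shuffle(buffer_size=num_val_img, seed=SEED, reshuffle_each_iteration=False)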

# Extract a sample
small_train_size = int(num_train_img * SMALL_TRAIN_SPLIT)
small_val_size = int(num_val_img * SMALL_VAL_SPLIT)

small_train_ds = train_ds.take(small_train_size)
small_val_ds = val_ds.take(small_val_size)

def count_img_by_class(dataset, class_names=class_names):
    """Count images per class in an unbatched dataset of (image, one-hot label) pairs."""
    num_img_by_classes = {name: 0 for name in class_names}
    for _, label in dataset:
        # The position of the non-zero entry of the one-hot label is the class index
        class_idx = int(np.argmax(label.numpy()))
        num_img_by_classes[class_names[class_idx]] += 1
    return num_img_by_classes

print("In training dataset, there are :")
for class_name, num_img in count_img_by_class(small_train_ds).items():
    print(f"  - {num_img} files for class {class_name}")    
print("\nIn val dataset, there are :")
for class_name, num_img in count_img_by_class(small_val_ds).items():
    print(f"  - {num_img} files for class {class_name}")

In training dataset, there are :
  - 221 files for class batceria
  - 121 files for class normal
  - 126 files for class virus

In val dataset, there are :
  - 20 files for class batceria
  - 11 files for class normal
  - 12 files for class virus


Finally, the datasets need two more steps before they can be used:
- batching images
- prefetching images
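
To make the batching step concrete, here is a minimal pure-Python sketch. It is an analogy, not the `tf.data` implementation: `iter_batches` is a hypothetical helper, and the sizes 468 and 32 are the size of the small training sample printed above and `BATCH_SIZE`.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive chunks of `items`; the last chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 468 = 221 + 121 + 126 images in the small training sample
batches = list(iter_batches(list(range(468)), batch_size=32))
num_batches = len(batches)
last_batch_size = len(batches[-1])

# The number of batches is the ceiling of 468 / 32
assert num_batches == math.ceil(468 / 32)
print(num_batches, last_batch_size)
```

With these sizes there are 15 batches, and the last one holds only 20 images. Prefetching does not change this count; it only prepares the next batch while the current one is being processed.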

In [None]:
# Batch & prefetch data to improve computation time
small_train_ds = small_train_ds.batch(BATCH_SIZE).prefetch(
    buffer_size=tf.data.AUTOTUNE)

small_val_ds = small_val_ds.batch(BATCH_SIZE).prefetch(
    buffer_size=tf.data.AUTOTUNE)

# 3. Training models

Let's now train the models on this small dataset. We set a few useful variables below.  

⚠️ WARNING: Depending on your hardware, the training cells can be computationally expensive and take a long time to run.  
That's why each of these cells is wrapped in an `if` condition (see the `TRAIN_*MODEL*` booleans below).

In [None]:
EPOCHS = 30
LEARNING_RATE = 0.0001
OPTIMIZER = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
LOSS_FUNCTION = 'categorical_crossentropy'

TRAIN_VGG16 = False
TRAIN_VGG19 = False
TRAIN_RESNET50 = False
TRAIN_DENSENET101 = False
TRAIN_XCEPTION = False
TRAIN_EFFICIENTNETB0 = False

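We also define two callbacks:
- `checkpoint_cb`: checks the model after each epoch and saves it only when `val_Matthews_coef` improves, so the checkpoint file always holds the best model so far.
- `earlystopping_cb`: stops training once `val_loss` has not improved for 3 consecutive epochs. This shortens training and helps against overfitting.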
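
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation: `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops.

    Training stops after `patience` consecutive epochs without a new best
    validation loss. If that never happens, all epochs are run.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses: the best value (0.70) is reached at epoch 3
val_losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.60]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

In this made-up run, training stops at epoch 6 and never sees the lower loss of epoch 7, which is the trade-off a small `patience` implies.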

In [None]:
def checkpoint_cb(model):
    checkpoint_dir = zoidbergManager.model_dir / 'checkpoints'
    checkpoint_filepath = checkpoint_dir / f'ckpt_smallds_{model.name}.h5'
    ckpt_cb = keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_filepath,
        monitor='val_Matthews_coef',
        mode='max',
        save_best_only=True
    )
    return ckpt_cb

earlystopping_cb = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=3
)

Next, we compute class weights to compensate for the class imbalance that we observed during data analysis:
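
For reference, scikit-learn's `'balanced'` mode weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula in plain Python. `balanced_class_weights` is a hypothetical helper, and the counts are those of the small training sample printed earlier, used only as an illustration: the notebook computes its weights on the full training set, so the values differ from the ones printed by the cell below.

```python
def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) for each class count."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * count) for count in counts]


# Class counts of the small training sample, in the order of `class_names`
small_sample_counts = [221, 121, 126]
weights = balanced_class_weights(small_sample_counts)
print([round(w, 3) for w in weights])
```

The most frequent class gets a weight below 1 and the rarer classes get weights above 1, so each class contributes roughly equally to the loss.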

In [None]:
# Collect the one-hot labels of the full training set and convert them to class indices
y_train_iterator = train_ds.map(lambda x, y: y).as_numpy_iterator()
y_train = np.argmax(np.fromiter(y_train_iterator, dtype=np.dtype((float, 3))), axis=1)

# 'balanced' gives each class a weight inversely proportional to its frequency
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
dic_class_weights = {}
for idx, weight in enumerate(class_weights):
    dic_class_weights[idx] = weight
    print(f'class {class_names[idx]} => weight : {weight:2f}')

class batceria => weight : 0.693458
class normal => weight : 1.248335
class virus => weight : 1.321207


In [None]:
def train_model(model):
    """Fit `model` on the small dataset; return its history and the training time in seconds."""
    start_time = time.time()
    history = model.fit(small_train_ds,
                        validation_data=small_val_ds,
                        epochs=EPOCHS,
                        # Ceiling division: number of batches, counting a final partial batch
                        steps_per_epoch=-(-small_train_size // BATCH_SIZE),
                        class_weight=dic_class_weights,
                        callbacks=[checkpoint_cb(model), earlystopping_cb],
                        )
    training_time = time.time() - start_time
    return history, training_time

## 3.1 VGG

In [None]:
def make_vgg16():
    base_vgg16 = tf.keras.applications.VGG16(weights='imagenet', input_shape=(224,224,3), include_top=False)
    for layer in base_vgg16.layers:
        layer.trainable = False
    
    vgg16 = tf.keras.Sequential([
        keras.layers.InputLayer(input_shape=(512,512,3), name='input'),
        keras.layers.Resizing(224, 224, interpolation="bilinear", name='resize'),
        keras.layers.Rescaling(scale=1./255., name='rescale'),
        base_vgg16,
        keras.layers.Flatten(name='flatten'),
        keras.layers.Dense(1024, activation='relu', name='fully_conn1'),
        keras.layers.Dense(512, activation='relu', name='fully_conn2'),
        keras.layers.Dense(3, activation='softmax', name='out_softmax'),
    ], name = 'vgg16')

    vgg16.compile(optimizer=OPTIMIZER,
                  loss=LOSS_FUNCTION,
                  metrics=evaluation.get_training_metrics()
                 )
    return vgg16
    
with strategy.scope():
    vgg16 = make_vgg16()

vgg16.summary()

Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 resize (Resizing)           (None, 224, 224, 3)       0         
                                                                 
 rescale (Rescaling)         (None, 224, 224, 3)       0         
                                                                 
 vgg16 (Functional)          (None, 7, 7, 512)         14714688  
                                                                 
 flatten (Flatten)           (None, 25088)             0         
                                                                 
 fully_conn1 (Dense)         (None, 1024)              25691136  
                                                                 
 fully_conn2 (Dense)         (None, 512)               524800    
                                                                 
 out_softmax (Dense)         (None, 3)                 1539  
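
The parameter counts of the dense layers in this summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. A minimal sketch, where `dense_params` is a hypothetical helper introduced only for this check:

```python
def dense_params(n_inputs, n_units):
    """Number of weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


flatten_size = 7 * 7 * 512  # output shape of the VGG16 base, flattened
fully_conn1 = dense_params(flatten_size, 1024)
fully_conn2 = dense_params(1024, 512)
out_softmax = dense_params(512, 3)
print(flatten_size, fully_conn1, fully_conn2, out_softmax)
```

The results match the `Param #` column above (25088 flattened features, then 25691136, 524800 and 1539 parameters). Since the VGG16 base is frozen, only these three layers are trained.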

In [None]:
if TRAIN_VGG16:
    vgg16_history, vgg16_time = train_model(vgg16)

## 3.2 ResNet

## 3.3 DenseNet

## 3.4 Xception

## 3.5 EfficientNet

