In [13]:
import os
import shutil
import pandas as pd

def organizar_csv(ruta_csv, ruta_img, ruta_nueva):
  """Copy the images listed in a CSV file into one subfolder per label."""
  # Load the CSV data (it must contain 'filename' and 'label' columns)
  df = pd.read_csv(ruta_csv)

  # Original location of the images
  original_dir = ruta_img

  # Directory where the organized images will be stored
  new_dir = ruta_nueva
  # Create the directory if it does not exist
  os.makedirs(new_dir, exist_ok=True)

  # Iterate over all rows of the dataframe
  for index, row in df.iterrows():
      # Get the label and the file name
      label = row['label']
      filename = row['filename']

      # Create a directory for the label if it does not exist
      label_dir = os.path.join(new_dir, str(label))
      os.makedirs(label_dir, exist_ok=True)

      # Copy the image to the new directory, renamed after its row index
      original_file_path = os.path.join(original_dir, filename)
      new_file_path = os.path.join(label_dir, f'{index}.jpg')
      shutil.copyfile(original_file_path, new_file_path)

# **2D Classification pipeline**
___  
  
In this notebook we show how to apply a [BiaPy](https://biapy.readthedocs.io/en/latest/) pipeline for **2D classification** of butterfly data.

**Without any coding**, we explain step by step how to
1. **upload a set of training and test images** which need to be organized in folders, one for each class,
2. **train a deep neural network (DNN)** model on the training set,
3. **apply the model** to the test images, and
4. **download the classification results** to your local machine.

**Disclaimer:** The structure of the notebook is heavily inspired by the fantastic [ZeroCostDL4Mic notebooks](https://github.com/HenriquesLab/ZeroCostDL4Mic/wiki).

**Contact:** This notebook has been made by [Ignacio Arganda-Carreras](mailto:ignacio.arganda@ehu.eus) and [Daniel Franco-Barranco](mailto:daniel.franco@dipc.org). If you have any suggestion or comment, or find any problem, please write us an email or [create an issue in BiaPy's repository](https://github.com/danifranco/BiaPy/issues). Thanks!

## **Expected inputs and outputs**
___
**Inputs**

This notebook expects three folders as input:
* **Training raw images**: with the raw 2D images to train the model.
* **Test raw images**: with the raw 2D images to test the model.
* **Output folder**: a path to store the classification results.

**Outputs**

If the execution is successful, a folder containing the classification results will be created. The resulting CSV file can be downloaded at the end of the notebook.
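
As a rough illustration of the expected layout (one subfolder per class), the sketch below lists the class subfolders of a dataset folder and counts the files in each one. The helper name `summarize_class_folders` and the toy folder contents are assumptions made for this example; they are not part of BiaPy.

```python
import os
import tempfile


def summarize_class_folders(data_dir):
    """Return {class_name: number_of_files} for each subfolder of data_dir."""
    summary = {}
    for entry in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, entry)
        if os.path.isdir(class_dir):
            # Count regular files only; nested folders are ignored
            summary[entry] = sum(
                os.path.isfile(os.path.join(class_dir, name))
                for name in os.listdir(class_dir)
            )
    return summary


# Build a tiny throwaway dataset to show the layout: one subfolder per class
demo_root = tempfile.mkdtemp()
for class_name, n_images in [("MALACHITE", 2), ("RED ADMIRAL", 3)]:
    os.makedirs(os.path.join(demo_root, "train", class_name))
    for i in range(n_images):
        # Empty placeholder files stand in for real images
        open(os.path.join(demo_root, "train", class_name, f"{i}.jpg"), "w").close()

layout = summarize_class_folders(os.path.join(demo_root, "train"))
print(layout)
```

Running the same helper on your own training and test folders is a quick way to confirm that every class has its own subfolder before training.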



## **Prepare the environment**
___

Establish connection with Google services. You **must be logged in to Google** to continue.
Since this is not Google's own code, you will probably see a message warning you of the dangers of running unfamiliar code. This is completely normal.

### **Download an example dataset**
---
If you do not have data at hand but would like to test the notebook, no worries! You can run the following cell to download an example dataset.

Specifically, we will use a subset of the [Butterfly](https://www.kaggle.com/datasets/phucthaiv02/butterfly-image-classification) image classification dataset, which is publicly available online.

In [1]:
import os

os.chdir('/content/')
#https://drive.google.com/file/d//view?usp=sharing
!curl -L -s -o archive.zip 'https://drive.google.com/uc?id=1GBsn30dCC_XjGwtFLyMMhnAwTLFzW3vV&confirm=t'

!unzip -q archive.zip
!rm archive.zip

print('Dataset downloaded and unzipped under /content/data')

Dataset downloaded and unzipped under /content/data



## **Check for GPU access**
---

By default, the session should be using Python 3 and GPU acceleration, but it is possible to ensure that these are set properly by doing the following:

Go to **Runtime -> Change the Runtime type**

**Runtime type: Python 3** *(Python 3 is the programming language in which this program is written)*

**Accelerator: GPU** *(Graphics processing unit)*
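
The runtime setting can also be checked from code. The following is a minimal sketch that only relies on the standard library and on the `nvidia-smi` command-line tool, which is assumed to be present on GPU runtimes with an NVIDIA driver. The function name `gpu_visible` is made up for this example, and a positive result does not guarantee that the deep learning framework will use the GPU.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


has_gpu = gpu_visible()
print("GPU detected" if has_gpu else "No GPU detected: check the runtime type")
```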

## **Paths to load input images and save output files**
___

If option 1 (uploading the folder) or option 3 (downloading our prepared data samples) was chosen, set `train_data_path` to '/content/data/train', `val_data_path` to '/content/data/val' (this one can be ignored if validation data is extracted from the training set), `test_data_path` to '/content/data/test' and `output_path` to '/content/out'. Please make sure you download the results from the '/content/out' folder later!

If option 2 was chosen, enter here the paths to your input files and to the folder where you want to store the results, e.g. '/content/gdrive/MyDrive/...'.

If you have trouble finding the path to your folders, click the small folder icon at the top left of this notebook and browse until you find them. You can then copy a folder's path by right-clicking it and selecting "copy".

In [56]:
#@markdown #####Path to train images
train_data_path = '/content/data/train' #@param {type:"string"}
#@markdown #####Path to validation images (necessary only if you do not want to extract validation from train, i.e. when **validation_from_train** variable below is **False**)
val_data_path = '/content/data/train' #@param {type:"string"}
#@markdown #####Path to test images
test_data_path = '/content/data/t_test' #@param {type:"string"}
#@markdown #####Path to store the results (it will be created if it does not exist):
output_path = '/content/output' #@param {type:"string"}

## **Install BiaPy library**

In [23]:
#@markdown ##Play to install BiaPy and its dependencies

import os
import sys
import numpy as np
from tqdm.notebook import tqdm
from skimage.io import imread
from skimage.exposure import match_histograms

# Clone the repo
os.chdir('/content/')
if not os.path.exists('BiaPy'):
    !git clone --depth 1 https://github.com/danifranco/BiaPy.git
    !pip install --upgrade --no-cache-dir gdown &> /dev/null
    sys.path.insert(0, 'BiaPy')
    os.chdir('/content/BiaPy')

    # Install dependencies
    !pip install git+https://github.com/aleju/imgaug.git &> /dev/null
    !pip install numpy_indexed yacs fill_voids edt &> /dev/null
else:
    print( 'Using existing installed version of BiaPy' )

Using existing installed version of BiaPy


## **Configure and train the DNN model**
[BiaPy](https://biapy.readthedocs.io/en/latest/) contains a few deep learning models to perform classification.

The selection of the model and the pipeline hyperparameters can be configured by editing the YAML configuration file or (easier) by running the next cell.
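
Editing the configuration amounts to setting values in a nested dictionary once the YAML file has been loaded. The sketch below illustrates that idea on a toy dictionary. The helper `set_nested` and the starting value `360` are placeholders invented for this example, while the keys mirror the ones updated later in this notebook.

```python
def set_nested(config, dotted_key, value):
    """Set config['A']['B']['C'] = value given the key 'A.B.C', creating levels as needed."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config


# Toy stand-in for the dictionary obtained by loading the YAML template
config = {"TRAIN": {"EPOCHS": 360}, "MODEL": {}}
set_nested(config, "TRAIN.EPOCHS", 100)
set_nested(config, "MODEL.ARCHITECTURE", "simple_cnn")
set_nested(config, "DATA.VAL.SPLIT_TRAIN", 10 / 100.0)
print(config)
```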

In [None]:
organizar_csv('/content/Training_set.csv','/content/train', '/content/data/train')

### **Select your parameters**
---
#### **Name of the model**
* **`model_name`:** Use `my_model` style, not `my-model` (use "_", not "-"). Do not use spaces in the name. Avoid using the name of an existing model saved in the same folder, as it will be overwritten.

#### **Data management**

* **`validation_from_train`:** Select to extract validation data from the training samples. If it is not selected, the validation data path must be set in the **val_data_path** variable above.

* **`percentage_validation`:** Input the percentage of your training dataset you want to use to validate the network during training. **Default value: 10**

* **`test_ground_truth`:** Select to use the folder structure of the test data as the ground truth classes when measuring the model's performance. **Default value: True**
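
To make the effect of `percentage_validation` concrete, here is a small worked example of the arithmetic. The function `split_counts` is hypothetical, and rounding the validation count down is an assumption; the library may round differently.

```python
def split_counts(n_train_images, percentage_validation):
    """Split a sample count into (training, validation) counts."""
    # The percentage is converted to a fraction, e.g. 10 -> 0.1
    fraction = percentage_validation / 100.0
    # Rounding down is an assumption made for this sketch
    n_val = int(n_train_images * fraction)
    return n_train_images - n_val, n_val


# 2475 is the number of training images counted later in this notebook
n_train, n_val = split_counts(2475, 10)
print(n_train, n_val)
```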

#### **Basic training parameters**
* **`number_of_classes`:** Input number of classes present in the problem. It must be equal to the number of subfolders in training and validation (if not extracted from train) folders.

* **`number_of_epochs`:** Input how many epochs (rounds) the network will be trained. For the example dataset, reasonable results can already be observed after 100 epochs. **Default value: 100**

* **`patience`:**  Input how many epochs you want to wait without the model improving its results in the validation set to stop training. **Default value: 20**
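
The interplay between `number_of_epochs` and `patience` can be sketched as follows. This is a simplified illustration, not BiaPy's implementation: it assumes the monitored quantity is a validation loss where lower is better, and the loss values are invented.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without a new
    best validation loss; if that never happens, all epochs are run.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# The best loss (0.6) is reached at epoch 3; three epochs later training stops
losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]
stopped_at = stopping_epoch(losses, patience=3)
print(stopped_at)
```

Note that in this toy run the improvement at the last epoch is never reached, which is the usual trade-off of a small patience value.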

#### **Advanced Parameters - experienced users only**
* **`model_architecture`:**  Select the architecture of the DNN used as backbone of the pipeline. Options: EfficientNet B0 and a simple CNN. **Default value: EfficientNet B0**

* **`batch_size`:** This parameter defines the number of patches seen in each training step. Reducing or increasing the **batch size** may slow or speed up your training, respectively, and can influence network performance. **Default value: 12**

* **`patch_size`:** Input the size of the patches used to train your model (length in pixels in X and Y). The value should be smaller than or equal to the dimensions of the image. **Default value: 28**

* **`input_channels`:** Input the number of channels of your images (grayscale = 1, RGB = 3). **Default value: 3**

* **`optimizer`:** Select the optimizer used to train your model. Options: ADAM, ADAMW, Stochastic Gradient Descent (SGD). ADAM usually converges faster, but SGD is known for better generalization. **Default value: SGD**

* **`initial_learning_rate`:** Input the initial value to be used as learning rate. If you select ADAM as optimizer, this value should be around 10e-4. **Default value: 0.001**
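
Two of these values are easy to misread, so here is a short worked example. The helper `steps_per_epoch` and the sample count of 2228 are illustrative assumptions, not values taken from BiaPy.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (the last one may be smaller)."""
    return math.ceil(n_samples / batch_size)


# A larger batch size means fewer training steps per epoch
print(steps_per_epoch(2228, 12))
print(steps_per_epoch(2228, 32))

# In floating-point notation, 10e-4 means 10 * 10**-4, i.e. 0.001
print(float("10e-4") == 0.001)
```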

In [57]:
#@markdown ###Name of the model:
model_name = "my_2d_classification_reduce" #@param {type:"string"}

#@markdown ### Data management:
validation_from_train = True #@param {type:"boolean"}
percentage_validation =  10 #@param {type:"number"}
test_ground_truth = False #@param {type:"boolean"}

#@markdown ### Basic training parameters:
number_of_classes = 75 #@param {type:"number"}
number_of_epochs = 100 #@param {type:"number"}
patience = 20 #@param {type:"number"}

#@markdown ### Advanced training parameters:

model_architecture = "simple_cnn" #@param ["EfficientNetB0", "simple_cnn"]

batch_size = 32 #@param {type:"number"}
patch_size = 224 #@param {type:"number"}

input_channels = 3 #@param {type:"number"}

optimizer = "ADAMW" #@param ["ADAM", "SGD", "ADAMW"]
initial_learning_rate = 0.001 #@param {type:"number"}

In [58]:
#@markdown ##Play to download the YAML configuration file and update it to train the model
import errno

os.chdir('/content/')

job_name = model_name
yaml_file = "/content/"+str(job_name)+".yaml"

# remove previous configuration file if it exists with the same name
if os.path.exists( yaml_file ):
    os.remove( yaml_file )

# Copy the template file from the cloned BiaPy repository
import shutil
shutil.copy("/content/BiaPy/templates/classification/classification_2d.yaml", yaml_file)

# Check folders before modifying the .yaml file
if not os.path.exists(train_data_path):
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), train_data_path)
ids = sorted(next(os.walk(train_data_path))[1])
if len(ids) == 0:
    raise ValueError("No folders found in dir {}".format(train_data_path))

# The validation folder is only needed when validation is not extracted from train
if not validation_from_train:
    if not os.path.exists(val_data_path):
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), val_data_path)
    ids = sorted(next(os.walk(val_data_path))[1])
    if len(ids) == 0:
        raise ValueError("No folders found in dir {}".format(val_data_path))

if not os.path.exists(test_data_path):
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), test_data_path)
ids = sorted(next(os.walk(test_data_path))[1])
if len(ids) == 0:
    raise ValueError("No folders found in dir {}".format(test_data_path))


# open template configuration file
import yaml
with open( yaml_file, 'r') as stream:
    try:
        biapy_config = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        # Stop here: the updates below need a successfully loaded configuration
        print(exc)
        raise

# update paths to data
biapy_config['DATA']['TRAIN']['PATH'] = train_data_path
biapy_config['DATA']['VAL']['PATH'] = val_data_path
biapy_config['DATA']['TEST']['PATH'] = test_data_path
biapy_config['DATA']['TEST']['LOAD_GT'] = test_ground_truth

# update data patch size
biapy_config['DATA']['PATCH_SIZE'] = '('+str(patch_size)+', '+ str(patch_size)+', ' + str(input_channels)+')'
# adjust test padding accordingly
padding = patch_size // 8
biapy_config['DATA']['TEST']['PADDING'] = '('+str(padding)+', '+ str(padding)+')'

# update training parameters
biapy_config['DATA']['VAL']['FROM_TRAIN'] = validation_from_train
biapy_config['DATA']['VAL']['SPLIT_TRAIN'] = percentage_validation/100.0
biapy_config['TRAIN']['EPOCHS'] = number_of_epochs
biapy_config['TRAIN']['PATIENCE'] = patience
biapy_config['TRAIN']['BATCH_SIZE'] = batch_size
biapy_config['TRAIN']['OPTIMIZER'] = optimizer
biapy_config['TRAIN']['LR'] = initial_learning_rate

# Transcribe model architecture
# Available models: "simple_cnn", "EfficientNetB0"
architecture = 'simple_cnn' if model_architecture == "simple_cnn" else 'EfficientNetB0'
biapy_config['MODEL']['N_CLASSES'] = number_of_classes
biapy_config['MODEL']['ARCHITECTURE'] = architecture

# save file
with open( yaml_file, 'w') as outfile:
    yaml.dump(biapy_config, outfile, default_flow_style=False)

print( "Training configuration finished.")

Training configuration finished.


In [59]:
import os

def contar_imagenes_en_directorio(directorio):
    """Count the image files in a directory, including all its subfolders."""
    contador = 0
    extensiones_validas = ['.jpg', '.png', '.jpeg', '.gif']  # Add more extensions if needed

    for carpeta, _, archivos in os.walk(directorio):
        for archivo in archivos:
            _, extension = os.path.splitext(archivo)
            if extension.lower() in extensiones_validas:
                contador += 1

    return contador

directorio = "/content/data/train"
cantidad_imagenes = contar_imagenes_en_directorio(directorio)
print(f"The directory '{directorio}' contains {cantidad_imagenes} images.")


The directory '/content/data/train' contains 2475 images.


In [48]:
import os

def contar_imagenes_en_carpeta(carpeta):
    """Count the image files directly inside a folder (no recursion)."""
    contador = 0
    extensiones_validas = ['.jpg', '.png', '.jpeg', '.gif']  # Add more extensions if needed

    for archivo in os.listdir(carpeta):
        _, extension = os.path.splitext(archivo)
        if extension.lower() in extensiones_validas:
            contador += 1

    return contador

directorio = "/content/data/train"

# Report the number of images in each class folder
for carpeta in os.listdir(directorio):
    ruta_carpeta = os.path.join(directorio, carpeta)
    if os.path.isdir(ruta_carpeta):
        cantidad_imagenes = contar_imagenes_en_carpeta(ruta_carpeta)
        print(f"Folder '{carpeta}' contains {cantidad_imagenes} images.")


Folder 'CHECQUERED SKIPPER' contains 95 images.
Folder 'PURPLISH COPPER' contains 92 images.
Folder 'BANDED PEACOCK' contains 83 images.
Folder 'SCARCE SWALLOW' contains 97 images.
Folder 'CLEOPATRA' contains 93 images.
Folder 'CRECENT' contains 97 images.
Folder 'ULYSES' contains 84 images.
Folder 'RED SPOTTED PURPLE' contains 86 images.
Folder 'PAINTED LADY' contains 78 images.
Folder 'CHESTNUT' contains 85 images.
Folder 'QUESTION MARK' contains 77 images.
Folder 'SOOTYWING' contains 90 images.
Folder 'COMMON BANDED AWL' contains 87 images.
Folder 'IPHICLUS SISTER' contains 95 images.
Folder 'ORCHARD SWALLOW' contains 76 images.
Folder 'RED ADMIRAL' contains 82 images.
Folder 'AN 88' contains 85 images.
Folder 'PURPLE HAIRSTREAK' contains 79 images.
Folder 'ORANGE OAKLEAF' contains 87 images.
Folder 'LARGE MARBLE' contains 81 images.
Folder 'RED CRACKER' contains 96 images.
Folder 'MALACHITE' 

In [49]:
import os
import random
import shutil

def contar_imagenes_en_carpeta(carpeta):
    """Count the image files directly inside a folder (no recursion)."""
    contador = 0
    extensiones_validas = ['.jpg', '.png', '.jpeg', '.gif']  # Add more extensions if needed

    for archivo in os.listdir(carpeta):
        _, extension = os.path.splitext(archivo)
        if extension.lower() in extensiones_validas:
            contador += 1

    return contador

def eliminar_imagenes_excedentes(carpeta, objetivo):
    """Randomly delete files from a folder until at most `objetivo` images remain."""
    imagenes_actuales = contar_imagenes_en_carpeta(carpeta)
    excedentes = imagenes_actuales - objetivo

    if excedentes <= 0:
        return

    # The sample is drawn from every entry, so the folder is assumed to contain only images
    imagenes_a_eliminar = random.sample(os.listdir(carpeta), excedentes)
    for imagen in imagenes_a_eliminar:
        ruta_imagen = os.path.join(carpeta, imagen)
        os.remove(ruta_imagen)

directorio = "/content/data/train"
objetivo_total_imagenes = 2500
# Target per class folder: the total target divided evenly among the folders
objetivo_por_carpeta = objetivo_total_imagenes // len(os.listdir(directorio))

for carpeta in os.listdir(directorio):
    ruta_carpeta = os.path.join(directorio, carpeta)
    if os.path.isdir(ruta_carpeta):
        eliminar_imagenes_excedentes(ruta_carpeta, objetivo_por_carpeta)

cantidad_total_imagenes = sum(contar_imagenes_en_carpeta(os.path.join(directorio, carpeta)) for carpeta in os.listdir(directorio))
print(f"A total of approximately {cantidad_total_imagenes} images has been reached.")


A total of approximately 2475 images has been reached.


In [50]:
# Report the number of images left in each class folder after the reduction
for carpeta in os.listdir(directorio):
    ruta_carpeta = os.path.join(directorio, carpeta)
    if os.path.isdir(ruta_carpeta):
        cantidad_imagenes = contar_imagenes_en_carpeta(ruta_carpeta)
        print(f"Folder '{carpeta}' contains {cantidad_imagenes} images.")

Folder 'CHECQUERED SKIPPER' contains 33 images.
Folder 'PURPLISH COPPER' contains 33 images.
Folder 'BANDED PEACOCK' contains 33 images.
Folder 'SCARCE SWALLOW' contains 33 images.
Folder 'CLEOPATRA' contains 33 images.
Folder 'CRECENT' contains 33 images.
Folder 'ULYSES' contains 33 images.
Folder 'RED SPOTTED PURPLE' contains 33 images.
Folder 'PAINTED LADY' contains 33 images.
Folder 'CHESTNUT' contains 33 images.
Folder 'QUESTION MARK' contains 33 images.
Folder 'SOOTYWING' contains 33 images.
Folder 'COMMON BANDED AWL' contains 33 images.
Folder 'IPHICLUS SISTER' contains 33 images.
Folder 'ORCHARD SWALLOW' contains 33 images.
Folder 'RED ADMIRAL' contains 33 images.
Folder 'AN 88' contains 33 images.
Folder 'PURPLE HAIRSTREAK' contains 33 images.
Folder 'ORANGE OAKLEAF' contains 33 images.
Folder 'LARGE MARBLE' contains 33 images.
Folder 'RED CRACKER' contains 33 images.
Folder 'MALACHITE' 

In [60]:
#@markdown ##Play to train the model

import os
import errno

# Run the code
os.chdir('/content/BiaPy')
!python -u main.py --config '/content/'{job_name}'.yaml' --result_dir {output_path} --name {job_name} --run_id 1 --gpu 0

Date: 2023-07-27 11:29:10
Arguments: Namespace(config='/content/my_2d_classification_reduce.yaml', result_dir='/content/output', name='my_2d_classification_reduce', run_id=1, gpu='0')
Job: my_2d_classification_reduce_1
Python       : 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
Keras        : 2.12.0
Tensorflow   : 2.12.0
Num GPUs Available:  0
Configuration details:
AUGMENTOR:
  AFFINE_MODE: constant
  AUG_NUM_SAMPLES: 10
  AUG_SAMPLES: True
  BRIGHTNESS: False
  BRIGHTNESS_EM: False
  BRIGHTNESS_EM_FACTOR: (-0.1, 0.1)
  BRIGHTNESS_EM_MODE: 3D
  BRIGHTNESS_FACTOR: (-0.1, 0.1)
  BRIGHTNESS_MODE: 3D
  CBLUR_DOWN_RANGE: (2, 8)
  CBLUR_INSIDE: True
  CBLUR_SIZE: (0.2, 0.4)
  CHANNEL_SHUFFLE: False
  CMIX_SIZE: (0.2, 0.4)
  CNOISE_NB_ITERATIONS: (1, 3)
  CNOISE_SCALE: (0.05, 0.1)
  CNOISE_SIZE: (0.2, 0.4)
  CONTRAST: False
  CONTRAST_EM: False
  CONTRAST_EM_FACTOR: (-0.1, 0.1)
  CONTRAST_EM_MODE: 3D
  CONTRAST_FACTOR: (-0.1, 0.1)
  CONTRAST_MODE: 3D
  COUT_APPLY_TO_MASK: False
  COUT_C