This file builds a split of the image dataset together with the matching label assignments in CSV. It produces a training subset with 100 images per class and a test subset with 10 images per class.

First, the required libraries are imported.

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import shutil

In [2]:
RANDOM_STATE = 42  # Fixed seed so the sampling is reproducible

## Dataset

Next comes the code that splits the original dataset into a subset with 100 training images and 10 test images per class. The CSV lists 9 classes, but only 8 are used because the 'UNK' class is not assigned to any image, so the split holds 8 × 110 = 880 images: 800 for training and 80 for test. Besides splitting the image set ($X$), it creates the matching subset ($y$) of the CSV rows that label each image.

In [3]:
df = pd.read_csv("full_dataset/ISIC_2019_Training_GroundTruth.csv")
print(df.columns)
print(df["UNK"].sum())  # Sums to 0: no image is labeled UNK
df = df.drop("UNK", axis=1)
df

Index(['image', 'MEL', 'NV', 'BCC', 'AK', 'BKL', 'DF', 'VASC', 'SCC', 'UNK'], dtype='object')
0.0


Unnamed: 0,image,MEL,NV,BCC,AK,BKL,DF,VASC,SCC
0,ISIC_0000000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ISIC_0000001,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ISIC_0000002,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ISIC_0000003,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ISIC_0000004,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
25326,ISIC_0073247,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25327,ISIC_0073248,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25328,ISIC_0073249,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25329,ISIC_0073251,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
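
Sampling 110 images per class only works if every class has at least 110 labeled rows, because `DataFrame.sample` raises a `ValueError` when asked for more rows than exist (unless `replace=True`). The following is a minimal sketch of how that could be checked before splitting. The helper name `classes_with_too_few_samples` and the toy table are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def classes_with_too_few_samples(labels, class_columns, needed=110):
    """Return {class_name: count} for classes with fewer than `needed` positive rows."""
    counts = labels[list(class_columns)].sum().astype(int)
    return {name: int(n) for name, n in counts.items() if n < needed}


# Tiny synthetic table standing in for the ground-truth CSV (made-up values).
toy = pd.DataFrame({
    "image": ["a", "b", "c", "d"],
    "MEL": [1.0, 0.0, 0.0, 1.0],
    "NV": [0.0, 1.0, 0.0, 0.0],
})

# With needed=2, MEL (2 rows) passes and NV (1 row) is reported.
too_few = classes_with_too_few_samples(toy, ["MEL", "NV"], needed=2)
print(too_few)
```

On the real data this would be called as `classes_with_too_few_samples(df, df.columns[1:])`, and an empty dictionary would mean every class can supply 110 images.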


## Define the correct structure

Next, the directories needed to store the correctly structured dataset are created. That is, the data is split into two folders, train and test, and each one gets a subfolder for every class in the dataset.

In [4]:
# Create the required folders

CWD = os.getcwd()
DATASET_PATH = os.path.join(CWD, "dataset")
TRAIN_PATH = os.path.join(DATASET_PATH, "train")
TEST_PATH = os.path.join(DATASET_PATH, "test")

target_names = df.columns[1:]  # All class names: MEL, NV, BCC, ...

try:
    os.makedirs(TRAIN_PATH)
    os.makedirs(TEST_PATH)

    # Create one folder per class in both train and test
    for c in target_names:
        os.mkdir(os.path.join(TRAIN_PATH, c))
        os.mkdir(os.path.join(TEST_PATH, c))

    print("Directories created successfully")

except Exception as e:
    print(f"An error occurred: {e}")

An error occurred: [WinError 183] Cannot create a file when that file already exists: 'c:\\Users\\gonza\\Desktop\\universidad\\Malaga\\Tercero\\Aprendizaje_Automatico\\practicas-aprendizaje\\practica4\\dataset\\train'
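
The error in the output above appears because the `train` folder was left over from an earlier run. Note that when the first `os.makedirs` call fails, the class subfolders are not created in that run either. A possible alternative, sketched here under the assumption that re-running the cell should be harmless, is to pass `exist_ok=True` to `os.makedirs`. The function name `create_split_dirs` is hypothetical, and the demo uses a temporary directory instead of the real `dataset` folder.

```python
import os
import tempfile


def create_split_dirs(root, class_names, splits=("train", "test")):
    """Create root/<split>/<class> for every combination; safe to call repeatedly."""
    created = []
    for split in splits:
        for name in class_names:
            path = os.path.join(root, split, name)
            # exist_ok=True: no error if the folder is already there
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


# Demo in a throwaway directory so the real dataset folder is untouched.
with tempfile.TemporaryDirectory() as tmp:
    classes = ["MEL", "NV", "BCC"]
    first = create_split_dirs(tmp, classes)
    second = create_split_dirs(tmp, classes)  # second run does not raise
    all_exist = all(os.path.isdir(p) for p in second)
```

In the notebook this would correspond to `create_split_dirs(DATASET_PATH, target_names)`.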


Finally, each directory is filled with the images of the class it represents.

In [5]:
# Fill the folders with the images
for c in target_names:

    # Randomly pick 110 examples of class c
    clase_110 = df[df[c] == 1].sample(110, random_state=RANDOM_STATE)

    # The first 100 go to train
    for image_id in clase_110["image"].head(100):
        shutil.copy(
            f"full_dataset/images/{image_id}.jpg",
            f"dataset/train/{c}/{image_id}.jpg"
        )

    # The last 10 go to test
    for image_id in clase_110["image"].tail(10):
        shutil.copy(
            f"full_dataset/images/{image_id}.jpg",
            f"dataset/test/{c}/{image_id}.jpg"
        )
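
The loop above copies the images ($X$), while the labels are only implied by the folder names. If the label subset ($y$) is also wanted as a table, one option is to repeat the same sampling and keep the rows instead of copying files. This is a rough sketch and not the notebook's own code: `split_labels` and the toy data are hypothetical, and it assumes the same seed and the same filtered DataFrame, so that `sample` selects the same rows as the copy loop.

```python
import pandas as pd


def split_labels(labels, class_columns, n_train=100, n_test=10, random_state=42):
    """Sample n_train + n_test rows per class and return (train, test) label tables."""
    train_parts, test_parts = [], []
    for name in class_columns:
        sampled = labels[labels[name] == 1].sample(
            n_train + n_test, random_state=random_state
        )
        train_parts.append(sampled.head(n_train))  # first rows go to train
        test_parts.append(sampled.tail(n_test))    # last rows go to test
    return pd.concat(train_parts), pd.concat(test_parts)


# Toy stand-in for the ground-truth table: 3 images per class (made-up values).
toy = pd.DataFrame({
    "image": [f"img_{i}" for i in range(6)],
    "A": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    "B": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
})

train_y, test_y = split_labels(toy, ["A", "B"], n_train=2, n_test=1)
print(len(train_y), len(test_y))
```

The two tables could then be saved with `DataFrame.to_csv`, for example under file names of your choosing inside `dataset/`.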
    