# Restructuring the original dataset


**Prerequisites:**  
All data extracted from the COVID-QU-Ex dataset must be stored in the folder: data/raw  
The archive downloaded from https://www.kaggle.com/datasets/anasmohammedtahir/covidqu contains:  
- folder: Infection Segmentation Data
- folder: Lung Segmentation Data
- file: COVID-QU-Ex dataset.txt

Only the Lung Segmentation Data folder is used here, since it contains the complete dataset.  

The folder "data\raw\Lung Segmentation Data\Lung Segmentation Data" therefore contains 3 folders: Test, Train, Val,  
each split into: COVID-19, Non-COVID, Normal,  
themselves split into: images, lung masks  
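
Before running the notebook, it can help to confirm that the extracted archive really has this layout. The following is a minimal sketch and not part of the original notebook: `find_missing_folders` is a hypothetical helper, and the demonstration runs on a temporary directory rather than on the real dataset.

```python
import os
import tempfile

# Expected layout, taken from the description above.
SPLITS = ["Test", "Train", "Val"]
CLASSES = ["COVID-19", "Non-COVID", "Normal"]
FILE_TYPES = ["images", "lung masks"]


def find_missing_folders(raw_root: str) -> list[str]:
    """Return the expected split/class/type folders that are absent under raw_root."""
    missing = []
    for split in SPLITS:
        for cls in CLASSES:
            for file_type in FILE_TYPES:
                path = os.path.join(raw_root, split, cls, file_type)
                if not os.path.isdir(path):
                    missing.append(path)
    return missing


# Demonstration on a throwaway directory (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    missing_before = find_missing_folders(tmp)  # nothing created yet
    for split in SPLITS:
        for cls in CLASSES:
            for file_type in FILE_TYPES:
                os.makedirs(os.path.join(tmp, split, cls, file_type))
    missing_after = find_missing_folders(tmp)  # full layout now present

print(len(missing_before), len(missing_after))
```

On the real data, the function would be called with the path to "Lung Segmentation Data\Lung Segmentation Data"; an empty result means every expected folder is present.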


**What the notebook does:**  
starting from the complete "Lung Segmentation Data" dataset stored in data/raw:  
- gathers the .png files scattered across the 3 folders Test, Train, Val into a single folder tree,  
- and stores them in data/processed.  

This results in the following structure in data/processed/Lung Segmentation Data: 

- COVID-19
    - images
    - lung masks
- Non-COVID
    - images
    - lung masks
- Normal
    - images
    - lung masks

In [1]:
import itertools
import pandas as pd
import os
import shutil


# Folder handling, relative to the notebook folder
notebook_dir = os.path.abspath(".")
print("notebooks folder:", notebook_dir)

relative_raw_data_path = os.path.join("..", "data", "raw", "Lung Segmentation Data", "Lung Segmentation Data")
raw_data_path = os.path.abspath(relative_raw_data_path)
print("data/raw folder:", raw_data_path)

relative_processed_root_path = os.path.join("..", "data", "processed")
processed_root_path = os.path.abspath(relative_processed_root_path)
print("data/processed folder:", processed_root_path)

processed_data_path = os.path.join(processed_root_path, "Lung Segmentation Data")
print("data/processed/Lung Segmentation Data folder:", processed_data_path)


# Folder levels of the data:
raw_dataset_folders = ["Test", "Train", "Val"]
patho_subfolders = ["COVID-19", "Non-COVID", "Normal"]
png_file_types = ["images", "lung masks"]


notebooks folder: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\notebooks
data/raw folder: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentation Data\Lung Segmentation Data
data/processed folder: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed
data/processed/Lung Segmentation Data folder: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data


In [2]:
# Build the dictionary that maps each raw data folder to its processed data folder:
raw_processed_paths_dict = {}

for folder in raw_dataset_folders:
    for subfolder in patho_subfolders:
        for file_type in png_file_types:
            raw_folder_path = os.path.join(raw_data_path, folder, subfolder, file_type)
            processed_folder_path = os.path.join(processed_data_path, subfolder, file_type)
            raw_processed_paths_dict[raw_folder_path] = processed_folder_path


for raw_path, processed_path in raw_processed_paths_dict.items():
    print(f"Raw path: {raw_path} --> Processed path: {processed_path}")


Raw path: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentation Data\Lung Segmentation Data\Test\COVID-19\images --> Processed path: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\COVID-19\images
Raw path: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentation Data\Lung Segmentation Data\Test\COVID-19\lung masks --> Processed path: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\COVID-19\lung masks
Raw path: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentation Data\Lung Segmentation Data\Test\Non-COVID\images --> Processed path: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\Non-COVID\images
Raw path: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentati
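
The mapping above is many-to-one: the three splits of a given class and file type all point to the same destination folder. The sketch below makes that explicit by grouping sources by destination. It uses placeholder root paths (`data/raw`, `data/processed`) and its own variable names, which are assumptions for illustration and not the notebook's actual paths.

```python
import itertools
import os
from collections import defaultdict

splits = ["Test", "Train", "Val"]
classes = ["COVID-19", "Non-COVID", "Normal"]
file_types = ["images", "lung masks"]

# Placeholder roots, for illustration only.
raw_root = os.path.join("data", "raw")
processed_root = os.path.join("data", "processed")

# Same idea as the nested loops above: the split level is dropped on the way out.
mapping = {
    os.path.join(raw_root, s, c, t): os.path.join(processed_root, c, t)
    for s, c, t in itertools.product(splits, classes, file_types)
}

# Group source folders by destination to make the merge explicit.
sources_by_destination = defaultdict(list)
for src, dst in mapping.items():
    sources_by_destination[dst].append(src)

print(len(mapping), "source folders ->", len(sources_by_destination), "destination folders")
```

Because several sources share one destination, any file name that appears in more than one split would collide at copy time, which is why the copy step further down checks for existing files.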

In [3]:
def create_folder(parent_folder: str, folder_name: str) -> None:
    """Create the folder 'folder_name' inside 'parent_folder'.
    If the folder already exists, it is not re-created.
    """
    folder_path = os.path.join(parent_folder, folder_name)

    if not os.path.exists(folder_path):
        try:
            os.makedirs(folder_path)
            print(f"Folder '{folder_path}' created successfully.")
        except OSError as e:
            print(f"Error creating folder '{folder_path}': {e}")
    else:
        print(f"Folder '{folder_path}' already exists.")
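
If the per-folder messages are not needed, the existence check can be delegated to `os.makedirs` itself through its `exist_ok` parameter. This is only a sketch of that alternative: `ensure_folder` is a hypothetical name, and the demonstration uses a temporary directory.

```python
import os
import tempfile


def ensure_folder(parent_folder: str, folder_name: str) -> str:
    """Create parent_folder/folder_name if needed and return its path."""
    folder_path = os.path.join(parent_folder, folder_name)
    # exist_ok=True makes the call a no-op when the directory is already there;
    # missing intermediate folders are created as well.
    os.makedirs(folder_path, exist_ok=True)
    return folder_path


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_folder(tmp, os.path.join("COVID-19", "images"))
    second = ensure_folder(tmp, os.path.join("COVID-19", "images"))  # no error
    created = os.path.isdir(first)
```

The trade-off is that this version stays silent, whereas `create_folder` reports whether each folder was created or already present.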



In [4]:
# Create the new folders

create_folder(processed_root_path, "Lung Segmentation Data")

for pat, typ in itertools.product(patho_subfolders, png_file_types):

    patho_path = os.path.join(processed_data_path, pat)

    create_folder(processed_data_path, pat)
    create_folder(patho_path, typ)


Folder 'c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data' created successfully.
Folder 'c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\COVID-19' created successfully.
Folder 'c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\COVID-19\images' created successfully.
Folder 'c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\COVID-19' already exists.
Folder 'c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\COVID-19\lung masks' created successfully.
Folder 'c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\Non-COVID' created successfully.
Folder 'c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\Non-CO

In [5]:
def copy_png_files(source_folder: str, destination_folder: str) -> pd.DataFrame:
    """Copy every .png file, with its metadata, from 'source_folder'
    to 'destination_folder'.
    Return a dataframe of the files that could not be copied because they
    already existed in the destination folder (duplicates).
    """
    png_files = [f for f in os.listdir(source_folder) if f.endswith('.png')]

    already_exist_files = []  # files skipped because they already exist in the destination folder

    for png_file in png_files:
        source_file_path = os.path.join(source_folder, png_file)
        destination_file_path = os.path.join(destination_folder, png_file)

        # If the file already exists in the destination folder: record it and move on without copying
        if os.path.exists(destination_file_path):
            print(f"Warning: File '{png_file}' already exists in '{destination_folder}'. Skipping...")
            already_exist_files.append({'file_name': png_file, 'destination_folder': destination_folder})
            continue

        # Copy with metadata
        shutil.copy2(source_file_path, destination_file_path)
        # print(f"File '{png_file}' copied to '{destination_folder}'.")

    already_exists_files_df = pd.DataFrame(already_exist_files)

    return already_exists_files_df



In [6]:
# Copy the .png files

for key, value in raw_processed_paths_dict.items():
    print(f"Source: {key}, Destination: {value}")

    copy_png_files(key, value)
    

Source: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentation Data\Lung Segmentation Data\Test\COVID-19\images, Destination: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\COVID-19\images
Source: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentation Data\Lung Segmentation Data\Test\COVID-19\lung masks, Destination: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\COVID-19\lung masks
Source: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentation Data\Lung Segmentation Data\Test\Non-COVID\images, Destination: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\processed\Lung Segmentation Data\Non-COVID\images
Source: c:\Users\Florent\Documents\data_science\MAR24_BDS_Radios_Pulmonaire\data\raw\Lung Segmentation Data\Lung Segmentation 
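
The loop above calls `copy_png_files` but does not keep what it returns, so skipped duplicates are only visible through the printed warnings. The sketch below shows one way the skipped files could be accumulated across all source folders. It is a simplified, standard-library-only stand-in: `copy_png_files_sketch` is a hypothetical function that returns a plain list instead of a dataframe, and the file names are made up for the demonstration.

```python
import os
import shutil
import tempfile


def copy_png_files_sketch(source_folder: str, destination_folder: str) -> list[str]:
    """Copy .png files and return the names skipped because they already exist."""
    skipped = []
    for name in sorted(os.listdir(source_folder)):
        if not name.endswith(".png"):
            continue
        destination = os.path.join(destination_folder, name)
        if os.path.exists(destination):
            skipped.append(name)
            continue
        shutil.copy2(os.path.join(source_folder, name), destination)
    return skipped


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical source splits that share one file name.
    for split, names in {"Test": ["a.png", "b.png"], "Train": ["b.png", "c.png"]}.items():
        os.makedirs(os.path.join(tmp, split))
        for name in names:
            with open(os.path.join(tmp, split, name), "wb") as handle:
                handle.write(b"placeholder")
    merged_folder = os.path.join(tmp, "merged")
    os.makedirs(merged_folder)

    # Keep the return value of every call instead of discarding it.
    all_skipped = []
    for split in ["Test", "Train"]:
        source = os.path.join(tmp, split)
        all_skipped += [(source, name) for name in copy_png_files_sketch(source, merged_folder)]

    skipped_names = [name for _, name in all_skipped]
    merged_names = sorted(os.listdir(merged_folder))
```

With the notebook's own function, the same idea would amount to appending each returned dataframe to a list inside the loop and inspecting the list afterwards.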

In [7]:
def list_files_in_subfolders(folder: str):
    """List all files in a folder and in all of its subfolders.
    Return a list of file names (without their paths)."""
    file_list = []

    for root, dirs, files in os.walk(folder):
        for file in files:
            file_list.append(file)

    return file_list



In [8]:
# Compare the contents of the raw and processed data (by file name only)

raw_files = list_files_in_subfolders(raw_data_path)
processed_files = list_files_in_subfolders(processed_data_path)

# A set makes each membership test O(1) instead of scanning the whole list
processed_file_names = set(processed_files)
raw_not_in_processed = [x for x in raw_files if x not in processed_file_names]

print("N raw files missing from processed:", len(raw_not_in_processed))
print("N files in raw:", len(raw_files))
print("N files in processed:", len(processed_files))

N raw files missing from processed: 0
N files in raw: 67840
N files in processed: 67840
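
The check above compares bare file names, so it cannot tell an image from a lung mask if both carry the same name. A stricter variant compares paths relative to each root, with the Test/Train/Val level removed on the raw side. The sketch below illustrates this on a tiny made-up tree; `relative_file_paths` and `drop_split` are hypothetical helpers, and whether images and masks actually share names in the dataset is an assumption here.

```python
import os
import tempfile
from collections import Counter


def relative_file_paths(folder: str) -> list[str]:
    """List every file under folder as a path relative to folder."""
    paths = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            paths.append(os.path.relpath(os.path.join(root, name), folder))
    return paths


def drop_split(relative_path: str) -> str:
    """Remove the leading Test/Train/Val component from a raw relative path."""
    parts = relative_path.split(os.sep)
    return os.path.join(*parts[1:])


with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw")
    processed = os.path.join(tmp, "processed")
    layout = {
        raw: ["Test/Normal/images/x.png", "Test/Normal/lung masks/x.png",
              "Train/Normal/images/y.png"],
        processed: ["Normal/images/x.png", "Normal/images/y.png"],  # mask missing
    }
    for base, files in layout.items():
        for rel in files:
            path = os.path.join(base, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "wb").close()

    # Path-based comparison: multisets of class/type/name paths.
    expected = Counter(drop_split(p) for p in relative_file_paths(raw))
    actual = Counter(relative_file_paths(processed))
    missing = sorted((expected - actual).elements())

    # Name-only comparison does not notice the missing mask.
    raw_names = {os.path.basename(p) for p in relative_file_paths(raw)}
    processed_names = {os.path.basename(p) for p in relative_file_paths(processed)}
    missing_by_name = sorted(raw_names - processed_names)
```

In the notebook's run, the equal file counts (67840 on both sides) already suggest that nothing was skipped; the path-based check would make that conclusion explicit per class and file type.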
