# 🩺 01_Data_Preparation

## 📝 Description
This notebook is the **first stage of the pipeline** for detecting tampered medical images. It **validates the original dataset**, **generates new samples through data augmentation**, and **converts the images to a standard, readable format (PNG/JPG)** for use in later stages.

---

## 🎯 Objectives
✅ Validate the integrity of the original dataset (files present, readable, and correctly labelled).  
✅ Generate new samples to enlarge the dataset and reduce overfitting.  
✅ Convert DICOM images to a standard format and normalize their pixel values.  
✅ Prepare the dataset for the feature extraction and training stages.
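
The integrity check in the first objective could be sketched as follows. This is a minimal, standard-library-only sketch rather than the notebook's own validation: `has_dicom_magic` and `find_invalid_dicoms` are hypothetical helpers, the `Experiments/<uuid>/<slice>.dcm` layout is assumed from the code cells of this notebook, and the check only looks for the `DICM` marker that DICOM Part 10 files carry after a 128-byte preamble instead of parsing each file.

```python
import os

def has_dicom_magic(path):
    """Return True if the file carries the 'DICM' marker at byte offset 128."""
    try:
        with open(path, 'rb') as fh:
            fh.seek(128)
            return fh.read(4) == b'DICM'
    except OSError:
        return False

def find_invalid_dicoms(rows, input_dir='Experiments'):
    """
    rows: iterable of (uuid, slice) pairs (hypothetical input shape).
    Returns two lists of paths: missing files and files that do not look like DICOM.
    """
    missing, unreadable = [], []
    for uuid, slice_num in rows:
        path = os.path.join(input_dir, str(uuid), f"{int(slice_num)}.dcm")
        if not os.path.isfile(path):
            missing.append(path)
        elif not has_dicom_magic(path):
            unreadable.append(path)
    return missing, unreadable
```

With a DataFrame, the pairs could come from `labels[['uuid', 'slice']].dropna().itertuples(index=False)`. A stricter check would read each file with `pydicom.dcmread` and access `pixel_array`.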

In [None]:
import pandas as pd
import os
import glob
import random
import cv2
import numpy as np
import pydicom

In [None]:

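def create_labels_TB(df, n, seed=None):
    '''
    Assign a random slice index to every TB row, modifying df in place.

    The index is drawn from [n, max_slice - n], leaving a margin of n slices
    at each end of the series.
    '''
    # Seed once, outside the loop: reseeding on every row restarts the
    # generator from the same state for each series.
    rng = random.Random(seed)
    for idx, row in df.loc[df['type'] == 'TB'].iterrows():
        folder = os.path.join('Experiments', str(row.uuid))
        files = glob.glob(os.path.join(folder, '*.dcm'))
        # File names are expected to be the slice number, e.g. "42.dcm"
        slice_ids = [int(os.path.basename(f).split('.')[0]) for f in files]
        if not slice_ids or max(slice_ids) - n < n:
            print(f"Not enough slices in {folder}; row {idx} left unchanged")
            continue
        df.loc[idx, 'slice'] = rng.randint(n, max(slice_ids) - n)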

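def add_labels(df, n=1, st=1):
    '''
    Expand the dataframe with neighbouring slices of each labelled slice.

    For i in 0..n-1, every row is copied twice with its slice shifted by
    -(1 + i*st) and +(1 + i*st). Originals get generated=0, copies get
    generated=1, so the result has (1 + 2*n) times as many rows.
    '''
    original = df.copy()
    original['generated'] = 0
    frames = [original]
    for i in range(n):
        offset = 1 + i * st
        for sign in (-1, 1):
            neighbour = df.copy()
            neighbour['slice'] = neighbour['slice'] + sign * offset
            neighbour['generated'] = 1
            frames.append(neighbour)
    # Concatenate once at the end instead of growing the frame inside the loop
    return pd.concat(frames, ignore_index=True)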

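def transfor_image(img, seed=None):
    '''
    Apply a random augmentation to an 8-bit image: a small rotation and
    down-scaling about the centre, Gaussian noise, and per-pixel integer jitter.
    '''
    # Local generator: avoids reseeding NumPy's global random state on every call
    rng = np.random.default_rng(seed)
    degrees = float(rng.integers(-7, 7))  # rotation angle in [-7, 6] degrees
    scale = rng.uniform(0.9, 1.0)
    center = (img.shape[1] // 2, img.shape[0] // 2)
    M = cv2.getRotationMatrix2D(center, degrees, scale)
    img = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]), flags=cv2.INTER_LINEAR)

    # Noise is added in float32 and clipped back to the uint8 range afterwards
    std = img.std()
    gauss = rng.normal(0, std / 100, img.shape).astype(np.float32)
    beta = rng.integers(-10, 10, img.shape).astype(np.float32)  # jitter in [-10, 9]

    img = img + gauss
    img = img + beta

    img = np.clip(img, 0, 255).astype(np.uint8)
    return img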

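def dcm_a_png(dcm_path, png_path, window_min=-1000, window_max=400, transform=False, seed=None):
    '''
    Convert a DICOM file to an 8-bit PNG using an intensity window,
    optionally applying the random augmentation from transfor_image.
    '''
    try:
        ds = pydicom.dcmread(dcm_path)
        img = ds.pixel_array.astype(np.float32)
        # Map stored values to Hounsfield units when the rescale tags are
        # present (assumption: CT data; the defaults leave values unchanged)
        img = img * float(getattr(ds, 'RescaleSlope', 1)) + float(getattr(ds, 'RescaleIntercept', 0))
        # Clip to the window and stretch it linearly to [0, 255]
        img = np.clip(img, window_min, window_max)
        img = ((img - window_min) / (window_max - window_min)) * 255.0
        img = img.astype(np.uint8)
        os.makedirs(os.path.dirname(png_path), exist_ok=True)
        if transform:
            img = transfor_image(img, seed=seed)  # Use the caller's seed instead of a fixed 42
        if not cv2.imwrite(png_path, img):
            print(f"Could not write {png_path}")
    except Exception as e:
        print(f"Error processing {dcm_path}: {e}")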

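def convertir_df_dcm_a_png(df, input_dir='Experiments', output_dir='output_png',
                           window_min=-1000, window_max=400, seed=42):
    '''
    Convert every (uuid, slice) row of the DataFrame from DICOM to PNG.
    Rows with generated == 1 are augmented with a random transformation.
    '''
    for idx, row in df.iterrows():
        if pd.isna(row['slice']):
            print(f"Row {idx} has no slice; skipped")
            continue
        uuid = str(row['uuid'])
        # Cast to int: a float column would otherwise yield names like "42.0.dcm"
        slice_num = int(row['slice'])
        transform = row['generated'] == 1
        dcm_file = os.path.join(input_dir, uuid, f"{slice_num}.dcm")
        png_file = os.path.join(output_dir, uuid, f"{slice_num}.png")
        if os.path.exists(dcm_file):
            # Offset the seed by the row index so each generated image gets a
            # different, still reproducible, transformation
            row_seed = None if seed is None else seed + idx
            dcm_a_png(dcm_file, png_file, window_min, window_max,
                      transform=transform, seed=row_seed)
        else:
            print(f"Not found: {dcm_file}")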
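
The windowing step in `dcm_a_png` is a clip followed by a linear stretch. As a worked example of that arithmetic, the sketch below applies the same formula to single values with the default window of -1000 to 400. `window_to_uint8` is a hypothetical scalar helper written only for illustration, and `int()` stands in for the truncating `astype(np.uint8)` cast.

```python
def window_to_uint8(value, window_min=-1000, window_max=400):
    """Clip a value to [window_min, window_max] and map it linearly to 0..255."""
    clipped = min(max(value, window_min), window_max)
    scaled = (clipped - window_min) / (window_max - window_min) * 255.0
    return int(scaled)  # truncates the fractional part, as the uint8 cast does here

for v in (-2000, -1000, -300, 400, 1500):
    print(v, '->', window_to_uint8(v))
```

Values at or below `window_min` map to 0, values at or above `window_max` map to 255, and the window midpoint of -300 maps to 127 (127.5 before truncation).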

In [None]:
labels = pd.read_csv('labels.csv', sep=';')  # Load the original dataset

# Class tag: 1 for types starting with 'F', 0 for types starting with 'T', '' otherwise
labels['tag'] = labels['type'].apply(lambda x: 1 if x.startswith('F') else (0 if x.startswith('T') else ''))

print("Original size:", labels.shape)
n = 1  # Margin for valid TB slices and number of neighbours per side
create_labels_TB(labels, n, seed=42)  # Assign valid slices to the TB rows
labels = add_labels(labels, n, st=2)  # Expand every row with n neighbours per side, spaced by st
#labels = expandir_tb_con_vecinos(labels, n_vecinos=1)  # Balance by adding more neighbours to TB (does not duplicate arbitrarily)

print("Final size:", labels.shape)
print(labels['tag'].value_counts())
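
How `add_labels` grows the dataset can be checked on plain integers. The sketch below lists the slice offsets produced for a given `n` and `st`, using the same `1 + i*st` rule. `neighbour_offsets` is a hypothetical helper for illustration and is not part of the notebook.

```python
def neighbour_offsets(n=1, st=1):
    """Offsets applied to each slice: -(1 + i*st) and +(1 + i*st) for i in 0..n-1."""
    offsets = []
    for i in range(n):
        step = 1 + i * st
        offsets.extend([-step, step])
    return offsets

print(neighbour_offsets(n=1, st=2))  # [-1, 1]
print(neighbour_offsets(n=3, st=2))  # [-1, 1, -3, 3, -5, 5]
```

With `n = 1` and `st = 2`, as in the cell above, each row gains two neighbours, so the dataset triples. For larger `n` the farthest offset is `1 + (n - 1) * st`, which can exceed the margin of `n` slices that `create_labels_TB` leaves at each end; such rows are reported as not found during conversion.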

In [None]:
labels.to_csv('labels_temp.csv', index=False)

In [None]:
convertir_df_dcm_a_png(labels, input_dir='Experiments', output_dir='Experiments-png')
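
After the conversion it can be useful to confirm which PNG files were actually written, since slices without a DICOM file are only printed. A minimal sketch, assuming the `<output_dir>/<uuid>/<slice>.png` layout with integer slice names; `missing_pngs` is a hypothetical helper.

```python
import os

def missing_pngs(rows, output_dir='Experiments-png'):
    """
    rows: iterable of (uuid, slice) pairs (hypothetical input shape).
    Returns the expected PNG paths that do not exist on disk.
    """
    expected = [os.path.join(output_dir, str(uuid), f"{int(s)}.png") for uuid, s in rows]
    return [p for p in expected if not os.path.isfile(p)]
```

For example, `missing_pngs(labels[['uuid', 'slice']].dropna().itertuples(index=False))` could list the rows to drop from `labels_temp.csv` before the next stage.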