## Exploratory Data Analysis (EDA): Chest X-Rays

Objective: Explore and characterize two medical imaging datasets for classifying lung diseases (pneumonia and COVID-19).

This notebook covers:
- Identifying the folder structure and classes
- Class distribution and dataset size
- Image quality: dimensions, aspect ratio, brightness, contrast, and channels
- Checking for potential biases (class imbalance, per-class resolution differences, duplicates)
- Visualizing examples per class

Datasets:
- COVID-19 Radiography Database (Kaggle)
- Chest X-Ray Pneumonia (Kaggle)

Notes:
- Run the cells in order.
- Use a local virtual environment (`.venv`). Run the venv creation cell first, then select the “Python (.venv) CXR” kernel before continuing with the rest of the notebook.



In [36]:
# Create a local virtual environment (.venv), install dependencies, and register a Jupyter kernel (compatible with Python 3.13)
import sys, subprocess
from pathlib import Path

PROJECT_ROOT = Path("/Users/carlosmejia/Documents/Universidad/Octavo Semestre/Tecnicas de aprendizaje de maquina/Proyecto deteccion de enfermedades").resolve()
VENV_DIR = PROJECT_ROOT / ".venv"
PY = VENV_DIR / "bin" / "python"  # POSIX layout (macOS/Linux); on Windows the interpreter lives under Scripts/

# 1) Create the venv if it does not exist
if not VENV_DIR.exists():
    print("Creating venv at:", VENV_DIR)
    subprocess.check_call([sys.executable, "-m", "venv", str(VENV_DIR)])
else:
    print("Venv already exists:", VENV_DIR)

# 2) Upgrade pip/setuptools/wheel
subprocess.check_call([str(PY), "-m", "pip", "install", "--upgrade", "pip", "setuptools", "wheel"])

# Helper that runs pip inside the venv
def pip_install(*args):
    return subprocess.check_call([str(PY), "-m", "pip", "install", *args])

# 3) Install the project dependencies, resolving numpy/opencv compatibility
# 3.1 NumPy first (latest release compatible with the interpreter)
pip_install("--upgrade", "numpy")

# 3.2 OpenCV: try a release that supports NumPy >= 2.3; fall back otherwise
try:
    pip_install("--upgrade", "opencv-python>=4.13.0")
except subprocess.CalledProcessError:
    try:
        pip_install("--upgrade", "opencv-python-headless>=4.13.0")
    except subprocess.CalledProcessError:
        # Fallback: pin a compatible pair (NumPy < 2.3 + OpenCV 4.12)
        pip_install("--upgrade", "numpy<2.3.0")
        pip_install("--upgrade", "opencv-python==4.12.0.88")

# 3.3 Remaining libraries that depend on NumPy
for pkg in ["pandas", "matplotlib", "scikit-image"]:
    pip_install("--upgrade", pkg)

# 3.4 Utilities
base_pkgs = [
    "kagglehub==0.2.5",
    "pillow>=10.4.0",
    "seaborn>=0.13.2",
    "tqdm>=4.66.4",
    "ipywidgets>=8.1.5",
    "wandb>=0.17.5",
    "ipykernel>=6.29.5",
]
pip_install("--upgrade", *base_pkgs)

# 4) Register a Jupyter kernel for this venv
kernel_name = "cxr-venv"
kernel_display = "Python (.venv) CXR"
subprocess.check_call([str(PY), "-m", "ipykernel", "install", "--user", "--name", kernel_name, "--display-name", kernel_display])

print("\nDone. In Jupyter, select the kernel:", kernel_display)
print("Venv Python:")
subprocess.check_call([str(PY), "-V"])


Venv already exists: /Users/carlosmejia/Documents/Universidad/Octavo Semestre/Tecnicas de aprendizaje de maquina/Proyecto deteccion de enfermedades/.venv
Collecting numpy
  Using cached numpy-2.3.2-cp313-cp313-macosx_14_0_arm64.whl.metadata (62 kB)
Using cached numpy-2.3.2-cp313-cp313-macosx_14_0_arm64.whl (5.1 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.6
    Uninstalling numpy-2.2.6:
      Successfully uninstalled numpy-2.2.6


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 2.3.2 which is incompatible.

Successfully installed numpy-2.3.2


ERROR: Ignored the following yanked versions: 3.4.11.39, 3.4.17.61, 4.4.0.42, 4.4.0.44, 4.5.4.58, 4.5.5.62, 4.7.0.68
ERROR: Could not find a version that satisfies the requirement opencv-python>=4.13.0 (from versions: 3.4.0.14, 3.4.10.37, 3.4.11.41, 3.4.11.43, 3.4.11.45, 3.4.13.47, 3.4.15.55, 3.4.16.57, 3.4.16.59, 3.4.17.63, 3.4.18.65, 4.3.0.38, 4.4.0.40, 4.4.0.46, 4.5.1.48, 4.5.3.56, 4.5.4.60, 4.5.5.64, 4.6.0.66, 4.7.0.72, 4.8.0.74, 4.8.0.76, 4.8.1.78, 4.9.0.80, 4.10.0.82, 4.10.0.84, 4.11.0.86, 4.12.0.88)
ERROR: No matching distribution found for opencv-python>=4.13.0
ERROR: Ignored the following yanked versions: 3.4.11.39, 3.4.11.41, 4.4.0.40, 4.4.0.42, 4.4.0.44, 4.5.5.62, 4.7.0.68, 4.8.0.74
ERROR: Could not find a version that satisfies the requirement opencv-python-headless>=4.13.0 (from versions: 3.4.10.37, 3.4.11.43, 3.4.11.45, 3.4.13.47, 3.4.15.55, 3.4.16.59, 3.4.17.61, 3.4.17.63, 3.4.18.65, 4.3.0.38, 4

Collecting numpy<2.3.0
  Using cached numpy-2.2.6-cp313-cp313-macosx_14_0_arm64.whl.metadata (62 kB)
Using cached numpy-2.2.6-cp313-cp313-macosx_14_0_arm64.whl (5.1 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.3.2
    Uninstalling numpy-2.3.2:
      Successfully uninstalled numpy-2.3.2
Successfully installed numpy-2.2.6
Installed kernelspec cxr-venv in /Users/carlosmejia/Library/Jupyter/kernels/cxr-venv

Done. In Jupyter, select the kernel: Python (.venv) CXR
Venv Python:
Python 3.13.2


0
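
The log above shows NumPy being upgraded to 2.3.2, conflicting with opencv-python 4.12.0.88 (which requires numpy<2.3.0), and then being rolled back to 2.2.6. One possible simplification, sketched below, is to pass both requirements to a single `pip install` call so that pip resolves them against each other in one pass. The helper names `build_pip_command` and `install_together` are hypothetical and not part of the original cell.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pip_command(python: Path, *requirements: str) -> List[str]:
    """Build one `pip install --upgrade` command covering all requirements."""
    return [str(python), "-m", "pip", "install", "--upgrade", *requirements]


def install_together(python: Path, *requirements: str) -> None:
    """Run the command; pip sees all requirements in the same invocation."""
    subprocess.check_call(build_pip_command(python, *requirements))


# Only the command is built here; nothing is installed by this example.
cmd = build_pip_command(Path(".venv/bin/python"), "numpy", "opencv-python")
print(cmd)
```

In the notebook this would be called as `install_together(PY, "numpy", "opencv-python")` in place of steps 3.1 and 3.2.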

In [37]:
# Select the venv kernel and reset the import state
import os, sys
print("Current kernel:", sys.executable)

# From this point on, the selected kernel is assumed to be "Python (.venv) CXR".
# If the path printed here is not inside .venv, the kernel has not been switched yet.



Current kernel: /usr/local/bin/python3
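
The path printed above (`/usr/local/bin/python3`) is outside `.venv`, which suggests the kernel had not been switched when this cell ran. A small guard such as the following sketch can make that visible early. `is_inside_venv` is a hypothetical helper, and the check assumes that `sys.prefix` points at the venv directory when the venv kernel is active.

```python
import sys
from pathlib import Path


def is_inside_venv(venv_dir: Path, prefix: str = sys.prefix) -> bool:
    """Return True if the interpreter prefix is the given venv directory."""
    return Path(prefix).resolve() == Path(venv_dir).resolve()


# Hypothetical usage in the notebook:
# if not is_inside_venv(PROJECT_ROOT / ".venv"):
#     raise RuntimeError("Select the 'Python (.venv) CXR' kernel before continuing.")
print(is_inside_venv(Path(".venv")))
```

Failing fast here would avoid reaching the import cell with an interpreter that lacks the installed packages.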


In [38]:
# Configuration and utilities (imports libraries and defines helpers)
import os
import sys
import math
import json
from pathlib import Path
from typing import List, Dict, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2
from tqdm import tqdm

# Plot styling
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

# Project paths
PROJECT_ROOT = Path("/Users/carlosmejia/Documents/Universidad/Octavo Semestre/Tecnicas de aprendizaje de maquina/Proyecto deteccion de enfermedades").resolve()
DATA_DIR = PROJECT_ROOT / "data"
DATA_DIR.mkdir(parents=True, exist_ok=True)
print("Project:", PROJECT_ROOT)
print("Data:", DATA_DIR)

# Disable PIL's pixel-count limit so very large images do not trigger warnings or errors
Image.MAX_IMAGE_PIXELS = None


ModuleNotFoundError: No module named 'numpy'

In [None]:
# Download the datasets with kagglehub
import kagglehub

covid_path = kagglehub.dataset_download("tawsifurrahman/covid19-radiography-database")
print("COVID dataset path:", covid_path)

pneumonia_path = kagglehub.dataset_download("paultimothymooney/chest-xray-pneumonia")
print("Pneumonia dataset path:", pneumonia_path)

covid_path = Path(covid_path)
pneumonia_path = Path(pneumonia_path)

assert covid_path.exists(), "COVID dataset path does not exist"
assert pneumonia_path.exists(), "Pneumonia dataset path does not exist"

covid_path, pneumonia_path
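
Before assigning labels, it can help to inspect how the downloaded files are laid out, since the labels are later inferred from folder names. The sketch below counts image files per directory. `count_images_per_dir` is a hypothetical helper, and the demo builds a throwaway folder tree with empty placeholder files rather than reading the real datasets.

```python
import tempfile
from collections import Counter
from pathlib import Path

IMG_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif"}


def count_images_per_dir(root: Path) -> Counter:
    """Count image files per directory, keyed by the directory path relative to root."""
    counts = Counter()
    for p in root.rglob("*"):
        if p.is_file() and p.suffix.lower() in IMG_EXTS:
            counts[p.parent.relative_to(root).as_posix()] += 1
    return counts


# Demo on a temporary tree (placeholder files, not real images)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ["train/NORMAL/a.jpeg", "train/NORMAL/b.jpeg", "train/PNEUMONIA/c.JPG", "notes.txt"]:
        f = root / rel
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()
    demo_counts = count_images_per_dir(root)

for folder, n in sorted(demo_counts.items()):
    print(f"{folder}: {n}")
```

In the notebook, `count_images_per_dir(covid_path)` and `count_images_per_dir(pneumonia_path)` would list the real folders, which makes unexpected subfolders easy to spot before indexing.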


In [None]:
# Helper functions and image-quality analysis
from collections import Counter

IMG_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif"}

def list_images(root: Path) -> List[Path]:
    # Compare the lowercased suffix so files such as "scan.PNG" are also found
    # on case-sensitive filesystems
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMG_EXTS
    )


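def safe_open_image(path: Path) -> "Image.Image | None":
    """Open and fully decode an image; return None if it is missing or unreadable."""
    try:
        img = Image.open(path)
        img.load()  # force decoding now so corrupt files fail here
        return img
    except Exception:
        return None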


def compute_image_stats(path: Path) -> Dict:
    img = safe_open_image(path)
    if img is None:
        return {
            "path": str(path),
            "ok": False,
            "width": None,
            "height": None,
            "mode": None,
            "channels": None,
            "mean": None,
            "std": None,
            "median": None,
            "aspect_ratio": None,
        }
    arr = np.array(img)
    h, w = arr.shape[:2]
    channels = 1 if arr.ndim == 2 else arr.shape[2]
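    if channels == 1:
        vals = arr.astype(np.float32)
    elif channels == 3:
        vals = cv2.cvtColor(arr, cv2.COLOR_RGB2GRAY).astype(np.float32)
    else:
        # Other layouts (e.g. RGBA or LA): let PIL do the grayscale conversion
        vals = np.array(img.convert("L"), dtype=np.float32)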
    mean = float(np.mean(vals))
    std = float(np.std(vals))
    median = float(np.median(vals))
    return {
        "path": str(path),
        "ok": True,
        "width": int(w),
        "height": int(h),
        "mode": img.mode,
        "channels": int(channels),
        "mean": mean,
        "std": std,
        "median": median,
        "aspect_ratio": float(w / h) if h > 0 else None,
    }


def index_dataset(base: Path, class_dirs: Dict[str, List[str]] = None) -> pd.DataFrame:
    """Index images and infer the class from folder names. class_dirs: maps class -> list of folder-name fragments."""
    all_files = list_images(base)
    records = []
    for p in all_files:
        rel = p.relative_to(base)
        # Directory components only, deepest first, so the folder closest to the file
        # wins over a parent folder whose name happens to contain a class keyword
        dirs = [s.lower() for s in reversed(rel.parts[:-1])]
        label = None
        if class_dirs:
            for d in dirs:
                for cls, keys in class_dirs.items():
                    if any(k.lower() in d for k in keys):
                        label = cls
                        break
                if label:
                    break
        records.append({
            "path": str(p),
            "relpath": str(rel),
            "label": label,
            "filename": p.name,
            "ext": p.suffix.lower(),
            "source_root": str(base),
        })
    return pd.DataFrame(records)


def batched(iterable, n=512):
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch


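def enrich_with_stats(df: pd.DataFrame, sample_limit: int = None, seed: int = 42) -> pd.DataFrame:
    # Draw a random sample rather than the first rows: the index is ordered by
    # dataset and path, so a head slice would cover only the first folders
    if sample_limit is not None and sample_limit < len(df):
        df = df.sample(n=sample_limit, random_state=seed)
    paths = df["path"].tolist()
    stats = []
    for b in tqdm(batched(paths, 256), total=math.ceil(len(paths)/256)):
        for p in b:
            stats.append(compute_image_stats(Path(p)))
    stats_df = pd.DataFrame(stats)
    # Every returned row was processed, so the "ok" column contains no NaN
    return df.merge(stats_df, on="path", how="left")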
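
`enrich_with_stats` limits the work through `sample_limit`. If the sample should cover every dataset and class, one option is to cap the number of rows per group. The following is a minimal sketch: `sample_per_group` and its parameters are hypothetical, and the toy DataFrame stands in for the real index.

```python
import pandas as pd


def sample_per_group(df: pd.DataFrame, keys, n_per_group: int, seed: int = 42) -> pd.DataFrame:
    """Shuffle the rows, then keep at most n_per_group rows for each key combination."""
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(keys).head(n_per_group).sort_index()


toy = pd.DataFrame({
    "path": [f"img_{i}.png" for i in range(10)],
    "dataset": ["A"] * 6 + ["B"] * 4,
    "label": ["NORMAL"] * 5 + ["COVID"] + ["NORMAL"] * 2 + ["PNEUMONIA"] * 2,
})
subset = sample_per_group(toy, ["dataset", "label"], n_per_group=2)
print(subset.groupby(["dataset", "label"]).size())
```

Groups smaller than the cap are kept whole, so rare classes are not dropped from the sample.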


In [None]:
# Index both datasets and summarize

# 1) COVID-19 Radiography Database
covid_classes = {
    "COVID": ["COVID", "COVID-19"],
    "NORMAL": ["NORMAL"],
    "PNEUMONIA": ["PNEUMONIA"],
    "LUNG_OPACITY": ["Lung_Opacity", "LUNG_OPACITY"],
}

covid_df = index_dataset(covid_path, covid_classes)

# 2) Chest X-Ray Pneumonia
pneumonia_classes = {
    "NORMAL": ["NORMAL"],
    "PNEUMONIA": ["PNEUMONIA"],
}

pneu_df = index_dataset(pneumonia_path, pneumonia_classes)

# Add a 'dataset' column
covid_df["dataset"] = "COVID19-Radiography"
pneu_df["dataset"] = "ChestXray-Pneumonia"

# Concatenate and apply basic cleaning
all_df = pd.concat([covid_df, pneu_df], ignore_index=True)
all_df = all_df[all_df["ext"].isin(list(IMG_EXTS))]
all_df = all_df[all_df["label"].notna()]

print("Valid images:", len(all_df))

# Summary by dataset and class
summary = (
    all_df.groupby(["dataset", "label"])
    .agg(n=("path", "count"))
    .reset_index()
)
summary_pivot = summary.pivot(index="dataset", columns="label", values="n").fillna(0).astype(int)
summary, summary_pivot
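
One way to quantify class imbalance in a summary of this shape is the ratio between the largest and smallest class within each dataset. The sketch below uses made-up counts purely for illustration; they are not the real dataset figures, and `imbalance_ratio` is a hypothetical helper.

```python
import pandas as pd


def imbalance_ratio(summary: pd.DataFrame) -> pd.Series:
    """Per dataset: size of the largest class divided by size of the smallest class."""
    grouped = summary.groupby("dataset")["n"]
    return grouped.max() / grouped.min()


# Illustrative counts only (not measured from the datasets)
toy_summary = pd.DataFrame({
    "dataset": ["A", "A", "B", "B", "B"],
    "label": ["NORMAL", "PNEUMONIA", "COVID", "NORMAL", "PNEUMONIA"],
    "n": [100, 300, 50, 200, 100],
})
ratios = imbalance_ratio(toy_summary)
print(ratios)
```

A ratio close to 1 indicates balanced classes; larger values indicate that the smallest class is underrepresented by that factor.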


In [None]:
# Class distribution plots

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bars per dataset
sns.barplot(data=summary, x="dataset", y="n", hue="label", ax=axes[0])
axes[0].set_title("Class distribution by dataset")
axes[0].set_ylabel("Number of images")
axes[0].set_xlabel("")
axes[0].tick_params(axis='x', rotation=15)

# Global bars
global_counts = all_df["label"].value_counts().reset_index()
global_counts.columns = ["label", "n"]
sns.barplot(data=global_counts, x="label", y="n", ax=axes[1])
axes[1].set_title("Global class distribution")
axes[1].set_ylabel("Number of images")
axes[1].set_xlabel("")

plt.tight_layout()
plt.show()


In [None]:
# Image-quality metrics on a sample, with plots

SAMPLE_LIMIT = 5000  # adjust to the available resources
quality_df = enrich_with_stats(all_df, sample_limit=SAMPLE_LIMIT)

# .eq(True) also handles rows whose "ok" value is NaN (no stats computed)
ok_df = quality_df[quality_df["ok"].eq(True)]
print("Rows with a readable image:", int(ok_df.shape[0]))

# Distributions of resolution and brightness/contrast
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

sns.histplot(ok_df["width"], bins=40, ax=axes[0,0])
axes[0,0].set_title("Width (px)")

sns.histplot(ok_df["height"], bins=40, ax=axes[0,1])
axes[0,1].set_title("Height (px)")

sns.histplot(ok_df["aspect_ratio"], bins=40, ax=axes[0,2])
axes[0,2].set_title("Aspect ratio (w/h)")

sns.histplot(ok_df["mean"], bins=40, ax=axes[1,0])
axes[1,0].set_title("Brightness (grayscale mean)")

sns.histplot(ok_df["std"], bins=40, ax=axes[1,1])
axes[1,1].set_title("Contrast (standard deviation)")

sns.histplot(ok_df["median"], bins=40, ax=axes[1,2])
axes[1,2].set_title("Median (grayscale)")

plt.tight_layout()
plt.show()
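
The histograms above pool all classes. To check whether resolution differs by class (one of the possible biases listed in the introduction), the per-class medians can be tabulated. A minimal sketch with a toy table follows. The column names match those produced by `compute_image_stats`, the values are invented for illustration, and `resolution_by_class` is a hypothetical helper.

```python
import pandas as pd


def resolution_by_class(df: pd.DataFrame) -> pd.DataFrame:
    """Median width, height and aspect ratio per label, plus the number of images."""
    return df.groupby("label").agg(
        n=("width", "size"),
        median_width=("width", "median"),
        median_height=("height", "median"),
        median_aspect=("aspect_ratio", "median"),
    )


# Invented values for illustration only
toy = pd.DataFrame({
    "label": ["NORMAL", "NORMAL", "NORMAL", "COVID", "COVID"],
    "width": [1000, 1200, 1400, 299, 299],
    "height": [800, 1000, 1000, 299, 299],
})
toy["aspect_ratio"] = toy["width"] / toy["height"]
table = resolution_by_class(toy)
print(table)
```

In the notebook, `resolution_by_class(ok_df)` would give the real table. Large gaps between classes would mean a model could pick up on resolution instead of pathology.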


In [None]:
# Preview examples per class

def show_examples(df: pd.DataFrame, cls: str, n: int = 6):
    candidates = df[(df["label"] == cls) & df["ok"].eq(True)]
    if candidates.empty:
        # Nothing to plot; avoids a division by zero when computing the grid
        print(f"No readable images for class {cls}")
        return
    sub = candidates.sample(min(n, len(candidates)), random_state=42)
    c = len(sub)
    cols = min(6, c)
    rows = int(math.ceil(c / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(3*cols, 3*rows))
    axes = np.array(axes).reshape(-1)
    for ax, (_, row) in zip(axes, sub.iterrows()):
        try:
            img = Image.open(row["path"]).convert("L")
            ax.imshow(img, cmap="gray")
            ax.set_title(Path(row["path"]).parent.name[:20])
            ax.axis('off')
        except Exception:
            ax.axis('off')
    # Hide any unused axes in the grid
    for j in range(c, len(axes)):
        axes[j].axis('off')
    plt.suptitle(f"Examples: {cls}")
    plt.tight_layout()
    plt.show()

for cls in sorted(ok_df["label"].unique()):
    show_examples(ok_df, cls, n=6)
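
The introduction lists duplicates among the possible biases, but no cell above checks for them. A simple first pass is to group files by a hash of their bytes, which finds exact copies only (re-encoded or resized copies would need a different technique, such as a perceptual hash). The sketch below uses hypothetical helper names and a temporary folder for the demo.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file contents, read in chunks to keep memory use low."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Map hash -> files sharing it, keeping only hashes seen more than once."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_sha256(p)].append(p)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}


# Demo with three small files, two of which have identical bytes
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.png").write_bytes(b"same-bytes")
    (root / "b.png").write_bytes(b"same-bytes")
    (root / "c.png").write_bytes(b"different")
    dupes = find_exact_duplicates(sorted(root.iterdir()))
    dupe_names = [sorted(p.name for p in ps) for ps in dupes.values()]

print(dupe_names)
```

In the notebook this could be run as `find_exact_duplicates(Path(p) for p in all_df["path"])`. Duplicates that cross the train/test folders or the two datasets deserve particular attention, since they can inflate evaluation scores.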
