# 01 – ECG Preprocessing (real dataset)

This notebook documents the preprocessing of the real ECG image dataset.

- Original dataset in `data/ecg_raw/`, with the folders:
  - `ECG Images of Myocardial Infarction Patients`
  - `ECG Images of Patient that have abnormal heartbeat`
  - `ECG Images of Patient that have History of MI`
  - `Normal Person ECG Images (284x12=3408)` (the folder-to-label mapping in this notebook uses the name without the `(284x12=3408)` suffix)

- **Portuguese** labels used in the project:
  - `infarto_mi`
  - `batimento_anormal`
  - `historico_infarto`
  - `normal`

Goals:
1. Map folders → labels.
2. Generate `apps/vision-assistant/data/manifest.csv` with the columns:
   - `filepath`, `label`, `split` (`train` / `val` / `test`).
3. Check the data distribution.

# Imports and paths

In [7]:
import os
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

PROJECT_ROOT = Path(".").resolve().parents[2]  # .../cardioia-app.fase4/cardioia (working directory must be three levels below the project root)
DATA_DIR = PROJECT_ROOT / "apps" / "vision-assistant" / "data"
RAW_DIR = DATA_DIR / "ecg_raw"
OUT_CSV = DATA_DIR / "manifest.csv"

PROJECT_ROOT, DATA_DIR, RAW_DIR, OUT_CSV

(PosixPath('/Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia'),
 PosixPath('/Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia/apps/vision-assistant/data'),
 PosixPath('/Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia/apps/vision-assistant/data/ecg_raw'),
 PosixPath('/Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia/apps/vision-assistant/data/manifest.csv'))
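The `parents[2]` lookup above only works when the notebook is started from a directory exactly three levels below the project root. A minimal sketch of a more tolerant alternative follows. It assumes the root can be recognised by a marker directory; `find_project_root` and the `"apps"` marker are hypothetical names for illustration, not part of the project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "apps") -> Path:
    """Walk upward from `start` until a directory containing `marker/` is found.

    Hypothetical helper: the marker name "apps" is an assumption based on the
    paths printed above.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains a '{marker}/' directory")
```

With this helper, `PROJECT_ROOT = find_project_root(Path("."))` would give the same root regardless of how deep the working directory is.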

# Folder → label mapping

In [8]:
FOLDER_TO_LABEL = {
    "ECG Images of Myocardial Infarction Patients": "infarto_mi",
    "ECG Images of Patient that have abnormal heartbeat": "batimento_anormal",
    "ECG Images of Patient that have History of MI": "historico_infarto",
    "Normal Person ECG Images": "normal",
}

FOLDER_TO_LABEL

{'ECG Images of Myocardial Infarction Patients': 'infarto_mi',
 'ECG Images of Patient that have abnormal heartbeat': 'batimento_anormal',
 'ECG Images of Patient that have History of MI': 'historico_infarto',
 'Normal Person ECG Images': 'normal'}
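Because the mapping keys must match the folder names on disk exactly, a quick comparison can catch a renamed or suffixed folder before the scan silently skips it. The sketch below is one way to do that; `check_folder_mapping` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def check_folder_mapping(
    raw_dir: Path, folder_to_label: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (missing, unmapped).

    missing  : keys of the mapping that have no folder on disk
    unmapped : folders on disk that have no entry in the mapping
    """
    on_disk = {p.name for p in raw_dir.iterdir() if p.is_dir()}
    mapped = set(folder_to_label)
    return sorted(mapped - on_disk), sorted(on_disk - mapped)
```

Calling `check_folder_mapping(RAW_DIR, FOLDER_TO_LABEL)` and asserting that both lists are empty would make a mismatch fail loudly instead of producing a smaller dataset.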

# Scanning the images

In [9]:
EXTS = {".jpg", ".jpeg", ".png"}

rows = []
for folder, label in FOLDER_TO_LABEL.items():
    folder_path = RAW_DIR / folder
    print("Folder:", folder_path)
    if not folder_path.is_dir():
        print("  [WARNING] folder not found, skipping.")
        continue

    for fname in os.listdir(folder_path):
        ext = os.path.splitext(fname)[1].lower()
        if ext not in EXTS:
            continue  # skip non-image files
        fpath = folder_path / fname
        rows.append({"filepath": str(fpath), "label": label})

len(rows)

Folder: /Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia/apps/vision-assistant/data/ecg_raw/ECG Images of Myocardial Infarction Patients
Folder: /Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia/apps/vision-assistant/data/ecg_raw/ECG Images of Patient that have abnormal heartbeat
Folder: /Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia/apps/vision-assistant/data/ecg_raw/ECG Images of Patient that have History of MI
Folder: /Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia/apps/vision-assistant/data/ecg_raw/Normal Person ECG Images


928
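Two properties of the scan above are worth noting. `os.listdir` returns entries in arbitrary order, so the row order (and therefore a seeded split) may differ between machines. The absolute paths also tie the manifest to one user's home directory. The sketch below shows a variant that sorts the entries and stores paths relative to a base directory. `scan_images` and its `base_dir` parameter are hypothetical, and any consumer would then need to join paths with the same base.

```python
from pathlib import Path

EXTS = {".jpg", ".jpeg", ".png"}


def scan_images(
    raw_dir: Path, folder_to_label: dict[str, str], base_dir: Path
) -> list[dict[str, str]]:
    """Collect image rows in a deterministic order, with paths relative to base_dir."""
    rows = []
    for folder, label in folder_to_label.items():
        folder_path = raw_dir / folder
        if not folder_path.is_dir():
            continue  # missing folder: skipped silently in this sketch
        for fpath in sorted(folder_path.iterdir()):  # sorted => stable row order
            if fpath.is_file() and fpath.suffix.lower() in EXTS:
                rel = fpath.relative_to(base_dir).as_posix()
                rows.append({"filepath": rel, "label": label})
    return rows
```

A call such as `scan_images(RAW_DIR, FOLDER_TO_LABEL, base_dir=DATA_DIR)` would yield entries like `ecg_raw/<folder>/<file>`.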

# DataFrame and raw distribution

In [10]:
df = pd.DataFrame(rows)
df.head(), df["label"].value_counts()

(                                            filepath       label
 0  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...  infarto_mi
 1  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...  infarto_mi
 2  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...  infarto_mi
 3  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...  infarto_mi
 4  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...  infarto_mi,
 label
 normal               284
 infarto_mi           239
 batimento_anormal    233
 historico_infarto    172
 Name: count, dtype: int64)
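To read the counts above as proportions, the short calculation below converts them to class shares and a largest-to-smallest ratio. It uses only the four counts reported in the output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "normal": 284,
    "infarto_mi": 239,
    "batimento_anormal": 233,
    "historico_infarto": 172,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<18} {share:6.1%}")
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

In pandas, `df["label"].value_counts(normalize=True)` gives the same shares directly.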

# Stratified train/val/test split

In [11]:
train_df, temp_df = train_test_split(
    df, test_size=0.3, stratify=df["label"], random_state=42
)

val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42
)

train_df = train_df.copy()
val_df = val_df.copy()
test_df = test_df.copy()

train_df["split"] = "train"
val_df["split"] = "val"
test_df["split"] = "test"

full_df = pd.concat([train_df, val_df, test_df], ignore_index=True)
full_df.head()

                                            filepath              label  split
0  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...  batimento_anormal  train
1  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...         infarto_mi  train
2  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...         infarto_mi  train
3  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...  historico_infarto  train
4  /Users/anakolodji/Desktop/ia/CardioIA/cardioia...  batimento_anormal  train
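The two-stage split holds out 30% of the rows and then halves that holdout into validation and test. The sketch below works through the expected sizes. It assumes the held-out side of each split is rounded up and the other side receives the remainder, which is consistent with the counts reported later in this notebook. `two_stage_split_sizes` is a hypothetical helper.

```python
import math


def two_stage_split_sizes(
    n: int, holdout_frac: float = 0.3, test_frac_of_holdout: float = 0.5
) -> tuple[int, int, int]:
    """Return (n_train, n_val, n_test) for a 70/15/15-style two-stage split.

    Assumption: the held-out part of each stage is rounded up (ceil) and the
    remainder goes to the other part.
    """
    n_holdout = math.ceil(holdout_frac * n)      # rows set aside for val + test
    n_train = n - n_holdout
    n_test = math.ceil(test_frac_of_holdout * n_holdout)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


print(two_stage_split_sizes(928))  # (649, 139, 140)
```

For 928 images, the holdout is 279 rows, which cannot be halved evenly, so the test set ends up one image larger than the validation set.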


# Final checks and saving the manifest

In [12]:
print("Count per split:")
display(full_df["split"].value_counts())

print("\nCount per label:")
display(full_df["label"].value_counts())

OUT_CSV.parent.mkdir(parents=True, exist_ok=True)
full_df.to_csv(OUT_CSV, index=False)
OUT_CSV, OUT_CSV.exists()

Count per split:


split
train    649
test     140
val      139
Name: count, dtype: int64


Count per label:


label
normal               284
infarto_mi           239
batimento_anormal    233
historico_infarto    172
Name: count, dtype: int64

(PosixPath('/Users/anakolodji/Desktop/ia/CardioIA/cardioia-app.fase4/cardioia/apps/vision-assistant/data/manifest.csv'),
 True)
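Once the file is written, it can be re-read and checked independently of the in-memory DataFrame. The sketch below checks the header, the split names, and that no image path appears more than once (a repeated path could leak an image across splits). `validate_manifest` is a hypothetical helper that uses only the standard library.

```python
import csv
from collections import Counter
from pathlib import Path

EXPECTED_COLUMNS = ["filepath", "label", "split"]
VALID_SPLITS = {"train", "val", "test"}


def validate_manifest(csv_path: Path) -> Counter:
    """Check the manifest structure and return the number of rows per split."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        rows = list(reader)

    unknown = {r["split"] for r in rows} - VALID_SPLITS
    if unknown:
        raise ValueError(f"unknown split values: {sorted(unknown)}")

    path_counts = Counter(r["filepath"] for r in rows)
    duplicates = [p for p, c in path_counts.items() if c > 1]
    if duplicates:
        raise ValueError(f"{len(duplicates)} file path(s) appear more than once")

    return Counter(r["split"] for r in rows)
```

Running `validate_manifest(OUT_CSV)` on the file saved above should return the same per-split counts that were displayed.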

## Conclusion

- `manifest.csv` written to `apps/vision-assistant/data/manifest.csv` with the columns `filepath`, `label`, `split`.
- 928 ECG images in total after scanning the original folders.
- Distribution by class (labels in Portuguese):
  - `normal`: 284 images
  - `infarto_mi`: 239 images
  - `batimento_anormal`: 233 images
  - `historico_infarto`: 172 images
- Stratified train/validation/test split, which preserves the class proportions in each subset (the classes themselves are not balanced: 284 `normal` vs. 172 `historico_infarto`):
  - `train`: 649 images (~70%)
  - `val`: 139 images (~15%)
  - `test`: 140 images (~15%)

This manifest is consumed by:
- `src/train.py` (training the ResNet18 model),
- `src/evaluate.py` (evaluation on the test set),
- and, indirectly, by the Flask application when it loads the trained model for inference.
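The scripts listed above are not shown in this notebook. As an illustration only, the sketch below shows one way a consumer could read a single split from the manifest and turn the Portuguese labels into integer class indices. `load_split` is hypothetical and is not taken from `src/train.py` or `src/evaluate.py`.

```python
import csv
from pathlib import Path


def load_split(
    csv_path: Path, split: str
) -> tuple[list[tuple[str, int]], dict[str, int]]:
    """Return ([(filepath, class_index), ...], class_to_idx) for one split.

    Class indices are derived from the labels of ALL rows (sorted by name),
    so the same label maps to the same index in train, val and test.
    """
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    classes = sorted({r["label"] for r in rows})
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (r["filepath"], class_to_idx[r["label"]])
        for r in rows
        if r["split"] == split
    ]
    return samples, class_to_idx
```

Deriving the index mapping from the whole manifest, rather than from one split, keeps training and evaluation aligned even if a class were absent from a small split.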