# 🧪 Exploration and Initial Analysis: Breast Cancer Wisconsin (Diagnostic)

Starter notebook that loads the data, imports the EDA utilities, and runs a baseline exploratory analysis. It assumes a project layout with a `src/` package and the dataset at `data/wdbc.csv`.

## 🎯 Objectives
- Make sure the project root is on `sys.path` so that `src/` can be imported
- Load the dataset from `data/wdbc.csv` (if the file is missing, generate it via the project CLI, falling back to `sklearn`)
- Run quick checks: dimensions, sample rows, dtypes, **target-variable balance**
- Produce summaries with `skim_numeric` and `skim_categorical` and save them to `data/interim/eda/`
- Basic visualizations with **matplotlib** (no seaborn)

In [1]:
from pathlib import Path
import sys

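def add_project_root(max_up=3, marker='src'):
    """Walk up from the current directory until a folder containing `marker` is found,
    then put that folder on sys.path so `src` becomes importable."""
    p = Path().resolve()
    for _ in range(max_up + 1):
        if (p / marker).is_dir():
            # Avoid adding the same entry again when the cell is re-run.
            if str(p) not in sys.path:
                sys.path.insert(0, str(p))
            return p
        p = p.parent
    raise RuntimeError(f"Could not find the '{marker}' folder within {max_up} levels above the current directory.")

ROOT = add_project_root()
print('Repo root:', ROOT)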

Repo root: /home/carloslessa/FCD/POSTECH/modulo3/tech-challenge-3
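
The root lookup above depends on the notebook's working directory. A variant that takes an explicit starting directory is easier to check in isolation. The following is a minimal sketch under that assumption; `find_project_root` and its `start` parameter are hypothetical names and are not part of the project.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "src", max_up: int = 3) -> Path:
    """Walk up from `start` (at most `max_up` levels) until a folder containing `marker` is found."""
    p = start.resolve()
    for _ in range(max_up + 1):
        if (p / marker).is_dir():
            return p
        p = p.parent
    raise RuntimeError(f"Could not find '{marker}' within {max_up} levels above {start}.")


# Usage on a throwaway directory tree: <tmp>/src and <tmp>/notebooks/eda
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    nested = root / "notebooks" / "eda"
    nested.mkdir(parents=True)
    found = find_project_root(nested)
    found_matches = (found == root)
    print(found_matches)
```

Unlike `add_project_root`, this sketch leaves `sys.path` untouched; the caller decides what to do with the returned path.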


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from typing import Optional
try:
    # Project-local EDA helpers; the rest of the notebook degrades gracefully without them.
    from src.eda.skim import skim_numeric, skim_categorical
    SKIM_OK = True
except Exception as e:
    SKIM_OK = False
    print('Warning: could not import skim_numeric/skim_categorical. Error:', e)
    print('Check that src/eda/skim.py exists and that src/ and src/eda/ each contain an __init__.py.')

In [None]:
import sys, subprocess
import pandas as pd

DATA_PATH = ROOT / "data" / "wdbc.csv"
DATA_PATH.parent.mkdir(parents=True, exist_ok=True)

if not DATA_PATH.exists():
    print(f"File not found at {DATA_PATH}. Generating it via the make_wdbc_dataset CLI ...")
    try:
        # Run from the repo root so that `-m src.data.make_wdbc_dataset` resolves.
        subprocess.run(
            [sys.executable, "-m", "src.data.make_wdbc_dataset", "--out", str(DATA_PATH)],
            check=True,
            cwd=ROOT,
        )
    except Exception as e:
        print("Failed to run the CLI module; generating via sklearn instead. Error:", e)
        from sklearn.datasets import load_breast_cancer
        cancer = load_breast_cancer()
        # Replace spaces in feature names so they are valid, easy-to-type column names.
        df_tmp = pd.DataFrame(
            cancer["data"],
            columns=[c.replace(" ", "_") for c in cancer["feature_names"]]
        )
        # Map the integer target to the Portuguese class labels used in the CSV.
        target = pd.Categorical.from_codes(cancer["target"], cancer["target_names"])
        target = target.rename_categories({"malignant": "Maligno", "benign": "Benigno"})
        df_tmp["diagnosis"] = target.astype(str)
        df_tmp.to_csv(DATA_PATH, index=False)
        print("Generated and saved to", DATA_PATH)

df = pd.read_csv(DATA_PATH)
print("Shape:", df.shape)
df.head(3)


Shape: (569, 31)


| | mean_radius | mean_texture | mean_perimeter | mean_area | mean_smoothness | mean_compactness | mean_concavity | mean_concave_points | mean_symmetry | mean_fractal_dimension | ... | worst_texture | worst_perimeter | worst_area | worst_smoothness | worst_compactness | worst_concavity | worst_concave_points | worst_symmetry | worst_fractal_dimension | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | Maligno |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | Maligno |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | Maligno |


In [None]:
print('\n### Info')
print(df.dtypes.head())
# Cell-level totals across the whole frame (not per column).
non_null = df.notna().sum().sum()
null_total = df.isna().sum().sum()
print(f'Non-null values (total): {non_null}')
print(f'Null values (total): {null_total}')

# deep=True also counts the memory held by the string values in object columns.
mem_mb = df.memory_usage(deep=True).sum() / (1024 ** 2)
print(f'Memory ~ {mem_mb:.2f} MB')

## ⚖️ Target-variable balance (`diagnosis`)

In [None]:
if 'diagnosis' in df.columns:
    vc = df['diagnosis'].value_counts(dropna=False)
    print(vc)
    fig = plt.figure()
    plt.bar(vc.index.astype(str), vc.values)
    plt.title('Distribution of the target variable: diagnosis')
    plt.xlabel('Class')
    plt.ylabel('Count')
    # Annotate each bar with its count.
    for i, v in enumerate(vc.values):
        plt.text(i, v, str(v), ha='center', va='bottom')
    plt.show()
else:
    print('Column diagnosis not found.')
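
Raw counts are easier to judge next to relative frequencies. The sketch below uses a small made-up Series (not the real dataset) to show how class shares and a majority/minority ratio could be computed with pandas; the variable names are illustrative.

```python
import pandas as pd

# Toy labels for illustration only; in the notebook this would be df["diagnosis"].
labels = pd.Series(["Benigno"] * 6 + ["Maligno"] * 4, name="diagnosis")

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Both Series are indexed by class label, so they align row by row.
summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
imbalance_ratio = counts.max() / counts.min()

print(summary)
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests balanced classes; a clearly larger ratio is usually a reason to stratify later splits and to look beyond plain accuracy.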

## 🧩 Missing values

In [None]:
# Missing-value count per column, largest first.
miss = df.isna().sum().sort_values(ascending=False)
display(miss.to_frame('missing').head(10))
fig = plt.figure()
top = miss.head(10)
plt.barh(top.index.astype(str), top.values)
plt.title('Top 10 columns with the most missing values')
plt.xlabel('Number of NAs')
# Put the column with the most missing values at the top.
plt.gca().invert_yaxis()
plt.show()

## 🔎 Summaries (skim)

In [None]:
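# Anchor the output folder at the repo root so it does not depend on the working directory.
out_dir = ROOT / 'data' / 'interim' / 'eda'
out_dir.mkdir(parents=True, exist_ok=True)

if SKIM_OK:
    skim_num = skim_numeric(df)
    skim_cat = skim_categorical(df)
    display(skim_num.head(10))
    display(skim_cat)
    # Assumption: both helpers return DataFrames; the file names below are illustrative.
    skim_num.to_csv(out_dir / 'skim_numeric.csv', index=False)
    skim_cat.to_csv(out_dir / 'skim_categorical.csv', index=False)
else:
    print('Detailed summary (skim) unavailable. Check the import of src/eda/skim.py.')
    # Simple fallback: describe()
    desc = df.describe().T.reset_index().rename(columns={'index':'variable'})
    display(desc.head(10))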
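
When the skim helpers are unavailable, the `describe()` fallback can be extended with a couple of extra columns. The sketch below is not the project's `skim_numeric` (its implementation is not shown here); `numeric_summary` is a hypothetical helper that adds the missing-value share and skewness to the usual statistics.

```python
import numpy as np
import pandas as pd


def numeric_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: describe() statistics plus missing share and skewness."""
    num = frame.select_dtypes(include=[np.number])
    out = num.describe().T
    out["missing_pct"] = num.isna().mean() * 100
    out["skew"] = num.skew()
    return out.reset_index().rename(columns={"index": "variable"})


# Toy frame for illustration: one numeric column with a gap, one complete, one non-numeric.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [10, 20, 30, 40],
    "label": ["x", "y", "x", "y"],
})
summary = numeric_summary(toy)
print(summary)
```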


## 🔗 Correlations (numeric features)

In [None]:
num_df = df.select_dtypes(include=[np.number])
if not num_df.empty:
    # num_df is already numeric-only, so corr() needs no extra filtering.
    corr = num_df.corr()
    # A larger canvas and a fixed [-1, 1] color scale keep the tick labels legible and the colors comparable.
    fig = plt.figure(figsize=(10, 8))
    plt.imshow(corr, aspect='auto', interpolation='nearest', vmin=-1, vmax=1)
    plt.title('Correlation Matrix (numeric features)')
    plt.colorbar()
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.index)), corr.index)
    plt.tight_layout()
    plt.show()
else:
    print('No numeric columns available for correlation.')
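
A heatmap shows the overall pattern, but a ranked list of strongly correlated pairs is often easier to act on. The sketch below runs on synthetic data, not the WDBC features; `top_correlated_pairs` and the 0.9 threshold are illustrative choices rather than anything defined in the project.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "abs_corr"]
    # The comparison is False for NaN, so masked-out cells are filtered here as well.
    pairs = pairs[pairs["abs_corr"] >= threshold]
    return pairs.sort_values("abs_corr", ascending=False).reset_index(drop=True)


# Synthetic data for illustration: `perimeter` is almost a linear function of `radius`.
rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.0, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0.0, 0.5, size=200),
    "texture": rng.normal(19.0, 4.0, size=200),
})
pairs = top_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

In the notebook, the same function could be called on `num_df` to flag near-duplicate features before modeling.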

## 📈 Univariate histograms (a sample of numeric features)

In [None]:
# Plot only the first six numeric columns to keep the output short.
cols = list(num_df.columns)[:6]
for c in cols:
    fig = plt.figure()
    plt.hist(num_df[c].dropna(), bins=20)
    plt.title(f'Histogram: {c}')
    plt.xlabel(c)
    plt.ylabel('Frequency')
    plt.show()

## ➡️ Next Steps
- Feature engineering and standardization/normalization (pipeline in `src/features/build_features.py`)
- Train/validation/test split (`src/data/split.py`)
- Model training and comparison (`src/models/train.py`, `src/models/evaluate.py`)
- Export artifacts to `models/` and serve predictions through a FastAPI API
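
The split and standardization steps listed above could look roughly like the sketch below. The contents of `src/data/split.py` and `src/features/build_features.py` are not shown in this notebook, so this is only one plausible approach: it assumes scikit-learn's `train_test_split` with stratification, a roughly 60/20/20 split, and a `StandardScaler` fitted on the training portion only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset; the notebook itself reads data/wdbc.csv instead.
X, y = load_breast_cancer(return_X_y=True)

# First hold out 20% for test, then carve 25% of the remainder for validation
# (0.25 * 0.80 = 0.20), giving roughly a 60/20/20 split.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42
)

# Fit the scaler on the training split only to avoid leaking validation/test statistics.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_val_s.shape, X_test_s.shape)
```

Stratifying on `y` keeps the class proportions similar across the three splits, which matters when the target is not perfectly balanced.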