# MVP: Wear Prediction (Wr/Wm), Execution

> **Temporal granularity:** 1 sample per **hour** (hourly timestamps).  
> **Base dataset:** `A1_ML_DL.csv` (2 header rows: **[name | dimension]**).

## Objective
1) **Generate Wr and Wm per hour** using the physical variables from **column 152 (`TAU_DENSA`) through the last column**, following the equations in the internal PDF.  
2) **Avoid leakage:** the columns used in the calculation (152→end) are **not** used as model features.  
3) Produce two files:
   - `A1_ML_DL_rotulado.csv` → contains all columns **plus** `wr_kg_m2_h`, `wm_kg_m2_h` (the 2nd row preserves the dimensions).
   - `A1_ML_DL_features.csv` → modeling version **without** columns 152→end (keeps Wr/Wm as targets).
4) **Quick EDA** of the labels (distribution, nulls, statistics).
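
A minimal sketch of the leak-guard idea in item 2, assuming a toy frame with made-up column names (`steam_flow`, `coal_flow`, `rho_gas`) and a hypothetical helper `split_features_and_labels`. It only illustrates dropping the block from `TAU_DENSA` to the last column while keeping the targets; it is not the notebook's own implementation.

```python
import pandas as pd

def split_features_and_labels(df, first_physical_col, target_cols):
    """Drop the physical block (first_physical_col .. last column), keep the targets."""
    cols = list(df.columns)
    start = cols.index(first_physical_col)
    # Everything from the first physical column onward is excluded, except the targets.
    physical = [c for c in cols[start:] if c not in target_cols]
    return df.drop(columns=physical), physical

# Toy frame: two process signals, a two-column physical block, and the two targets.
toy = pd.DataFrame({
    "steam_flow": [1.0, 2.0],
    "coal_flow": [3.0, 4.0],
    "TAU_DENSA": [850.0, 860.0],
    "rho_gas": [0.30, 0.31],
    "wr_kg_m2_h": [0.10, 0.20],
    "wm_kg_m2_h": [0.01, 0.02],
})
features, dropped = split_features_and_labels(
    toy, "TAU_DENSA", ["wr_kg_m2_h", "wm_kg_m2_h"]
)
print(list(features.columns))
print(dropped)
```

Selecting by name rather than by a fixed position keeps the guard valid if columns are added before the physical block.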

## Why **not** shuffle?
The data is a **time series**. Shuffling would leak information from the **future** into training and make the metrics **optimistic**.  
Instead, split by **time blocks** (Train → Validation → Test) and, where possible, run **backtesting** (rolling window).
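
The split described above can be sketched as follows. The fractions, window sizes, and the helper names `chronological_split` and `walk_forward_folds` are assumptions chosen for illustration; the notebook does not prescribe them.

```python
import numpy as np
import pandas as pd

def chronological_split(df, train_frac=0.70, val_frac=0.15):
    """Split a time-ordered frame into contiguous train/validation/test blocks."""
    n = len(df)
    i_train = round(n * train_frac)
    i_val = round(n * (train_frac + val_frac))
    return df.iloc[:i_train], df.iloc[i_train:i_val], df.iloc[i_val:]

def walk_forward_folds(n_rows, train_size, test_size):
    """Yield (train_idx, test_idx) pairs where the test block always follows the train block."""
    start = 0
    while start + train_size + test_size <= n_rows:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += test_size  # slide the window forward by one test block

# Toy hourly series (100 hours).
idx = pd.date_range("2024-01-01", periods=100, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"x": np.arange(100.0)}, index=idx)

train, val, test = chronological_split(toy)
folds = list(walk_forward_folds(len(toy), train_size=48, test_size=24))
print(len(train), len(val), len(test), len(folds))
```

No row in a later block ever precedes a row in an earlier block, which is the property that shuffling would destroy.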

## Modeling routes (summary)
- **Supervised (main MVP):** predict `Wr(t+H)` and `Wm(t+H)` with:
  - **Tabular** (XGBoost/LightGBM) using features aggregated per window (W=1–4h; H=1–4h).
  - **Sequential** (TCN/GRU) using the normalized hourly sequence.
- **Unsupervised (complementary radar):** PCA + monitoring, sequential autoencoder, change points, and window clustering, used to detect **regime changes** and **anomalies** that precede increased wear.
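
For the tabular route, one possible way to build the supervised table is sketched below: rolling aggregates over the last W hours as features and the label shifted H hours ahead as the target. The function name `make_supervised`, the feature suffixes, and the toy column `air_flow` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def make_supervised(df, feature_cols, target_col, window=3, horizon=2):
    """Rolling-window features at time t paired with the target at t + horizon."""
    out = pd.DataFrame(index=df.index)
    for c in feature_cols:
        roll = df[c].rolling(window, min_periods=window)  # uses rows t-window+1 .. t only
        out[f"{c}_mean_{window}h"] = roll.mean()
        out[f"{c}_std_{window}h"] = roll.std()
    # shift(-horizon) brings the value observed at t + horizon onto row t
    out[f"{target_col}_t_plus_{horizon}h"] = df[target_col].shift(-horizon)
    return out.dropna()

# Toy hourly data: 10 rows, the target is 10x the signal.
toy = pd.DataFrame({
    "air_flow": np.arange(10.0),
    "wr_kg_m2_h": 10.0 * np.arange(10.0),
})
sup = make_supervised(toy, ["air_flow"], "wr_kg_m2_h", window=3, horizon=2)
print(sup.head())
```

Rows without a full window or without a future target are dropped, so the first and last few hours of the series do not appear in the table.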

## Conventions
- CSV reading: `header=[0,1]`.  
- 1st row: **names** | 2nd row: **dimensions**.  
- From **`TAU_DENSA`** (column **152**, 1-indexed) to the end → used **only** to compute Wr/Wm.  
- **Semantic search enabled**: when a variable name is cited, the pipeline tries to locate the matching column (PT/EN) without requiring an exact string match.
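
As a small illustration of the two-row header convention, the sketch below reads an in-memory CSV (invented content, not the real dataset) with `header=[0, 1]`, keeps the second row as a name → dimension dictionary, and flattens the columns to plain names.

```python
import io
import pandas as pd

# Invented sample: row 1 = names, row 2 = dimensions, then data.
csv_text = (
    "timestamp,bed_temp,TAU_DENSA\n"
    "-,degC,degC\n"
    "2024-01-01 00:00,850.0,845.0\n"
    "2024-01-01 01:00,852.5,846.0\n"
)

raw = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Each column label is a (name, dimension) tuple.
dims = {name: dim for name, dim in raw.columns}

flat = raw.copy()
flat.columns = [name for name, _ in raw.columns]  # keep only the names

print(dims)
print(flat)
```

Keeping the dimensions in a separate dictionary lets the unit row be written back when the labeled file is saved.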


In [3]:
# ==============================================================
# (1) IMPORTS & GENERAL CONFIGURATION
# ==============================================================
import os, re, unicodedata, time
import numpy as np
import pandas as pd

# Base path (adjust if needed)
CURATED_DIR = r"C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated"
PATH_IN   = os.path.join(CURATED_DIR, "A1_ML_DL.csv")
PATH_OUTL = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado.csv")
PATH_OUTF = os.path.join(CURATED_DIR, "A1_ML_DL_features.csv")

# Physical constants and reference (normal) conditions
T_N      = 273.15      # K
P_N      = 1.01325e5   # Pa
R        = 8.314462618 # J/mol/K
M_AIR    = 0.029       # kg/mol
M_CO2    = 0.044       # kg/mol

# Engineering parameters (tune per plant)
A_EF     = 15.0        # m²  (effective area)
L_REF    = 3.0         # m   (effective height for ΔP)
BETA_DP  = 0.25        # weak exponent for the ΔP adjustment
GAMMA_DEFAULT = 0.25   # default CO2 fraction when no column is available
C_DELTA  = 1e-3        # gain that brings δ into the ~0.1–0.3 mm range (value in m)

# BASE constants (Wr/Wm)
ALPHA_BASE = 4.2e-3
BETA_BASE  = 1.2e-4
TCRIT_BASE = 1200.0
TOPT_BASE  = 1250.0
SIGMA_MAT_BASE   = 6.0e9
RHO_FUEL_BASE    = 700.0
SIGMA_STEEL_BASE = 2.5e9

# ==============================================================
# (2) HELPER FUNCTIONS: NORMALIZATION & SIMILARITY
# ==============================================================
def strip_accents(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFKD", str(s)) if not unicodedata.combining(c))

def normalize_token(s: str) -> str:
    s = strip_accents(s).lower()
    s = re.sub(r"[^a-z0-9_\s]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def split_tokens(s: str):
    return set(t for t in re.split(r"[ _]+", normalize_token(s)) if t)

from difflib import SequenceMatcher
def hybrid_similarity(a: str, b: str) -> float:
    # blend of Jaccard (token overlap) and sequence ratio (robust to spelling variations)
    ta, tb = split_tokens(a), split_tokens(b)
    jac = len(ta & tb) / len(ta | tb) if (ta and tb) else 0.0
    seq = SequenceMatcher(None, normalize_token(a), normalize_token(b)).ratio()
    return 0.6*jac + 0.4*seq

def best_match(query: str, candidates: list):
    scores = [(c, hybrid_similarity(query, c)) for c in candidates]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[0] if scores else (None, 0.0)

# ==============================================================
# (3) READ THE DATASET (2 HEADER ROWS) & BUILD THE DIMENSION DICTIONARY
# ==============================================================
df_raw = pd.read_csv(PATH_IN, header=[0,1], engine="python")
dim_por_col = {col: dim for (col, dim) in df_raw.columns}  # {name: dimension}
df = df_raw.copy()
df.columns = [col for (col, dim) in df_raw.columns]        # flatten to the names only
colunas = list(df.columns)

# Locate the start of the physical block (TAU_DENSA -> end).
# If the exact name is absent, use the closest fuzzy match; last resort = position 152 (1-indexed).
if "TAU_DENSA" in colunas:
    idx_tau = colunas.index("TAU_DENSA")
else:
    c, s = best_match("tau_densa", colunas)
    idx_tau = colunas.index(c) if c else 151  # 151 (0-indexed) = 152 (1-indexed)
print(f"Início do bloco físico para cálculo: coluna #{idx_tau+1} -> {colunas[idx_tau]} (até o final). Total usadas: {len(colunas[idx_tau:])}")

# ==============================================================
# (4) ROBUST SEMANTIC MAPPING (with hints from the dimension row)
# ==============================================================
def _dim_text(colname: str) -> str:
    return str(dim_por_col.get(colname, "") or "").lower()

# Hints per dimension (2nd header row)
DIM_HINTS = {
    "temp": ["°c"," c","k","°k","temp","leito"],
    "flowN": ["nm3/h","knm3/h","mn3/h","nm3 h","knm3 h"],
    "press": ["pa","kpa","bar","mbar","mpa","press","dp"],
    "co2rho": ["kg/m3","kg/m³","densidade","co2"]
}

# Extended aliases (PT/EN/abbreviations)
ALIASES = {
    "tau_main": ["tau_densa","tau","t_leito","leito_temp_average","bed_temp","temperatura"],
    "temp_leito": ["leito_temp_average","bed_temp","temperatura_leito","fornalha_leito_temp_average"],
    "air_total": ["air_total_knm3_h","air_total_nm3_h","air_total_mn3_h","air_total"],
    "air_primary": ["air_primary_knm3_h","air_primary_nm3_h","air_primary"],
    "air_secondary":["air_secondary_knm3_h","air_secondary_nm3_h","air_secondary"],
    "dp_furn": ["dp_fornalha","delta_p_fornalha","fornalha_dp","dp_furnace"],
    "p_furn_a": ["pressao_fornalha_a","furnace_a_pressure","fornalha_a_press"],
    "p_furn_b": ["pressao_fornalha_b","furnace_b_pressure","fornalha_b_press"],
    "p_abs": ["pressao_fornalha","furnace_pressure","pressao_plenum","pressao_leito","pressao_tambor"],
    "rho_co2": ["densidade_co2","co2_density","rho_co2"],
    "o2": ["o2_medio","o2_excess_pct","oxygen"]
}

def _choose(qs, kind_hint=None):
    # Returns the best-scoring column over all aliases. There is no minimum score,
    # so a weak match is still returned when nothing similar exists (check the printed mapping).
    best, best_s = None, -1.0
    for q in qs:
        c, s = best_match(q, colunas)
        if c:
            # bonus when the dimension row matches the requested kind
            if kind_hint == "temp" and any(h in _dim_text(c) for h in DIM_HINTS["temp"]): s += 0.12
            if kind_hint == "flowN" and any(h in _dim_text(c) for h in DIM_HINTS["flowN"]): s += 0.12
            if kind_hint == "press" and any(h in _dim_text(c) for h in DIM_HINTS["press"]): s += 0.12
            if s > best_s:
                best, best_s = c, s
    return best, best_s

tau_main, _       = _choose(ALIASES["tau_main"], kind_hint="temp")
temp_leito_col, _ = _choose(ALIASES["temp_leito"], kind_hint="temp")
air_total_col,_   = _choose(ALIASES["air_total"], kind_hint="flowN")
air_pri_col,_     = _choose(ALIASES["air_primary"], kind_hint="flowN")
air_sec_col,_     = _choose(ALIASES["air_secondary"], kind_hint="flowN")
dp_furn_col,_     = _choose(ALIASES["dp_furn"], kind_hint="press")
p_furn_a_col,_    = _choose(ALIASES["p_furn_a"], kind_hint="press")
p_furn_b_col,_    = _choose(ALIASES["p_furn_b"], kind_hint="press")
p_abs_col,_       = _choose(ALIASES["p_abs"], kind_hint="press")
rho_co2_col,_     = _choose(ALIASES["rho_co2"], kind_hint=None)
o2_col,_          = _choose(ALIASES["o2"], kind_hint=None)

print("\n[Mapeamento] Colunas encontradas:")
for label, name in [
    ("tau_main", tau_main), ("temp_leito", temp_leito_col),
    ("air_total", air_total_col), ("air_primary", air_pri_col), ("air_secondary", air_sec_col),
    ("dp_furn", dp_furn_col), ("p_furn_a", p_furn_a_col), ("p_furn_b", p_furn_b_col),
    ("p_abs", p_abs_col), ("rho_co2", rho_co2_col), ("o2", o2_col)
]:
    print(f"  - {label:12s}: {name}")

# ==============================================================
# (5) DIMENSIONAL PROXIES + Wr/Wm CALCULATION
# ==============================================================
def _to_pa(x, dimtxt):
    s = (dimtxt or "").lower()
    v = pd.to_numeric(x, errors="coerce")
    if "mpa" in s:   return v * 1e6
    if "kpa" in s:   return v * 1e3
    if "mbar" in s:  return v * 1.0e2   # must be tested before "bar" ("mbar" contains "bar")
    if "bar" in s:   return v * 1.0e5
    return v  # assume Pa

def _tk_from_c_or_k(series, dimtxt, fallback_name=""):
    # extra rule: if the name contains 'tau' and the dimension is unknown, assume °C
    if (("°c" in (dimtxt or "").lower()) or (("c" in (dimtxt or "").lower() and "k" not in (dimtxt or "").lower()))
        or ("tau" in fallback_name.lower() and not dimtxt)):
        return pd.to_numeric(series, errors="coerce") + 273.15
    return pd.to_numeric(series, errors="coerce")

# --- Temperature in Kelvin (TK) ---
if tau_main:
    TK = _tk_from_c_or_k(df[tau_main], _dim_text(tau_main), fallback_name=tau_main)
elif temp_leito_col:
    TK = _tk_from_c_or_k(df[temp_leito_col], _dim_text(temp_leito_col), fallback_name=temp_leito_col)
else:
    raise RuntimeError("Não encontrei coluna de temperatura (TAU/LEITO). Ajuste aliases ou dimensões.")

# --- ΔP and P_abs ---
if dp_furn_col:
    DP = _to_pa(df[dp_furn_col], _dim_text(dp_furn_col))
elif p_furn_a_col and p_furn_b_col:
    Pa = _to_pa(df[p_furn_a_col], _dim_text(p_furn_a_col))
    Pb = _to_pa(df[p_furn_b_col], _dim_text(p_furn_b_col))
    DP = (Pa - Pb).abs()
else:
    DP = pd.Series(np.nan, index=df.index)

if p_abs_col:
    Pabs = _to_pa(df[p_abs_col], _dim_text(p_abs_col))
    # if the value looks too low (apparent gauge pressure), add one atmosphere
    if Pabs.median(skipna=True) < 2e5:
        Pabs = Pabs + P_N
else:
    Pabs = pd.Series(P_N, index=df.index)

# --- Actual flow (m3/s) from normal flow ---
# The factor 1000/3600 below assumes the matched column is in kNm3/h.
QN_total = None
if air_total_col:
    QN_total = pd.to_numeric(df[air_total_col], errors="coerce")
else:
    parts = []
    if air_pri_col: parts.append(pd.to_numeric(df[air_pri_col], errors="coerce"))
    if air_sec_col: parts.append(pd.to_numeric(df[air_sec_col], errors="coerce"))
    if parts:
        QN_total = sum(parts)

if QN_total is not None:
    Q_real = (QN_total * 1000.0 / 3600.0) * (TK / T_N) * (P_N / Pabs)
else:
    Q_real = pd.Series(np.nan, index=df.index)

# --- Velocity ν (m/s) ---
nu_base = Q_real / A_EF
DP_ref  = np.nanmedian(DP) if np.isfinite(DP).any() else 1.0
nu_proxy = nu_base * np.power(np.maximum(DP, 1.0) / max(DP_ref, 1.0), BETA_DP)

# --- CO2 fraction (γ) from CO2 density, when available ---
if rho_co2_col:
    rho_co2_meas = pd.to_numeric(df[rho_co2_col], errors="coerce")
    rho_co2_pure = (Pabs * M_CO2) / (R * TK)
    gamma = np.clip(rho_co2_meas / np.maximum(rho_co2_pure, 1e-9), 0.0, 1.0)
else:
    gamma = pd.Series(GAMMA_DEFAULT, index=df.index)

# --- Gas density ρg (kg/m3) ---
M_mix = (1 - gamma) * M_AIR + gamma * M_CO2
rho_g = (Pabs * M_mix) / (R * TK)

# --- Gas viscosity μ_gas (Pa·s) via Sutherland (air/CO2) and linear mixing ---
def mu_air_suth(TK):
    mu0, T0, S = 1.716e-5, 273.15, 111.0
    return mu0 * ((T0 + S) / (TK + S)) * (TK / T0) ** 1.5

def mu_co2_suth(TK):
    mu0, T0, S = 1.37e-5, 273.15, 240.0  # approximate values
    return mu0 * ((T0 + S) / (TK + S)) * (TK / T0) ** 1.5

mu_gas = (1 - gamma) * mu_air_suth(TK) + gamma * mu_co2_suth(TK)

# --- Diameter δ (m) from a dimensional proxy (Ergun-like) ---
epsP = 1.0  # Pa
delta_proxy = (rho_g * (nu_proxy ** 2) * L_REF) / np.maximum(DP, epsP)   # [m]
delta_m = C_DELTA * delta_proxy  # gain for the ~0.1–0.3 mm range

# --- Scaling factors g and h ---
sigma_mat   = pd.Series(SIGMA_MAT_BASE,   index=df.index)  # no dedicated column → BASE value
rho_fuel    = pd.Series(RHO_FUEL_BASE,    index=df.index)
sigma_steel = pd.Series(SIGMA_STEEL_BASE, index=df.index)

g = sigma_mat / np.maximum(rho_fuel, 1e-12)
h = sigma_steel / np.maximum(mu_gas,  1e-12)

# --- Wr/Wm EQUATIONS ---
nu_pos    = np.maximum(nu_proxy.fillna(0.0), 0.0)
delta_pos = np.maximum(delta_m.fillna(np.nan), 1e-9)
TK_pos    = np.maximum(TK.fillna(np.nan), 1.0)

wr = (ALPHA_BASE * (nu_pos ** 2.0) * (delta_pos ** 1.0) * np.exp(-TK_pos / TCRIT_BASE) * g).astype(float)
wm = (BETA_BASE  * (nu_pos ** 3.0) * np.sqrt(delta_pos)    * np.exp(-TK_pos / TOPT_BASE)  * h).astype(float)

wr_dim = "kg/m2·h"
wm_dim = "kg/m2·h"

# ==============================================================
# (6) SAVE FILES + LEAK-GUARD + EDA SUMMARY
# ==============================================================
# Labeled DataFrame
df_labeled = df.copy()
df_labeled["wr_kg_m2_h"] = wr
df_labeled["wm_kg_m2_h"] = wm

# Update the dimension dictionary
dim_por_col["wr_kg_m2_h"] = wr_dim
dim_por_col["wm_kg_m2_h"] = wm_dim

# -- Leak-guard: every column used for the targets + the physical block (TAU_DENSA..end)
used_cols_for_targets = set()
for name in [tau_main, temp_leito_col, air_total_col, air_pri_col, air_sec_col,
             dp_furn_col, p_furn_a_col, p_furn_b_col, p_abs_col, rho_co2_col, o2_col]:
    if name: used_cols_for_targets.add(name)
used_cols_for_targets.update(colunas[idx_tau:])

print("\n[Leak-guard] Colunas usadas no cálculo Wr/Wm (serão EXCLUÍDAS das features):")
for c in sorted(used_cols_for_targets):
    print("  -", c)

# -- Save A1_ML_DL_rotulado.csv (2 header rows) --
with open(PATH_OUTL, "w", encoding="utf-8-sig", newline="") as f:
    f.write(",".join(df_labeled.columns) + "\n")
    f.write(",".join([str(dim_por_col.get(c, "")) for c in df_labeled.columns]) + "\n")
df_labeled.to_csv(PATH_OUTL, mode="a", index=False, header=False, encoding="utf-8-sig")
print(f"\n✅ Rotulado salvo: {PATH_OUTL}")

# -- Build A1_ML_DL_features.csv (REMOVING the leaky columns) --
orig_cols = [c for c in df.columns if c not in ["wr_kg_m2_h","wm_kg_m2_h"]]
cols_excluir = set(colunas[idx_tau:]).union(used_cols_for_targets)
cols_keep = [c for c in orig_cols if c not in cols_excluir] + ["wr_kg_m2_h","wm_kg_m2_h"]
df_features = df_labeled[cols_keep].copy()

with open(PATH_OUTF, "w", encoding="utf-8-sig", newline="") as f:
    f.write(",".join(df_features.columns) + "\n")
    f.write(",".join([str(dim_por_col.get(c, "")) for c in df_features.columns]) + "\n")
df_features.to_csv(PATH_OUTF, mode="a", index=False, header=False, encoding="utf-8-sig")
print(f"✅ Features (SEM colunas usadas no cálculo) salvo: {PATH_OUTF}")

# -- EDA summary of the labels --
def _eda_rapida(s: pd.Series, nome: str):
    s = pd.to_numeric(s, errors="coerce")
    print(f"\n--- {nome} ---")
    print("N válidos:", s.notna().sum(), "| N nulos:", s.isna().sum())
    desc = s.describe(percentiles=[0.01,0.05,0.25,0.5,0.75,0.95,0.99])
    print(desc.to_string())

_eda_rapida(df_labeled["wr_kg_m2_h"], "wr_kg_m2_h")
_eda_rapida(df_labeled["wm_kg_m2_h"], "wm_kg_m2_h")

print("\nConcluído.")


Início do bloco físico para cálculo: coluna #151 -> tau_densa (até o final). Total usadas: 16

[Mapeamento] Colunas encontradas:
  - tau_main    : leito_temp_average
  - temp_leito  : leito_temp_average
  - air_total   : air_total_knm3_h
  - air_primary : air_primary_knm3_h
  - air_secondary: air_secondary_knm3_h
  - dp_furn     : pressao_fornalha
  - p_furn_a    : pressao_fornalha_a_inf
  - p_furn_b    : pressao_fornalha_b_inf
  - p_abs       : pressao_fornalha
  - rho_co2     : o2_medio
  - o2          : o2_medio

[Leak-guard] Colunas usadas no cálculo Wr/Wm (serão EXCLUÍDAS das features):
  - air_primary_knm3_h
  - air_primary_nm3_h
  - air_primary_press_z
  - air_primary_share_pct
  - air_secondary_knm3_h
  - air_secondary_nm3_h
  - air_secondary_press_z
  - air_total_knm3_h
  - air_total_nm3_h
  - coal_flow_furnace_t_h
  - coal_flow_total_ref_t_h
  - leito_temp_average
  - o2_excess_pct
  - o2_medio
  - pressao_fornalha
  - pressao_fornalha_a_inf
  - pressao_fornalha_b_inf
  - tau
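
The mapping printed above sends `rho_co2` to `o2_medio`, which suggests the matcher accepts its best candidate even when the score is low. One way to guard against that, sketched here with a hypothetical `best_match_or_none` helper and an assumed threshold of 0.5, is to return `None` when no candidate is similar enough. The similarity mirrors the 0.6/0.4 blend used in the cell, but the threshold value is a guess that would need tuning on the real column names.

```python
import re
from difflib import SequenceMatcher

def tokens(s):
    """Lower-case alphanumeric tokens of a column name."""
    return {t for t in re.split(r"[^a-z0-9]+", s.lower()) if t}

def similarity(a, b):
    """0.6 * Jaccard token overlap + 0.4 * character sequence ratio."""
    ta, tb = tokens(a), tokens(b)
    jac = len(ta & tb) / len(ta | tb) if (ta and tb) else 0.0
    seq = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 0.6 * jac + 0.4 * seq

def best_match_or_none(query, candidates, min_score=0.5):
    """Return the best candidate, or None when even the best score is below min_score."""
    scored = [(similarity(query, c), c) for c in candidates]
    if not scored:
        return None
    score, name = max(scored)
    return name if score >= min_score else None

cols = ["o2_medio", "pressao_fornalha", "air_total_knm3_h"]
print(best_match_or_none("rho_co2", cols))    # no shared token with any column
print(best_match_or_none("air_total", cols))  # shares two tokens with air_total_knm3_h
```

With this blend, a candidate that shares no token with the query can score at most 0.4, so any threshold above 0.4 rejects purely character-level coincidences such as `co2` versus `o2`.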

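The conversion from normal flow to actual flow used in the cell above follows the ideal-gas scaling Q = Q_N · (T / T_N) · (P_N / P). The sketch below isolates that step. The function name `normal_to_actual_m3_s` and the sample operating point are illustrative assumptions, and the input is assumed to be in kNm3/h.

```python
T_N = 273.15    # K, normal temperature
P_N = 101325.0  # Pa, normal pressure

def normal_to_actual_m3_s(q_knm3_h, temp_k, p_abs_pa):
    """Convert a normal flow in kNm3/h to an actual volumetric flow in m3/s."""
    q_nm3_s = q_knm3_h * 1000.0 / 3600.0          # kNm3/h -> Nm3/s
    return q_nm3_s * (temp_k / T_N) * (P_N / p_abs_pa)

# At normal conditions the factor is 1, so 3.6 kNm3/h is 1 m3/s.
q_normal = normal_to_actual_m3_s(3.6, T_N, P_N)

# Doubling the absolute temperature at the same pressure doubles the actual flow.
q_hot = normal_to_actual_m3_s(3.6, 2 * T_N, P_N)

# Superficial velocity for an assumed effective area of 15 m2 (A_EF in the notebook).
velocity = q_hot / 15.0
print(q_normal, q_hot, velocity)
```

If the matched column were in Nm3/h instead of kNm3/h, the factor of 1000 would overstate the flow by three orders of magnitude, so the unit row of the header is worth checking before trusting ν.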
In [10]:
# ==============================================================
# (1) IMPORTS & CONFIGURATION
# ==============================================================
import os, re, unicodedata
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

# Paths
CURATED_DIR = r"C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated"
PATH_IN     = os.path.join(CURATED_DIR, "A1_ML_DL.csv")
PATH_OUTL   = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v2.csv")
PATH_OUTF   = os.path.join(CURATED_DIR, "A1_ML_DL_features_v2.csv")

# Physical constants
T_N, P_N = 273.15, 1.01325e5   # K, Pa
R = 8.314462618                # J/mol/K
M_AIR, M_CO2 = 0.029, 0.044    # kg/mol

# Engineering parameters (adjust if the plant values are known)
A_EF   = 15.0     # m²  (effective area)
L_REF  = 3.0      # m   (effective height for ΔP)
BETA_DP = 0.25    # weak exponent for the ΔP adjustment
C_DELTA = 1e-3    # gain that brings δ into the ~0.1–0.3 mm range (value in m)
GAMMA_DEFAULT = 0.25  # default CO2 fraction (when no data is available)

# BASE constants for Wr/Wm
ALPHA_BASE = 4.2e-3
BETA_BASE  = 1.2e-4
TCRIT_BASE = 1200.0
TOPT_BASE  = 1250.0
SIGMA_MAT_BASE   = 6.0e9
RHO_FUEL_BASE    = 700.0
SIGMA_STEEL_BASE = 2.5e9

# ==============================================================
# (2) HELPER FUNCTIONS
# ==============================================================
def strip_accents(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFKD", str(s)) if not unicodedata.combining(c))

def normalize_token(s: str) -> str:
    s = strip_accents(s).lower()
    s = re.sub(r"[^a-z0-9_\s]+", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def split_tokens(s: str):
    return set(t for t in re.split(r"[ _]+", normalize_token(s)) if t)

def hybrid_similarity(a: str, b: str) -> float:
    ta, tb = split_tokens(a), split_tokens(b)
    jac = len(ta & tb) / len(ta | tb) if (ta and tb) else 0.0
    seq = SequenceMatcher(None, normalize_token(a), normalize_token(b)).ratio()
    return 0.6*jac + 0.4*seq

def best_match(query: str, candidates: list):
    scores = [(c, hybrid_similarity(query, c)) for c in candidates]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[0] if scores else (None, 0.0)

def dim_text(colname: str) -> str:
    return str(dim_por_col.get(colname, "") or "").lower()

def to_pa(x, dimtxt):
    s = (dimtxt or "").lower()
    v = pd.to_numeric(x, errors="coerce")
    if "mpa" in s:   return v * 1e6
    if "kpa" in s:   return v * 1e3
    if "mbar" in s:  return v * 1.0e2   # must be tested before "bar" ("mbar" contains "bar")
    if "bar" in s:   return v * 1.0e5
    return v  # assume Pa

def TK_from_series(series, dimtxt, fallback_name=""):
    # If the unit is °C (or the name contains 'tau' and the dimension is missing), convert to K
    is_c = ("°c" in (dimtxt or "").lower()) or (("c" in (dimtxt or "").lower() and "k" not in (dimtxt or "").lower()))
    if is_c or ("tau" in fallback_name.lower() and not dimtxt):
        return pd.to_numeric(series, errors="coerce") + 273.15
    return pd.to_numeric(series, errors="coerce")

def mu_air_suth(TK):
    mu0, T0, S = 1.716e-5, 273.15, 111.0
    return mu0 * ((T0 + S) / (TK + S)) * (TK / T0) ** 1.5

def mu_co2_suth(TK):
    mu0, T0, S = 1.37e-5, 273.15, 240.0  # approximate values for CO2
    return mu0 * ((T0 + S) / (TK + S)) * (TK / T0) ** 1.5

def fill_blocks_avg(series: pd.Series) -> pd.Series:
    """Fill each run of NaNs with the mean of its two valid neighbours; runs at the edges stay NaN."""
    vals = series.values.astype(float)
    n = len(vals)
    i = 0
    while i < n:
        if not np.isnan(vals[i]):
            i += 1
            continue
        start = i
        while i < n and np.isnan(vals[i]):
            i += 1
        end = i - 1
        # nearest valid neighbours on each side of the NaN run
        prev_idx = start - 1
        while prev_idx >= 0 and np.isnan(vals[prev_idx]):
            prev_idx -= 1
        next_idx = end + 1
        while next_idx < n and np.isnan(vals[next_idx]):
            next_idx += 1
        if prev_idx >= 0 and next_idx < n and np.isfinite(vals[prev_idx]) and np.isfinite(vals[next_idx]):
            vals[start:end+1] = 0.5 * (vals[prev_idx] + vals[next_idx])
        # runs touching either edge stay NaN
    return pd.Series(vals, index=series.index)

# ==============================================================
# (3) READ THE CSV (2 HEADER ROWS) AND PREPARE
# ==============================================================
df_raw = pd.read_csv(PATH_IN, header=[0,1], engine="python")
dim_por_col = {col: dim for (col, dim) in df_raw.columns}
df = df_raw.copy()
df.columns = [col for (col, dim) in df_raw.columns]
colunas = list(df.columns)

# Locate the start of the physical block (TAU_DENSA → end). If absent, fall back to the 152nd column (1-indexed)
if "TAU_DENSA" in colunas:
    idx_tau = colunas.index("TAU_DENSA")
else:
    c, _ = best_match("tau_densa", colunas)
    idx_tau = colunas.index(c) if c else 151
print(f"Início do bloco físico para cálculo: coluna #{idx_tau+1} -> {colunas[idx_tau]} (até o final). Total usadas: {len(colunas[idx_tau:])}")

# ==============================================================
# (4) SEMANTIC MAPPING OF THE KEY COLUMNS
# ==============================================================
ALIASES = {
    "tau_main":     ["tau_densa","tau","t_leito","leito_temp_average","bed_temp","temperatura"],
    "temp_leito":   ["leito_temp_average","bed_temp","temperatura_leito","fornalha_leito_temp_average"],
    "air_total":    ["air_total_knm3_h","air_total_nm3_h","air_total_mn3_h","air_total"],
    "air_primary":  ["air_primary_knm3_h","air_primary_nm3_h","air_primary"],
    "air_secondary":["air_secondary_knm3_h","air_secondary_nm3_h","air_secondary"],
    "dp_furn":      ["dp_fornalha","delta_p_fornalha","fornalha_dp","dp_furnace"],
    "p_furn_a":     ["pressao_fornalha_a_inf","pressao_fornalha_a","furnace_a_pressure","fornalha_a_press"],
    "p_furn_b":     ["pressao_fornalha_b_inf","pressao_fornalha_b","furnace_b_pressure","fornalha_b_press"],
    "p_abs":        ["pressao_fornalha","furnace_pressure","pressao_plenum","pressao_leito","pressao_tambor"],
    "rho_co2":      ["densidade_co2","co2_density","rho_co2"],  # must NOT map to an O2 column
    "o2":           ["o2_medio","o2_excess_pct","oxygen"]
}
def choose(qs):
    best, best_s = None, -1.0
    for q in qs:
        c, s = best_match(q, colunas)
        if s > best_s:
            best, best_s = c, s
    return best

tau_main       = choose(ALIASES["tau_main"])
temp_leito_col = choose(ALIASES["temp_leito"])
air_total_col  = choose(ALIASES["air_total"])
air_pri_col    = choose(ALIASES["air_primary"])
air_sec_col    = choose(ALIASES["air_secondary"])
dp_furn_col    = choose(ALIASES["dp_furn"])
p_furn_a_col   = choose(ALIASES["p_furn_a"])
p_furn_b_col   = choose(ALIASES["p_furn_b"])
p_abs_col      = choose(ALIASES["p_abs"])
rho_co2_col    = choose(ALIASES["rho_co2"])
o2_col         = choose(ALIASES["o2"])

print("\n[Mapeamento] Colunas encontradas:")
for label, name in [
    ("tau_main", tau_main), ("temp_leito", temp_leito_col),
    ("air_total", air_total_col), ("air_primary", air_pri_col), ("air_secondary", air_sec_col),
    ("dp_furn", dp_furn_col), ("p_furn_a", p_furn_a_col), ("p_furn_b", p_furn_b_col),
    ("p_abs", p_abs_col), ("rho_co2", rho_co2_col), ("o2", o2_col)
]:
    print(f"  - {label:12s}: {name}")

# ==============================================================
# (5) TEMPERATURE (K), PRESSURES (Pa) AND ROBUST ΔP
# ==============================================================
# Temperature (TK)
if tau_main:
    TK = TK_from_series(df[tau_main], dim_text(tau_main), fallback_name=tau_main)
elif temp_leito_col:
    TK = TK_from_series(df[temp_leito_col], dim_text(temp_leito_col), fallback_name=temp_leito_col)
else:
    raise RuntimeError("Não encontrei coluna de temperatura (TAU/LEITO).")

# Absolute pressure (Pabs)
if p_abs_col:
    Pabs = to_pa(df[p_abs_col], dim_text(p_abs_col))
    # if it looks like gauge pressure, add one atmosphere
    if Pabs.median(skipna=True) < 2e5:
        Pabs = Pabs + P_N
else:
    Pabs = pd.Series(P_N, index=df.index)

# ΔP (prefer dp_furn when it really is a ΔP; otherwise use |A−B|)
DP_from_col = to_pa(df[dp_furn_col], dim_text(dp_furn_col)) if dp_furn_col else pd.Series(np.nan, index=df.index)
DP_from_ab  = None
if p_furn_a_col and p_furn_b_col:
    Pa = to_pa(df[p_furn_a_col], dim_text(p_furn_a_col))
    Pb = to_pa(df[p_furn_b_col], dim_text(p_furn_b_col))
    DP_from_ab = (Pa - Pb).abs()

def looks_like_absolute(dp_name: str) -> bool:
    if not dp_name: return False
    n = dp_name.lower()
    return ("delta" not in n) and ("dp" not in n)

use_ab_all = False
if dp_furn_col and p_abs_col and (dp_furn_col == p_abs_col):
    use_ab_all = True
elif dp_furn_col and looks_like_absolute(dp_furn_col) and DP_from_ab is not None:
    use_ab_all = True

if use_ab_all and DP_from_ab is not None:
    DP = DP_from_ab.copy()
else:
    DP = DP_from_col.copy()
    if DP_from_ab is not None:
        need = ~np.isfinite(DP) & np.isfinite(DP_from_ab)
        DP.loc[need] = DP_from_ab.loc[need]

# ==============================================================
# (6) TOTAL AIR FLOW (kNm3/h): GAP FILLING AND PERSISTENCE
# ==============================================================
QN_total_total = pd.to_numeric(df[air_total_col], errors="coerce") if air_total_col else pd.Series(np.nan, index=df.index)
QN_primary     = pd.to_numeric(df[air_pri_col],   errors="coerce") if air_pri_col   else pd.Series(np.nan, index=df.index)
QN_secondary   = pd.to_numeric(df[air_sec_col],   errors="coerce") if air_sec_col   else pd.Series(np.nan, index=df.index)

QN_sum_ps = QN_primary.add(QN_secondary, fill_value=np.nan)

QN0 = QN_total_total.copy()
mask_fill_sum = QN0.isna() & QN_sum_ps.notna()
QN0.loc[mask_fill_sum] = QN_sum_ps.loc[mask_fill_sum]

QN_total_filled = fill_blocks_avg(QN0)

print(f"\n[QN_total] NaNs (total): {int(QN_total_total.isna().sum())} | "
      f"após prim+sec: {int(QN0.isna().sum())} | após blocos: {int(QN_total_filled.isna().sum())}")

# Persist for auditing and diagnostics
df["air_total_knm3_h_filled"] = QN_total_filled
dim_por_col["air_total_knm3_h_filled"] = "kNm3/h (preenchido)"

# ==============================================================
# (7) PROXIES: Q_real, ν, γ, ρg, μgas, δ
# ==============================================================
# Q_real (m3/s) at operating conditions
Q_real = (QN_total_filled * 1000.0 / 3600.0) * (TK / T_N) * (P_N / Pabs)

# Velocity ν (m/s)
nu_base = Q_real / A_EF
DP_ref  = np.nanmedian(DP) if np.isfinite(DP).any() else 1.0
nu_proxy = nu_base * np.power(np.maximum(DP, 1.0) / max(DP_ref, 1.0), BETA_DP)

# CO2 fraction (γ): from CO2 density when available; otherwise the default
if rho_co2_col:
    rho_co2_meas = pd.to_numeric(df[rho_co2_col], errors="coerce")
    rho_co2_pure = (Pabs * M_CO2) / (R * TK)
    gamma = np.clip(rho_co2_meas / np.maximum(rho_co2_pure, 1e-9), 0.0, 1.0)
else:
    gamma = pd.Series(GAMMA_DEFAULT, index=df.index)

# Gas density (kg/m3)
M_mix = (1 - gamma) * M_AIR + gamma * M_CO2
rho_g = (Pabs * M_mix) / (R * TK)

# Gas viscosity (Pa·s): air/CO2 mixture via Sutherland
mu_gas = (1 - gamma) * mu_air_suth(TK) + gamma * mu_co2_suth(TK)

# Diameter δ (m): Ergun-like proxy scaled by the C_DELTA gain
epsP = 1.0  # Pa
delta_proxy = (rho_g * (nu_proxy ** 2) * L_REF) / np.maximum(DP, epsP)
delta_m = C_DELTA * delta_proxy

# ==================== PATCH TO FIX WM (paste after section 7) ====================
print("\n[FIX-WM] Verificando mapeamento de rho_co2...")

# 1) If rho_co2 landed on the same column as o2 (e.g. o2_medio), drop that mapping (use the default)
if 'rho_co2_col' in globals() and 'o2_col' in globals() and (rho_co2_col is not None) and (o2_col is not None):
    if rho_co2_col == o2_col:
        print("  - rho_co2 estava mapeado para O2. Ignorando e usando GAMMA_DEFAULT.")
        rho_co2_col = None

# 2) Recompute gamma robustly (never derived from an O2 column)
if rho_co2_col:
    rho_co2_meas = pd.to_numeric(df[rho_co2_col], errors="coerce")
    rho_co2_pure = (Pabs * M_CO2) / (R * TK)
    gamma = np.clip(rho_co2_meas / np.maximum(rho_co2_pure, 1e-9), 0.0, 1.0)
else:
    gamma = pd.Series(GAMMA_DEFAULT, index=df.index)

gamma = gamma.fillna(GAMMA_DEFAULT).clip(0.0, 1.0)

# 3) Recompute μ_gas (air/CO2 mixture) with the new gamma
mu_air = mu_air_suth(TK)
mu_co2 = mu_co2_suth(TK)
mu_gas = (1 - gamma) * mu_air + gamma * mu_co2
mu_gas = pd.to_numeric(mu_gas, errors="coerce").fillna(mu_air_suth(TK))  # fallback: pure air

# 4) Recompute h and Wm. Wr does not depend on μ_gas, but it is computed here as well
#    so that this cell does not rely on a `wr` left in memory by an earlier cell.
sigma_steel = pd.Series(SIGMA_STEEL_BASE, index=df.index)  # can be replaced by a column if one exists
h = sigma_steel / np.maximum(mu_gas, 1e-12)

nu_pos    = np.maximum(nu_proxy.fillna(0.0), 0.0)
delta_pos = np.maximum(delta_m.fillna(np.nan), 1e-9)
TK_pos    = np.maximum(TK.fillna(np.nan), 1.0)

sigma_mat = pd.Series(SIGMA_MAT_BASE, index=df.index)
rho_fuel  = pd.Series(RHO_FUEL_BASE,  index=df.index)
g = sigma_mat / np.maximum(rho_fuel, 1e-12)

wr = (ALPHA_BASE * (nu_pos ** 2.0) * (delta_pos ** 1.0) * np.exp(-TK_pos / TCRIT_BASE) * g).astype(float)
wm = (BETA_BASE * (nu_pos ** 3.0) * np.sqrt(delta_pos) * np.exp(-TK_pos / TOPT_BASE) * h).astype(float)

# 5) Update df_labeled and save v3 (keeping the 2 header rows)
df_labeled = df.copy()
df_labeled["wr_kg_m2_h"] = wr  # Wr from the current calculation
df_labeled["wm_kg_m2_h"] = wm  # corrected Wm

dim_por_col["wr_kg_m2_h"] = "kg/m2·h"
dim_por_col["wm_kg_m2_h"] = "kg/m2·h"

PATH_OUTL3 = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v3.csv")
with open(PATH_OUTL3, "w", encoding="utf-8-sig", newline="") as f:
    f.write(",".join(df_labeled.columns) + "\n")
    f.write(",".join([str(dim_por_col.get(c, "")) for c in df_labeled.columns]) + "\n")
df_labeled.to_csv(PATH_OUTL3, mode="a", index=False, header=False, encoding="utf-8-sig")
print(f"✅ Rotulado (Wm corrigido) salvo: {PATH_OUTL3}")

# 6) Rebuild FEATURES v3, excluding everything used in the calculation (includes 'air_total_knm3_h_filled' and TAU_DENSA→end)
if 'used_cols_for_targets' not in globals():
    used_cols_for_targets = set()
used_cols_for_targets.update([tau_main, temp_leito_col, air_total_col, air_pri_col, air_sec_col,
                              dp_furn_col, p_furn_a_col, p_furn_b_col, p_abs_col,
                              rho_co2_col, o2_col, "air_total_knm3_h_filled"])
used_cols_for_targets = {c for c in used_cols_for_targets if c}
used_cols_for_targets.update(colunas[idx_tau:])

orig_cols = [c for c in df.columns if c not in ["wr_kg_m2_h","wm_kg_m2_h"]]
cols_excluir = set(colunas[idx_tau:]).union(used_cols_for_targets)
cols_keep = [c for c in orig_cols if c not in cols_excluir] + ["wr_kg_m2_h","wm_kg_m2_h"]

df_features = df_labeled[cols_keep].copy()
PATH_OUTF3 = os.path.join(CURATED_DIR, "A1_ML_DL_features_v3.csv")
with open(PATH_OUTF3, "w", encoding="utf-8-sig", newline="") as f:
    f.write(",".join(df_features.columns) + "\n")
    f.write(",".join([str(dim_por_col.get(c, "")) for c in df_features.columns]) + "\n")
df_features.to_csv(PATH_OUTF3, mode="a", index=False, header=False, encoding="utf-8-sig")
print(f"✅ Features (v3, sem vazamento) salvo: {PATH_OUTF3}")

# 7) Quick EDA after the fix
def _eda_rapida(s: pd.Series, nome: str):
    s = pd.to_numeric(s, errors="coerce")
    print(f"\n--- {nome} ---")
    print("N válidos:", s.notna().sum(), "| N nulos:", s.isna().sum())
    print(s.describe(percentiles=[0.01,0.05,0.25,0.5,0.75,0.95,0.99]).to_string())

_eda_rapida(df_labeled["wr_kg_m2_h"], "wr_kg_m2_h (v3)")
_eda_rapida(df_labeled["wm_kg_m2_h"], "wm_kg_m2_h (v3)")
# ==================== END OF PATCH ====================
# ==================== FINAL WM FIX (paste after section 7) ====================

print("\n[FIX-WM v4] Robustecendo μ_gas e Wm...")

# (1) Guard: if rho_co2 mapped to a name that refers to O2/oxygen, invalidate it (use GAMMA_DEFAULT)
if 'rho_co2_col' in globals() and rho_co2_col is not None:
    name_low = str(rho_co2_col).lower()
    # strip 'co2' before testing: every legitimate CO2 name contains the substring 'o2'
    if ('o2' in name_low.replace('co2', '')) or ('oxygen' in name_low):
        print(f"  - rho_co2='{rho_co2_col}' parece O2; ignorando e usando GAMMA_DEFAULT.")
        rho_co2_col = None

# (2) Robust gamma (O2 is not used). If a CO2 density column exists, use the ideal-gas ratio; otherwise a constant.
if rho_co2_col:
    rho_co2_meas = pd.to_numeric(df[rho_co2_col], errors="coerce")
    rho_co2_pure = (Pabs * M_CO2) / (R * TK)                 # NaN wherever TK is NaN
    gamma = np.clip(rho_co2_meas / np.maximum(rho_co2_pure, 1e-9), 0.0, 1.0)
else:
    gamma = pd.Series(GAMMA_DEFAULT, index=df.index)

gamma = gamma.replace([np.inf, -np.inf], np.nan).fillna(GAMMA_DEFAULT).clip(0.0, 1.0)

# (3) TK for viscosity: where TK is not finite, use 1100 K (typical bed range)
if isinstance(TK, pd.Series):
    TK_mu = TK.where(np.isfinite(TK), 1100.0)
else:
    # scalar TK: no .copy() needed (a plain float has no such method)
    TK_mu = TK if np.isfinite(TK) else 1100.0

# (4) μ_gas pela mistura ar/CO2 com TK robusto (garantir finito)
mu_air = mu_air_suth(TK_mu)
mu_co2 = mu_co2_suth(TK_mu)
mu_gas = (1 - gamma) * mu_air + gamma * mu_co2
# absolute safety fallback: any remaining non-finite value gets μ_air at 1100 K
mu_gas = pd.to_numeric(mu_gas, errors="coerce")
mu_gas = mu_gas.replace([np.inf, -np.inf], np.nan).fillna(mu_air_suth(1100.0))  # scalar fill: no index alignment involved

# (5) Recompute h and Wm with the robust μ_gas (Wr is left unchanged)
sigma_steel = pd.Series(SIGMA_STEEL_BASE, index=df.index)  # use a column if one exists
h = sigma_steel / np.maximum(mu_gas, 1e-12)

nu_pos    = np.maximum(nu_proxy.fillna(0.0), 0.0)
delta_pos = np.maximum(delta_m, 1e-9)   # NaN is kept, so the label stays NaN for that row
TK_pos    = np.maximum(TK, 1.0)         # NaN is kept as well

wm = (BETA_BASE * (nu_pos ** 3.0) * np.sqrt(delta_pos) * np.exp(-TK_pos / TOPT_BASE) * h).astype(float)

print(f"  - μ_gas NaNs após fix: {int(pd.isna(mu_gas).sum())}")

# (6) Atualizar rotulado e features — V4
df_labeled_v4 = df.copy()
df_labeled_v4["wr_kg_m2_h"] = wr              # WR do cálculo vigente
df_labeled_v4["wm_kg_m2_h"] = wm              # WM corrigido e robusto

dim_por_col["wr_kg_m2_h"] = "kg/m2·h"
dim_por_col["wm_kg_m2_h"] = "kg/m2·h"

PATH_OUTL4 = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v4.csv")
with open(PATH_OUTL4, "w", encoding="utf-8-sig", newline="") as f:
    f.write(",".join(df_labeled_v4.columns) + "\n")
    f.write(",".join([str(dim_por_col.get(c, "")) for c in df_labeled_v4.columns]) + "\n")
df_labeled_v4.to_csv(PATH_OUTL4, mode="a", index=False, header=False, encoding="utf-8-sig")
print(f"✅ Rotulado (v4) salvo: {PATH_OUTL4}")

# Reutiliza a mesma lista de exclusões (used_cols_for_targets + TAU_DENSA→fim)
if 'used_cols_for_targets' not in globals():
    used_cols_for_targets = set()
used_cols_for_targets.update([
    tau_main, temp_leito_col, air_total_col, air_pri_col, air_sec_col,
    dp_furn_col, p_furn_a_col, p_furn_b_col, p_abs_col, rho_co2_col,
    o2_col, "air_total_knm3_h_filled"
])
used_cols_for_targets = {c for c in used_cols_for_targets if c}
used_cols_for_targets.update(colunas[idx_tau:])

orig_cols = [c for c in df.columns if c not in ["wr_kg_m2_h","wm_kg_m2_h"]]
cols_excluir = set(colunas[idx_tau:]).union(used_cols_for_targets)
cols_keep = [c for c in orig_cols if c not in cols_excluir] + ["wr_kg_m2_h","wm_kg_m2_h"]

df_features_v4 = df_labeled_v4[cols_keep].copy()
PATH_OUTF4 = os.path.join(CURATED_DIR, "A1_ML_DL_features_v4.csv")
with open(PATH_OUTF4, "w", encoding="utf-8-sig", newline="") as f:
    f.write(",".join(df_features_v4.columns) + "\n")
    f.write(",".join([str(dim_por_col.get(c, "")) for c in df_features_v4.columns]) + "\n")
df_features_v4.to_csv(PATH_OUTF4, mode="a", index=False, header=False, encoding="utf-8-sig")
print(f"✅ Features (v4, sem vazamento) salvo: {PATH_OUTF4}")

# (7) EDA rápida pós-fix
def _eda(s: pd.Series, nome: str):
    s = pd.to_numeric(s, errors="coerce")
    print(f"\n--- {nome} ---")
    print("N válidos:", s.notna().sum(), "| N nulos:", s.isna().sum())
    print(s.describe(percentiles=[0.01,0.05,0.25,0.5,0.75,0.95,0.99]).to_string())

_eda(df_labeled_v4["wr_kg_m2_h"], "wr_kg_m2_h (v4)")
_eda(df_labeled_v4["wm_kg_m2_h"], "wm_kg_m2_h (v4)")
# ==================== FIM DO FIX v4 ====================

# ==============================================================
# (8) Wr / Wm AND THE g/h FACTORS
# ==============================================================
sigma_mat   = pd.Series(SIGMA_MAT_BASE,   index=df.index)
rho_fuel    = pd.Series(RHO_FUEL_BASE,    index=df.index)
sigma_steel = pd.Series(SIGMA_STEEL_BASE, index=df.index)

g = sigma_mat / np.maximum(rho_fuel, 1e-12)
h = sigma_steel / np.maximum(mu_gas,  1e-12)

nu_pos    = np.maximum(nu_proxy.fillna(0.0), 0.0)
delta_pos = np.maximum(delta_m.fillna(np.nan), 1e-9)
TK_pos    = np.maximum(TK.fillna(np.nan), 1.0)

wr = (ALPHA_BASE * (nu_pos ** 2.0) * (delta_pos ** 1.0) * np.exp(-TK_pos / TCRIT_BASE) * g).astype(float)
wm = (BETA_BASE  * (nu_pos ** 3.0) * np.sqrt(delta_pos)    * np.exp(-TK_pos / TOPT_BASE)  * h).astype(float)

wr_dim = "kg/m2·h"
wm_dim = "kg/m2·h"

# ==============================================================
# (9) SAVE LABELED + FEATURES (no leakage)
# ==============================================================
df_labeled = df.copy()
df_labeled["wr_kg_m2_h"] = wr
df_labeled["wm_kg_m2_h"] = wm

dim_por_col["wr_kg_m2_h"] = wr_dim
dim_por_col["wm_kg_m2_h"] = wm_dim

# Leak-guard: tudo que foi usado no cálculo + bloco TAU_DENSA→fim
used_cols_for_targets = set([
    tau_main, temp_leito_col, air_total_col, air_pri_col, air_sec_col,
    dp_furn_col, p_furn_a_col, p_furn_b_col, p_abs_col, rho_co2_col, o2_col,
    "air_total_knm3_h_filled"
])
used_cols_for_targets = {c for c in used_cols_for_targets if c}  # remove None
used_cols_for_targets.update(colunas[idx_tau:])

print("\n[Leak-guard] Colunas a excluir das features:")
for c in sorted(used_cols_for_targets):
    print("  -", c)

# (9.1) Salvar ROTULADO (2 linhas de cabeçalho)
with open(PATH_OUTL, "w", encoding="utf-8-sig", newline="") as f:
    f.write(",".join(df_labeled.columns) + "\n")
    f.write(",".join([str(dim_por_col.get(c, "")) for c in df_labeled.columns]) + "\n")
df_labeled.to_csv(PATH_OUTL, mode="a", index=False, header=False, encoding="utf-8-sig")
print(f"\n✅ Rotulado salvo: {PATH_OUTL}")

# (9.2) Montar FEATURES (removendo tudo que contribui para Wr/Wm)
orig_cols = [c for c in df.columns if c not in ["wr_kg_m2_h","wm_kg_m2_h"]]
cols_excluir = set(colunas[idx_tau:]).union(used_cols_for_targets)
cols_keep = [c for c in orig_cols if c not in cols_excluir] + ["wr_kg_m2_h","wm_kg_m2_h"]
df_features = df_labeled[cols_keep].copy()

with open(PATH_OUTF, "w", encoding="utf-8-sig", newline="") as f:
    f.write(",".join(df_features.columns) + "\n")
    f.write(",".join([str(dim_por_col.get(c, "")) for c in df_features.columns]) + "\n")
df_features.to_csv(PATH_OUTF, mode="a", index=False, header=False, encoding="utf-8-sig")
print(f"✅ Features (sem vazamento) salvo: {PATH_OUTF}")

# ==============================================================
# (10) EDA SUMMARY OF THE TARGETS
# ==============================================================
def eda_quick(s: pd.Series, nome: str):
    s = pd.to_numeric(s, errors="coerce")
    print(f"\n--- {nome} ---")
    print("N válidos:", s.notna().sum(), "| N nulos:", s.isna().sum())
    print(s.describe(percentiles=[0.01,0.05,0.25,0.5,0.75,0.95,0.99]).to_string())

eda_quick(df_labeled["wr_kg_m2_h"], "wr_kg_m2_h")
eda_quick(df_labeled["wm_kg_m2_h"], "wm_kg_m2_h")

print("\nConcluído.")


Início do bloco físico para cálculo: coluna #151 -> tau_densa (até o final). Total usadas: 16

[Mapeamento] Colunas encontradas:
  - tau_main    : tau_densa
  - temp_leito  : leito_temp_average
  - air_total   : air_total_knm3_h
  - air_primary : air_primary_knm3_h
  - air_secondary: air_secondary_knm3_h
  - dp_furn     : pressao_fornalha
  - p_furn_a    : pressao_fornalha_a_inf
  - p_furn_b    : pressao_fornalha_b_inf
  - p_abs       : pressao_fornalha
  - rho_co2     : o2_medio
  - o2          : o2_medio

[QN_total] NaNs (total): 4284 | após prim+sec: 4284 | após blocos: 1745

[FIX-WM] Verificando mapeamento de rho_co2...
  - rho_co2 estava mapeado para O2. Ignorando e usando GAMMA_DEFAULT.
✅ Rotulado (Wm corrigido) salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_rotulado_v3.csv
✅ Features (v3, sem vazamento) salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_features_v3.csv

--- wr_kg_m2_h (v3) ---
N 
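
The mapping log above shows `rho_co2` resolving to `o2_medio`: `best_match` returns the top-scoring column even when the score is very low. One way to avoid this kind of spurious mapping is to reject matches below a minimum score. The sketch below reuses the notebook's 0.6·Jaccard + 0.4·SequenceMatcher score; the function name `best_match_min` and the threshold `min_score=0.5` are assumptions made for illustration, not part of the original pipeline.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize_token(s):
    """Lowercase, strip accents, and collapse other symbols to single spaces."""
    s = "".join(c for c in unicodedata.normalize("NFKD", str(s))
                if not unicodedata.combining(c)).lower()
    s = re.sub(r"[^a-z0-9_\s]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()

def hybrid_similarity(a, b):
    """Same weighting as the notebook: 0.6 * token Jaccard + 0.4 * sequence ratio."""
    ta = {t for t in re.split(r"[ _]+", normalize_token(a)) if t}
    tb = {t for t in re.split(r"[ _]+", normalize_token(b)) if t}
    jac = len(ta & tb) / len(ta | tb) if (ta and tb) else 0.0
    seq = SequenceMatcher(None, normalize_token(a), normalize_token(b)).ratio()
    return 0.6 * jac + 0.4 * seq

def best_match_min(query, candidates, min_score=0.5):
    """Hypothetical variant of best_match: returns (None, score) below min_score."""
    scored = sorted(((c, hybrid_similarity(query, c)) for c in candidates),
                    key=lambda x: x[1], reverse=True)
    if not scored:
        return None, 0.0
    if scored[0][1] < min_score:
        return None, scored[0][1]
    return scored[0]

cols = ["tau_densa", "o2_medio", "pressao_fornalha"]
print(best_match_min("rho_co2", cols))     # weak match: column is reported as None
print(best_match_min("TAU_DENSA", cols))   # case-insensitive exact match is kept
```

With a threshold in place, a missing CO2 density column would fall back to `GAMMA_DEFAULT` directly, and the later check for an `'o2'`/`'oxygen'` name would become a second line of defense rather than the only one. The threshold value would need tuning against the real column names.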

In [1]:
# ==============================================================
# GOLD & SILVER a partir de A1_ML_DL_rotulado_v4.csv
# ==============================================================

import os, re, unicodedata
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

# ---------- caminhos ----------
CURATED_DIR = r"C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated"
PATH_ROT_V4 = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v4.csv")

PATH_GOLD_L = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v4_gold.csv")
PATH_GOLD_F = os.path.join(CURATED_DIR, "A1_ML_DL_features_v4_gold.csv")

PATH_SILV_L = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v4_silver.csv")
PATH_SILV_F = os.path.join(CURATED_DIR, "A1_ML_DL_features_v4_silver.csv")

# ---------- constantes físicas e parâmetros ----------
T_N, P_N = 273.15, 1.01325e5   # K, Pa
R = 8.314462618
M_AIR, M_CO2 = 0.029, 0.044

A_EF   = 15.0     # m²
L_REF  = 3.0      # m
BETA_DP = 0.25
C_DELTA = 1e-3
GAMMA_DEFAULT = 0.25

ALPHA_BASE = 4.2e-3
BETA_BASE  = 1.2e-4
TCRIT_BASE = 1200.0
TOPT_BASE  = 1250.0
SIGMA_MAT_BASE   = 6.0e9
RHO_FUEL_BASE    = 700.0
SIGMA_STEEL_BASE = 2.5e9

# ---------- utilidades ----------
def strip_accents(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFKD", str(s)) if not unicodedata.combining(c))
def normalize_token(s: str) -> str:
    s = strip_accents(s).lower()
    s = re.sub(r"[^a-z0-9_\s]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()
def split_tokens(s: str): return set(t for t in re.split(r"[ _]+", normalize_token(s)) if t)
def hybrid_similarity(a: str, b: str) -> float:
    ta, tb = split_tokens(a), split_tokens(b)
    jac = len(ta & tb) / len(ta | tb) if (ta and tb) else 0.0
    seq = SequenceMatcher(None, normalize_token(a), normalize_token(b)).ratio()
    return 0.6*jac + 0.4*seq
def best_match(q, cands):
    sc = [(c, hybrid_similarity(q, c)) for c in cands]
    sc.sort(key=lambda x: x[1], reverse=True)
    return sc[0] if sc else (None, 0.0)

def read_multiheader_csv(path):
    dfr = pd.read_csv(path, header=[0,1], engine="python")
    dim = {c: d for (c,d) in dfr.columns}
    df  = dfr.copy()
    df.columns = [c for (c,d) in dfr.columns]
    return df, dim

def write_multiheader_csv(df, dim, path):
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(",".join(df.columns) + "\n")
        f.write(",".join([str(dim.get(c,"")) for c in df.columns]) + "\n")
    df.to_csv(path, mode="a", index=False, header=False, encoding="utf-8-sig")

def dim_text(dim_map, col): return str(dim_map.get(col,"") or "").lower()
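def to_pa(x, dimtxt):
    """Convert a pressure series to Pa according to the unit text in the dimension row."""
    s = (dimtxt or "").lower()
    v = pd.to_numeric(x, errors="coerce")
    if "mpa" in s:   return v * 1e6
    if "kpa" in s:   return v * 1e3
    if "mbar" in s:  return v * 1e2   # must come before "bar": "mbar" contains "bar"
    if "bar" in s:   return v * 1e5
    return v                          # no known unit tag: assumed to be Pa already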

def TK_from_series(series, dimtxt, fallback_name=""):
    """Return temperature in kelvin; Celsius-tagged (or untagged 'tau') columns get +273.15."""
    d = (dimtxt or "").lower()
    is_c = ("°c" in d) or ("c" in d and "k" not in d)
    v = pd.to_numeric(series, errors="coerce")
    if is_c or ("tau" in str(fallback_name).lower() and not dimtxt):
        return v + 273.15
    return v

def mu_air_suth(TK):
    mu0, T0, S = 1.716e-5, 273.15, 111.0
    return mu0 * ((T0 + S) / (TK + S)) * (TK / T0) ** 1.5
def mu_co2_suth(TK):
    mu0, T0, S = 1.37e-5, 273.15, 240.0
    return mu0 * ((T0 + S) / (TK + S)) * (TK / T0) ** 1.5

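def fill_blocks_avg(series):
    """Fill each NaN block with the mean of the two finite values bordering it.
       Blocks that touch the start or end of the series are left as NaN."""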
    vals = series.values.astype(float)
    n = len(vals); i = 0
    while i < n:
        if not np.isnan(vals[i]): i += 1; continue
        s = i
        while i < n and np.isnan(vals[i]): i += 1
        e = i - 1
        prev_i = s - 1
        while prev_i >= 0 and np.isnan(vals[prev_i]): prev_i -= 1
        next_i = e + 1
        while next_i < n and np.isnan(vals[next_i]): next_i += 1
        if prev_i >= 0 and next_i < n and np.isfinite(vals[prev_i]) and np.isfinite(vals[next_i]):
            vals[s:e+1] = 0.5*(vals[prev_i] + vals[next_i])
    return pd.Series(vals, index=series.index)

# ---------- 1) Ler o rotulado v4 ----------
df, dim_map = read_multiheader_csv(PATH_ROT_V4)
colunas = list(df.columns)

# localizar TAU_DENSA para o bloco físico
if "TAU_DENSA" in colunas:
    idx_tau = colunas.index("TAU_DENSA")
else:
    c,_ = best_match("tau_densa", colunas); idx_tau = colunas.index(c) if c else len(colunas)

# ---------- 2) GOLD: keep rows with finite Wr/Wm (10005 in this run) and save ----------
for c in ["wr_kg_m2_h","wm_kg_m2_h"]:
    if c not in df.columns: raise RuntimeError(f"Coluna faltando no v4: {c}")

mask_gold = np.isfinite(pd.to_numeric(df["wr_kg_m2_h"], errors="coerce")) & \
            np.isfinite(pd.to_numeric(df["wm_kg_m2_h"], errors="coerce"))
df_gold = df.loc[mask_gold].copy()

# Leak-guard: exclude every column that enters the label computation (includes the physical block)
ALIASES = {
    "tau_main":     ["tau_densa","tau","t_leito","leito_temp_average","bed_temp","temperatura"],
    "temp_leito":   ["leito_temp_average","bed_temp","temperatura_leito","fornalha_leito_temp_average"],
    "air_total":    ["air_total_knm3_h_filled","air_total_knm3_h","air_total_nm3_h","air_total_mn3_h"],
    "air_primary":  ["air_primary_knm3_h","air_primary_nm3_h","air_primary"],
    "air_secondary":["air_secondary_knm3_h","air_secondary_nm3_h","air_secondary"],
    "dp_furn":      ["dp_fornalha","delta_p_fornalha","fornalha_dp","pressao_fornalha","dp_furnace"],
    "p_furn_a":     ["pressao_fornalha_a_inf","pressao_fornalha_a","furnace_a_pressure","fornalha_a_press"],
    "p_furn_b":     ["pressao_fornalha_b_inf","pressao_fornalha_b","furnace_b_pressure","fornalha_b_press"],
    "p_abs":        ["pressao_fornalha","furnace_pressure","pressao_plenum","pressao_leito","pressao_tambor"],
    "rho_co2":      ["densidade_co2","co2_density","rho_co2"],
    "o2":           ["o2_medio","o2_excess_pct","oxygen"]
}
def choose(qs):
    best, bs = None, -1.0
    for q in qs:
        c, s = best_match(q, colunas)
        if c and s > bs:
            best, bs = c, s
    return best

tau_main       = choose(ALIASES["tau_main"])
temp_leito_col = choose(ALIASES["temp_leito"])
air_total_col  = choose(ALIASES["air_total"])
air_pri_col    = choose(ALIASES["air_primary"])
air_sec_col    = choose(ALIASES["air_secondary"])
dp_furn_col    = choose(ALIASES["dp_furn"])
p_furn_a_col   = choose(ALIASES["p_furn_a"])
p_furn_b_col   = choose(ALIASES["p_furn_b"])
p_abs_col      = choose(ALIASES["p_abs"])
rho_co2_col    = choose(ALIASES["rho_co2"])
o2_col         = choose(ALIASES["o2"])

used_cols = set([tau_main, temp_leito_col, air_total_col, air_pri_col, air_sec_col,
                 dp_furn_col, p_furn_a_col, p_furn_b_col, p_abs_col, rho_co2_col, o2_col,
                 "air_total_knm3_h_filled"])
used_cols = {c for c in used_cols if c}
used_cols.update(colunas[idx_tau:])

# Features GOLD = tudo menos used_cols + alvos (mesmas linhas GOLD)
orig_cols = [c for c in df.columns if c not in ["wr_kg_m2_h","wm_kg_m2_h"]]
cols_keep_gold = [c for c in orig_cols if c not in used_cols] + ["wr_kg_m2_h","wm_kg_m2_h"]
df_gold_feat = df_gold[cols_keep_gold].copy()

# salvar GOLD
write_multiheader_csv(df_gold, dim_map, PATH_GOLD_L)
write_multiheader_csv(df_gold_feat, dim_map, PATH_GOLD_F)

print(f"GOLD: linhas={len(df_gold)} | salvo:")
print("  Rotulado:", PATH_GOLD_L)
print("  Features:", PATH_GOLD_F)

# ---------- 3) SILVER: rebuild ΔP (short-gap interpolation + k·ρg·ν²) and recompute Wr/Wm ----------
# column mappings and base series
def series_TK(df_loc, name):
    return TK_from_series(df_loc[name], dim_text(dim_map, name), fallback_name=name) if name else pd.Series(np.nan, index=df_loc.index)

TK = series_TK(df, tau_main) if tau_main else (series_TK(df, temp_leito_col) if temp_leito_col else pd.Series(np.nan, index=df.index))

if p_abs_col:
    Pabs = to_pa(df[p_abs_col], dim_text(dim_map, p_abs_col))
    if Pabs.median(skipna=True) < 2e5: Pabs = Pabs + P_N
else:
    Pabs = pd.Series(P_N, index=df.index)

DP_from_col = to_pa(df[dp_furn_col], dim_text(dim_map, dp_furn_col)) if dp_furn_col else pd.Series(np.nan, index=df.index)
DP_from_ab  = None
if p_furn_a_col and p_furn_b_col:
    Pa = to_pa(df[p_furn_a_col], dim_text(dim_map, p_furn_a_col))
    Pb = to_pa(df[p_furn_b_col], dim_text(dim_map, p_furn_b_col))
    DP_from_ab = (Pa - Pb).abs()

def looks_like_absolute(dp_name: str) -> bool:
    if not dp_name: return False
    n = dp_name.lower()
    return ("delta" not in n) and ("dp" not in n)

use_ab_all = False
if dp_furn_col and p_abs_col and (dp_furn_col == p_abs_col):
    use_ab_all = True
elif dp_furn_col and looks_like_absolute(dp_furn_col) and DP_from_ab is not None:
    use_ab_all = True

if use_ab_all and DP_from_ab is not None:
    DP0 = DP_from_ab.copy()
else:
    DP0 = DP_from_col.copy()
    if DP_from_ab is not None:
        need = ~np.isfinite(DP0) & np.isfinite(DP_from_ab)
        DP0.loc[need] = DP_from_ab.loc[need]

# Vazão real (usa a coluna preenchida se existir)
air_filled_col = "air_total_knm3_h_filled" if "air_total_knm3_h_filled" in df.columns else air_total_col
QN_total_filled = pd.to_numeric(df[air_filled_col], errors="coerce") if air_filled_col else pd.Series(np.nan, index=df.index)
Q_real = (QN_total_filled * 1000.0 / 3600.0) * (TK / T_N) * (P_N / Pabs)
nu_base = Q_real / A_EF
DP_ref  = np.nanmedian(DP0) if np.isfinite(DP0).any() else 1.0
nu_proxy = nu_base * np.power(np.maximum(DP0, 1.0) / max(DP_ref, 1.0), BETA_DP)

# fração CO2 robusta (não usa O2)
gamma = pd.Series(GAMMA_DEFAULT, index=df.index)

# ρg, μgas (robusto com TK_mu)
M_mix = (1 - gamma) * M_AIR + gamma * M_CO2
rho_g = (Pabs * M_mix) / (R * TK)

TK_mu = TK.where(np.isfinite(TK), 1100.0)
mu_gas = (1 - gamma) * mu_air_suth(TK_mu) + gamma * mu_co2_suth(TK_mu)
# scalar fallback (μ_air at 1100 K): avoids relying on index alignment in fillna
mu_gas = pd.to_numeric(mu_gas, errors="coerce").fillna(mu_air_suth(1100.0))

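# ====== ΔP SILVER: short-gap interpolation plus a physical estimate for long gaps ======
# NOTE: nu_proxy is derived from DP0, so it is NaN wherever DP0 is NaN (np.maximum
# propagates NaN). mask_nan is therefore empty, and rows with filled ΔP still get NaN Wr/Wm.
DP_interp = DP0.interpolate(method="linear", limit=6, limit_direction="both")
mask_nan  = DP_interp.isna() & np.isfinite(rho_g) & np.isfinite(nu_proxy)
k = np.nanmedian(DP0 / (rho_g * (nu_proxy**2)))
DP_hat = k * rho_g * (nu_proxy**2)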

# clipping físico
p5, p95 = np.nanpercentile(DP0, [5, 95]) if np.isfinite(DP0).any() else (1.0, 1.0)
DP_hat = DP_hat.clip(lower=p5*0.5, upper=p95*1.5)

DP_silver = DP_interp.copy()
DP_silver.loc[mask_nan] = DP_hat.loc[mask_nan]

# δ, Wr, Wm — SILVER
epsP = 1.0
delta_proxy = (rho_g * (nu_proxy ** 2) * L_REF) / np.maximum(DP_silver, epsP)
delta_m = C_DELTA * delta_proxy

sigma_mat   = pd.Series(SIGMA_MAT_BASE,   index=df.index)
rho_fuel    = pd.Series(RHO_FUEL_BASE,    index=df.index)
sigma_steel = pd.Series(SIGMA_STEEL_BASE, index=df.index)
g = sigma_mat / np.maximum(rho_fuel, 1e-12)
h = sigma_steel / np.maximum(mu_gas,  1e-12)

nu_pos    = np.maximum(nu_proxy.fillna(0.0), 0.0)
delta_pos = np.maximum(delta_m.fillna(np.nan), 1e-9)
TK_pos    = np.maximum(TK.fillna(np.nan), 1.0)

wr_s = (ALPHA_BASE * (nu_pos**2.0) * (delta_pos**1.0) * np.exp(-TK_pos / TCRIT_BASE) * g).astype(float)
wm_s = (BETA_BASE  * (nu_pos**3.0) * np.sqrt(delta_pos)    * np.exp(-TK_pos / TOPT_BASE)  * h).astype(float)

# ---------- filtrar válidos e salvar SILVER ----------
mask_silver = np.isfinite(wr_s) & np.isfinite(wm_s)
df_silver = df.loc[mask_silver].copy()
df_silver["wr_kg_m2_h"] = wr_s.loc[mask_silver]
df_silver["wm_kg_m2_h"] = wm_s.loc[mask_silver]

# Features SILVER (mesma lista de exclusão)
orig_cols = [c for c in df.columns if c not in ["wr_kg_m2_h","wm_kg_m2_h"]]
cols_keep_silver = [c for c in orig_cols if c not in used_cols] + ["wr_kg_m2_h","wm_kg_m2_h"]
df_silver_feat = df_silver[cols_keep_silver].copy()

# salvar SILVER
write_multiheader_csv(df_silver, dim_map, PATH_SILV_L)
write_multiheader_csv(df_silver_feat, dim_map, PATH_SILV_F)

print(f"SILVER: linhas={len(df_silver)} | salvo:")
print("  Rotulado:", PATH_SILV_L)
print("  Features:", PATH_SILV_F)


GOLD: linhas=10005 | salvo:
  Rotulado: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_rotulado_v4_gold.csv
  Features: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_features_v4_gold.csv
SILVER: linhas=10005 | salvo:
  Rotulado: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_rotulado_v4_silver.csv
  Features: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_features_v4_silver.csv
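
SILVER ends with the same 10005 rows as GOLD. A likely reason is that `nu_proxy` is computed from the raw `DP0`, and `np.maximum` propagates NaN, so any row with a missing ΔP also has a missing velocity and never qualifies for the physical fallback. The toy example below reproduces that effect on made-up numbers and shows one possible workaround: using a neutral correction factor of 1 where ΔP is missing. That workaround is an assumption for illustration, not the method in the internal PDF.

```python
import numpy as np
import pandas as pd

BETA_DP = 0.25
dp0 = pd.Series([100.0, np.nan, np.nan, np.nan, 120.0])   # toy ΔP with a 3-sample gap
nu_base = pd.Series([2.0, 2.1, 2.2, 2.3, 2.4])            # toy base velocity, no gaps
dp_ref = float(np.nanmedian(dp0))

# As in the notebook: the ΔP correction factor uses the raw series.
nu_proxy = nu_base * np.power(np.maximum(dp0, 1.0) / max(dp_ref, 1.0), BETA_DP)

# limit=1 leaves the middle of the gap unfilled, mimicking a "long" gap.
dp_interp = dp0.interpolate(method="linear", limit=1, limit_direction="both")
mask_fallback = dp_interp.isna() & np.isfinite(nu_proxy)

print("NaN in nu_proxy:", int(nu_proxy.isna().sum()))
print("rows eligible for the physical fallback:", int(mask_fallback.sum()))

# Hypothetical alternative: where ΔP is missing, use dp_ref so the factor equals 1.
dp_for_nu = dp0.fillna(dp_ref)
nu_alt = nu_base * np.power(np.maximum(dp_for_nu, 1.0) / max(dp_ref, 1.0), BETA_DP)
print("NaN in nu_alt:", int(nu_alt.isna().sum()))
```

If the velocity is built this way, the fallback mask can select rows and the filled ΔP can reach `delta_proxy`. Whether a neutral factor is physically acceptable for this boiler is a modeling decision that the sketch does not settle.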


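For reference, the label equations implemented by the GOLD/SILVER cells can be written as below. This is a transcription of the code, not of the internal PDF; the symbol names are chosen here for readability, and the numerical guards in the code ($\nu \ge 0$, $\delta \ge 10^{-9}$, $T \ge 1$ K) are omitted. In the velocity factor, $\Delta P$ is the raw series `DP0`; in $\delta$ it is the filled series (`DP_silver`).

$$
\begin{aligned}
Q_{\text{real}} &= Q_N\,\frac{1000}{3600}\,\frac{T}{T_N}\,\frac{P_N}{P_{\text{abs}}}, \qquad \nu_{\text{base}} = \frac{Q_{\text{real}}}{A_{\text{ef}}},\\
\nu &= \nu_{\text{base}}\left(\frac{\max(\Delta P,\,1)}{\max(\Delta P_{\text{ref}},\,1)}\right)^{\beta_{\Delta P}},\\
\rho_g &= \frac{P_{\text{abs}}\,M_{\text{mix}}}{R\,T}, \qquad M_{\text{mix}} = (1-\gamma)\,M_{\text{air}} + \gamma\,M_{\text{CO}_2},\\
\delta &= C_\delta\,\frac{\rho_g\,\nu^{2}\,L_{\text{ref}}}{\max(\Delta P,\,\varepsilon_P)},\\
W_r &= \alpha\,\nu^{2}\,\delta\;e^{-T/T_{\text{crit}}}\;\frac{\sigma_{\text{mat}}}{\rho_{\text{fuel}}},\\
W_m &= \beta\,\nu^{3}\,\sqrt{\delta}\;e^{-T/T_{\text{opt}}}\;\frac{\sigma_{\text{steel}}}{\mu_{\text{gas}}}.
\end{aligned}
$$

Here $Q_N$ is the total air flow in kNm³/h (hence the factor 1000/3600 to reach Nm³/s), and $\mu_{\text{gas}} = (1-\gamma)\,\mu_{\text{air}}(T) + \gamma\,\mu_{\text{CO}_2}(T)$ with both viscosities taken from Sutherland's law.
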
In [2]:
# ==============================================================
# GOLD & SILVER2 (com imputação de Q_N e ΔP) a partir de v4
# ==============================================================

import os, re, unicodedata
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

# ---------- caminhos ----------
CURATED_DIR = r"C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated"
PATH_ROT_V4 = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v4.csv")

PATH_GOLD_L = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v4_gold.csv")
PATH_GOLD_F = os.path.join(CURATED_DIR, "A1_ML_DL_features_v4_gold.csv")

PATH_SILV2_L = os.path.join(CURATED_DIR, "A1_ML_DL_rotulado_v4_silver2.csv")
PATH_SILV2_F = os.path.join(CURATED_DIR, "A1_ML_DL_features_v4_silver2.csv")

# ---------- constantes físicas e parâmetros ----------
T_N, P_N = 273.15, 1.01325e5
R = 8.314462618
M_AIR, M_CO2 = 0.029, 0.044

A_EF   = 15.0     # m²
L_REF  = 3.0      # m
BETA_DP = 0.25
C_DELTA = 1e-3
GAMMA_DEFAULT = 0.25

ALPHA_BASE = 4.2e-3
BETA_BASE  = 1.2e-4
TCRIT_BASE = 1200.0
TOPT_BASE  = 1250.0
SIGMA_MAT_BASE   = 6.0e9
RHO_FUEL_BASE    = 700.0
SIGMA_STEEL_BASE = 2.5e9

# ---------- utilidades ----------
def strip_accents(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFKD", str(s)) if not unicodedata.combining(c))
def normalize_token(s: str) -> str:
    s = strip_accents(s).lower()
    s = re.sub(r"[^a-z0-9_\s]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()
def split_tokens(s: str): return set(t for t in re.split(r"[ _]+", normalize_token(s)) if t)
def hybrid_similarity(a: str, b: str) -> float:
    ta, tb = split_tokens(a), split_tokens(b)
    jac = len(ta & tb) / len(ta | tb) if (ta and tb) else 0.0
    seq = SequenceMatcher(None, normalize_token(a), normalize_token(b)).ratio()
    return 0.6*jac + 0.4*seq
def best_match(q, cands):
    sc = [(c, hybrid_similarity(q, c)) for c in cands]
    sc.sort(key=lambda x: x[1], reverse=True)
    return sc[0] if sc else (None, 0.0)

def read_multiheader_csv(path):
    dfr = pd.read_csv(path, header=[0,1], engine="python")
    dim = {c: d for (c,d) in dfr.columns}
    df  = dfr.copy()
    df.columns = [c for (c,d) in dfr.columns]
    return df, dim

def write_multiheader_csv(df, dim, path):
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(",".join(df.columns) + "\n")
        f.write(",".join([str(dim.get(c,"")) for c in df.columns]) + "\n")
    df.to_csv(path, mode="a", index=False, header=False, encoding="utf-8-sig")

def dim_text(dim_map, col): return str(dim_map.get(col,"") or "").lower()
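def to_pa(x, dimtxt):
    """Convert a pressure series to Pa according to the unit text in the dimension row."""
    s = (dimtxt or "").lower()
    v = pd.to_numeric(x, errors="coerce")
    if "mpa" in s:   return v * 1e6
    if "kpa" in s:   return v * 1e3
    if "mbar" in s:  return v * 1e2   # must come before "bar": "mbar" contains "bar"
    if "bar" in s:   return v * 1e5
    return v                          # no known unit tag: assumed to be Pa already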

def TK_from_series(series, dimtxt, fallback_name=""):
    """Return temperature in kelvin; Celsius-tagged (or untagged 'tau') columns get +273.15."""
    d = (dimtxt or "").lower()
    is_c = ("°c" in d) or ("c" in d and "k" not in d)
    v = pd.to_numeric(series, errors="coerce")
    if is_c or ("tau" in str(fallback_name).lower() and not dimtxt):
        return v + 273.15
    return v

def mu_air_suth(TK):
    mu0, T0, S = 1.716e-5, 273.15, 111.0
    return mu0 * ((T0 + S) / (TK + S)) * (TK / T0) ** 1.5
def mu_co2_suth(TK):
    mu0, T0, S = 1.37e-5, 273.15, 240.0
    return mu0 * ((T0 + S) / (TK + S)) * (TK / T0) ** 1.5

def fill_blocks_avg(series, limit_interp=24, roll_win=48):
    """Interpolate up to 'limit_interp' points; fill what remains with a centered rolling
       median (roll_win); fill anything still missing with the global median."""
    s = pd.to_numeric(series, errors="coerce")
    s1 = s.interpolate(method="linear", limit=limit_interp, limit_direction="both")
    s2 = s1.copy()
    # fill remaining NaNs with the centered rolling median
    roll_med = s2.rolling(roll_win, min_periods=1, center=True).median()
    s2 = s2.fillna(roll_med)
    # global fallback
    s3 = s2.fillna(np.nanmedian(s2))
    return s3

# ---------- 1) Ler o rotulado v4 ----------
df, dim_map = read_multiheader_csv(PATH_ROT_V4)
colunas = list(df.columns)

# localizar TAU_DENSA para o bloco físico
if "TAU_DENSA" in colunas:
    idx_tau = colunas.index("TAU_DENSA")
else:
    c,_ = best_match("tau_densa", colunas); idx_tau = colunas.index(c) if c else len(colunas)

# ---------- (A) GOLD ----------
mask_gold = np.isfinite(pd.to_numeric(df["wr_kg_m2_h"], errors="coerce")) & \
            np.isfinite(pd.to_numeric(df["wm_kg_m2_h"], errors="coerce"))
df_gold = df.loc[mask_gold].copy()

# leak-guard set
ALIASES = {
    "tau_main":     ["tau_densa","tau","t_leito","leito_temp_average","bed_temp","temperatura"],
    "temp_leito":   ["leito_temp_average","bed_temp","temperatura_leito","fornalha_leito_temp_average"],
    "air_total":    ["air_total_knm3_h_filled","air_total_knm3_h","air_total_nm3_h","air_total_mn3_h"],
    "air_primary":  ["air_primary_knm3_h","air_primary_nm3_h","air_primary"],
    "air_secondary":["air_secondary_knm3_h","air_secondary_nm3_h","air_secondary"],
    "dp_furn":      ["dp_fornalha","delta_p_fornalha","fornalha_dp","pressao_fornalha","dp_furnace"],
    "p_furn_a":     ["pressao_fornalha_a_inf","pressao_fornalha_a","furnace_a_pressure","fornalha_a_press"],
    "p_furn_b":     ["pressao_fornalha_b_inf","pressao_fornalha_b","furnace_b_pressure","fornalha_b_press"],
    "p_abs":        ["pressao_fornalha","furnace_pressure","pressao_plenum","pressao_leito","pressao_tambor"],
    "rho_co2":      ["densidade_co2","co2_density","rho_co2"],
    "o2":           ["o2_medio","o2_excess_pct","oxygen"]
}
def choose(qs):
    best, bs = None, -1.0
    for q in qs:
        c, s = best_match(q, colunas)
        if c and s > bs:
            best, bs = c, s
    return best

tau_main       = choose(ALIASES["tau_main"])
temp_leito_col = choose(ALIASES["temp_leito"])
air_total_col  = choose(ALIASES["air_total"])
air_pri_col    = choose(ALIASES["air_primary"])
air_sec_col    = choose(ALIASES["air_secondary"])
dp_furn_col    = choose(ALIASES["dp_furn"])
p_furn_a_col   = choose(ALIASES["p_furn_a"])
p_furn_b_col   = choose(ALIASES["p_furn_b"])
p_abs_col      = choose(ALIASES["p_abs"])
rho_co2_col    = choose(ALIASES["rho_co2"])
o2_col         = choose(ALIASES["o2"])

used_cols = set([tau_main, temp_leito_col, air_total_col, air_pri_col, air_sec_col,
                 dp_furn_col, p_furn_a_col, p_furn_b_col, p_abs_col, rho_co2_col, o2_col,
                 "air_total_knm3_h_filled"])
used_cols = {c for c in used_cols if c}
used_cols.update(colunas[idx_tau:])

# features GOLD
orig_cols = [c for c in df.columns if c not in ["wr_kg_m2_h","wm_kg_m2_h"]]
cols_keep_gold = [c for c in orig_cols if c not in used_cols] + ["wr_kg_m2_h","wm_kg_m2_h"]
df_gold_feat = df_gold[cols_keep_gold].copy()

write_multiheader_csv(df_gold, dim_map, PATH_GOLD_L)
write_multiheader_csv(df_gold_feat, dim_map, PATH_GOLD_F)
print(f"GOLD: linhas={len(df_gold)} | salvo:\n  Rotulado: {PATH_GOLD_L}\n  Features: {PATH_GOLD_F}")

# ---------- (B) SILVER2: imputar Q_N e ΔP e recalc Wr/Wm ----------
# séries base
def series_TK(df_loc, name):
    return TK_from_series(df_loc[name], dim_text(dim_map, name), fallback_name=name) if name else pd.Series(np.nan, index=df_loc.index)

TK = series_TK(df, tau_main) if tau_main else (series_TK(df, temp_leito_col) if temp_leito_col else pd.Series(np.nan, index=df.index))

if p_abs_col:
    Pabs = to_pa(df[p_abs_col], dim_text(dim_map, p_abs_col))
    if Pabs.median(skipna=True) < 2e5: Pabs = Pabs + P_N
else:
    Pabs = pd.Series(P_N, index=df.index)

DP_from_col = to_pa(df[dp_furn_col], dim_text(dim_map, dp_furn_col)) if dp_furn_col else pd.Series(np.nan, index=df.index)
DP_from_ab  = None
if p_furn_a_col and p_furn_b_col:
    Pa = to_pa(df[p_furn_a_col], dim_text(dim_map, p_furn_a_col))
    Pb = to_pa(df[p_furn_b_col], dim_text(dim_map, p_furn_b_col))
    DP_from_ab = (Pa - Pb).abs()

def looks_like_absolute(dp_name: str) -> bool:
    if not dp_name: return False
    n = dp_name.lower()
    return ("delta" not in n) and ("dp" not in n)

use_ab_all = False
if dp_furn_col and p_abs_col and (dp_furn_col == p_abs_col):
    use_ab_all = True
elif dp_furn_col and looks_like_absolute(dp_furn_col) and DP_from_ab is not None:
    use_ab_all = True

if use_ab_all and DP_from_ab is not None:
    DP0 = DP_from_ab.copy()
else:
    DP0 = DP_from_col.copy()
    if DP_from_ab is not None:
        need = ~np.isfinite(DP0) & np.isfinite(DP_from_ab)
        DP0.loc[need] = DP_from_ab.loc[need]

# --------- imputação de Q_N (Silver2) ----------
air_filled_col = "air_total_knm3_h_filled" if "air_total_knm3_h_filled" in df.columns else air_total_col
QN0 = pd.to_numeric(df[air_filled_col], errors="coerce") if air_filled_col else pd.Series(np.nan, index=df.index)

QN_s2 = fill_blocks_avg(QN0, limit_interp=24, roll_win=48)

# --------- construir Q_real, ν ----------
Q_real = (QN_s2 * 1000.0 / 3600.0) * (TK / T_N) * (P_N / Pabs)
nu_base = Q_real / A_EF
DP_ref  = np.nanmedian(DP0) if np.isfinite(DP0).any() else 1.0
nu_proxy = nu_base * np.power(np.maximum(DP0, 1.0) / max(DP_ref, 1.0), BETA_DP)

# --------- fração CO2 robusta (default) + ρg e μgas ---------
gamma = pd.Series(GAMMA_DEFAULT, index=df.index)
M_mix = (1 - gamma) * M_AIR + gamma * M_CO2
rho_g = (Pabs * M_mix) / (R * TK)

TK_mu = TK.where(np.isfinite(TK), 1100.0)
mu_gas = (1 - gamma) * mu_air_suth(TK_mu) + gamma * mu_co2_suth(TK_mu)
# scalar fallback (μ_air at 1100 K): avoids relying on index alignment in fillna
mu_gas = pd.to_numeric(mu_gas, errors="coerce").fillna(mu_air_suth(1100.0))

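# --------- ΔP SILVER2: short-gap interpolation + k·ρg·ν² with clipping ----------
# NOTE: nu_proxy still depends on the raw DP0, so it is NaN wherever DP0 is NaN and
# mask_nan selects no rows; the extra SILVER2 rows come from the Q_N imputation alone.
DP_interp = DP0.interpolate(method="linear", limit=6, limit_direction="both")
mask_nan  = DP_interp.isna() & np.isfinite(rho_g) & np.isfinite(nu_proxy)
k = np.nanmedian(DP0 / (rho_g * (nu_proxy**2)))
DP_hat = k * rho_g * (nu_proxy**2)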

p5, p95 = np.nanpercentile(DP0, [5, 95]) if np.isfinite(DP0).any() else (1.0, 1.0)
DP_hat = DP_hat.clip(lower=p5*0.5, upper=p95*1.5)

DP_s2 = DP_interp.copy()
DP_s2.loc[mask_nan] = DP_hat.loc[mask_nan]

# --------- δ, Wr, Wm — SILVER2 ----------
epsP = 1.0
delta_proxy = (rho_g * (nu_proxy ** 2) * L_REF) / np.maximum(DP_s2, epsP)
delta_m = C_DELTA * delta_proxy

sigma_mat   = pd.Series(SIGMA_MAT_BASE,   index=df.index)
rho_fuel    = pd.Series(RHO_FUEL_BASE,    index=df.index)
sigma_steel = pd.Series(SIGMA_STEEL_BASE, index=df.index)
g = sigma_mat / np.maximum(rho_fuel, 1e-12)
h = sigma_steel / np.maximum(mu_gas,  1e-12)

nu_pos    = np.maximum(nu_proxy.fillna(0.0), 0.0)
delta_pos = np.maximum(delta_m.fillna(np.nan), 1e-9)
TK_pos    = np.maximum(TK.fillna(np.nan), 1.0)

wr_s2 = (ALPHA_BASE * (nu_pos**2.0) * (delta_pos**1.0) * np.exp(-TK_pos / TCRIT_BASE) * g).astype(float)
wm_s2 = (BETA_BASE  * (nu_pos**3.0) * np.sqrt(delta_pos)    * np.exp(-TK_pos / TOPT_BASE)  * h).astype(float)

# --------- filtrar válidos e salvar SILVER2 ----------
mask_s2 = np.isfinite(wr_s2) & np.isfinite(wm_s2)
df_s2 = df.loc[mask_s2].copy()
df_s2["wr_kg_m2_h"] = wr_s2.loc[mask_s2]
df_s2["wm_kg_m2_h"] = wm_s2.loc[mask_s2]

# features (mesma lista de exclusão do GOLD)
cols_keep_s2 = [c for c in orig_cols if c not in used_cols] + ["wr_kg_m2_h","wm_kg_m2_h"]
df_s2_feat = df_s2[cols_keep_s2].copy()

write_multiheader_csv(df_s2, dim_map, PATH_SILV2_L)
write_multiheader_csv(df_s2_feat, dim_map, PATH_SILV2_F)

print(f"SILVER2: linhas={len(df_s2)} | salvo:\n  Rotulado: {PATH_SILV2_L}\n  Features: {PATH_SILV2_F}")


GOLD: linhas=10005 | salvo:
  Rotulado: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_rotulado_v4_gold.csv
  Features: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_features_v4_gold.csv
SILVER2: linhas=11749 | salvo:
  Rotulado: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_rotulado_v4_silver2.csv
  Features: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated\A1_ML_DL_features_v4_silver2.csv

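For reference, the quantities computed by the SILVER2 labelling cell can be written out as below. This is a transcription of the code, not of the internal PDF, so agreement with the PDF's equations is an assumption. The symbols map to the code names as follows: $\gamma$ is `gamma`, $\nu$ is `nu_proxy` (clipped to be non-negative), $\Delta P_{s2}$ is `DP_s2`, $\varepsilon_P$ is `epsP`, and $\alpha$, $\beta$, $T_{\text{crit}}$, $T_{\text{opt}}$ are `ALPHA_BASE`, `BETA_BASE`, `TCRIT_BASE`, `TOPT_BASE`.

```latex
\begin{aligned}
M_{\text{mix}} &= (1-\gamma)\,M_{\text{air}} + \gamma\,M_{\text{CO}_2}, \qquad
\rho_g = \frac{P_{\text{abs}}\,M_{\text{mix}}}{R\,T_K} \\
k &= \operatorname{median}\!\left(\frac{\Delta P}{\rho_g\,\nu^{2}}\right), \qquad
\widehat{\Delta P} = k\,\rho_g\,\nu^{2} \\
\delta &= C_\delta\,\frac{\rho_g\,\nu^{2}\,L_{\text{ref}}}{\max(\Delta P_{s2},\,\varepsilon_P)} \\
W_r &= \alpha\,\nu^{2}\,\delta\,\exp\!\left(-\frac{T_K}{T_{\text{crit}}}\right)\frac{\sigma_{\text{mat}}}{\rho_{\text{fuel}}} \\
W_m &= \beta\,\nu^{3}\,\sqrt{\delta}\,\exp\!\left(-\frac{T_K}{T_{\text{opt}}}\right)\frac{\sigma_{\text{steel}}}{\mu_{\text{gas}}}
\end{aligned}
```

Since $\delta$ is proportional to $\nu^{2}$, $W_r$ scales as $\nu^{4}$ and $W_m$ as $\nu^{4}$ as well ($\nu^{3}\cdot\sqrt{\nu^{2}}$) whenever $\Delta P_{s2}$ is held fixed.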

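The ΔP gap-filling step of the SILVER2 cell (short interpolation first, then the `k·ρg·ν²` estimate with clipping) can be tried on toy data with the sketch below. The function name `fill_dp`, the toy values and `limit=1` are illustrative assumptions; the notebook itself uses `limit=6` on the real series.

```python
import numpy as np
import pandas as pd

def fill_dp(dp, rho_g, nu, limit=6):
    """Fill gaps in dp: short linear interpolation, then k * rho_g * nu**2."""
    dp_interp = dp.interpolate(method="linear", limit=limit, limit_direction="both")

    # Calibrate k on finite ratios only (nu == 0 would give +/-inf).
    ratio = dp / (rho_g * nu**2)
    ratio = ratio[np.isfinite(ratio)]
    if ratio.empty:
        return dp_interp
    k = float(ratio.median())

    # Physical estimate, clipped to a band around the observed 5th/95th percentiles.
    dp_hat = k * rho_g * nu**2
    p5, p95 = np.nanpercentile(dp, [5, 95])
    dp_hat = dp_hat.clip(lower=0.5 * p5, upper=1.5 * p95)

    # Use the estimate only where interpolation left a gap.
    still_nan = dp_interp.isna() & np.isfinite(dp_hat)
    out = dp_interp.copy()
    out.loc[still_nan] = dp_hat.loc[still_nan]
    return out

# Toy series (hypothetical values): a 4-sample gap, and nu == 0 in the last row.
dp    = pd.Series([100.0, 110.0, np.nan, np.nan, np.nan, np.nan, 120.0, 130.0])
rho_g = pd.Series(1.0, index=dp.index)
nu    = pd.Series([10.0] * 7 + [0.0])

filled = fill_dp(dp, rho_g, nu, limit=1)
print(filled.round(2).tolist())
```

With `limit=1` the two samples at the edges of the gap are interpolated and the two interior samples fall back to the calibrated estimate, which is the same division of labor the notebook applies with a six-hour limit.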
In [1]:
# ==============================================================
# SUITABILITY check (Supervised & Unsupervised)
# GOLD and SILVER2: all outputs in a single directory
# ==============================================================

import os, re, unicodedata, warnings
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, r2_score, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

warnings.filterwarnings("ignore")

# ------------------ Paths ------------------
BASE_IN = r"C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\data\curated"
PATHS = {
    "GOLD":    os.path.join(BASE_IN, "A1_ML_DL_features_v4_gold.csv"),
    "SILVER2": os.path.join(BASE_IN, "A1_ML_DL_features_v4_silver2.csv"),
}

# >>> SINGLE output directory (created if it does not exist)
OUT_DIR = r"C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos"
os.makedirs(OUT_DIR, exist_ok=True)

# ------------------ Read/write utilities ------------------
def read_features_multiheader(path):
    # Reads 2 header rows: (name | dimension)
    dfr = pd.read_csv(path, header=[0,1], engine="python")
    dim_map = {c:d for (c,d) in dfr.columns}
    df = dfr.copy()
    df.columns = [c for (c,d) in dfr.columns]
    return df, dim_map

def write_md(version, filename, text):
    path = os.path.join(OUT_DIR, f"{version.lower()}_{filename}")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    print("  - salvo:", path)

def write_csv(version, filename, df):
    path = os.path.join(OUT_DIR, f"{version.lower()}_{filename}")
    df.to_csv(path, index=False, encoding="utf-8-sig")
    print("  - salvo:", path)

# ------------------ Helper functions ------------------
def strip_accents(s: str) -> str:
    return "".join(c for c in unicodedata.normalize("NFKD", str(s)) if not unicodedata.combining(c))
def normalize_token(s: str) -> str:
    s = strip_accents(s).lower()
    s = re.sub(r"[^a-z0-9_\s]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()
def split_tokens(s: str): return set(t for t in re.split(r"[ _]+", normalize_token(s)) if t)

from difflib import SequenceMatcher
def hybrid_similarity(a: str, b: str) -> float:
    ta, tb = split_tokens(a), split_tokens(b)
    jac = len(ta & tb) / len(ta | tb) if (ta and tb) else 0.0
    seq = SequenceMatcher(None, normalize_token(a), normalize_token(b)).ratio()
    return 0.6*jac + 0.4*seq

LEAKY_PATTERNS = [
    r"^tau", r"leito_temp", r"bed_temp", r"temperatura", r"o2", r"co2", r"air_",
    r"pressao_fornalha", r"dp_fornalha", r"delta_p", r"fornalha_dp",
    r"pressao_plenum", r"pressao_leito", r"pressao_tambor",
    r"tau_densa", r"tau_diluida", r"tau_global", r"tau_backpass",
    r"air_total_knm3_h_filled",
]
TARGETS = ["wr_kg_m2_h","wm_kg_m2_h"]

def leakage_scan(columns):
    flags = []
    for c in columns:
        if c in TARGETS:
            continue
        for pat in LEAKY_PATTERNS:
            if re.search(pat, c, re.I):
                flags.append(c); break
    return sorted(set(flags))

def missingness(dfX):
    return dfX.isna().mean().sort_values(ascending=False)

def constant_columns(dfX, near_threshold=0.01):
    const, near = [], []
    for c in dfX.columns:
        x = dfX[c]
        if x.nunique(dropna=True) <= 1:
            const.append(c)
        else:
            ratio = x.nunique(dropna=True) / max(len(x),1)
            if ratio <= near_threshold:
                near.append((c, ratio))
    return const, near

def high_corr_pairs(dfX, thr=0.97, max_cols=200):
    X = dfX.select_dtypes(include=[np.number])
    if X.shape[1] > max_cols:
        X = X.iloc[:, :max_cols]
    corr = X.corr().abs()
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i+1, len(cols)):
            if corr.iloc[i,j] >= thr:
                pairs.append((cols[i], cols[j], corr.iloc[i,j]))
    return sorted(pairs, key=lambda t: t[2], reverse=True)

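def to_datetime_guess(df):
    cand_names = [c for c in df.columns if re.search(r"(time|hora|data|date|timestamp|datetime|ts)", c, re.I)]
    ts = None; col = None
    for c in cand_names:
        # Skip numeric columns: pd.to_datetime would read the numbers as epoch
        # offsets and report a spurious timestamp column.
        if pd.api.types.is_numeric_dtype(df[c]):
            continue
        s = pd.to_datetime(df[c], errors="coerce", dayfirst=False, utc=False)
        if s.notna().mean() > 0.9:
            ts, col = s, c
            break
    return ts, col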

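def supervised_quick_eval(df, ts, ycols=TARGETS):
    # Order rows chronologically when a timestamp is available; otherwise the
    # file order is assumed to be chronological.
    if ts is not None:
        df = df.loc[ts.sort_values(kind="stable").index]
    y = df[ycols].apply(pd.to_numeric, errors="coerce")
    X = df.drop(columns=ycols).select_dtypes(include=[np.number])
    # Rows with a missing target can be neither fitted nor scored.
    ok = y.notna().all(axis=1)
    X, y = X.loc[ok], y.loc[ok]

    n = len(X)
    if n < 200 or X.shape[1] < 3:
        return None

    # temporal split 70/10/20
    order = np.arange(n)
    train_end = int(0.7*n); val_end = int(0.8*n)
    idx_train = order[:train_end]; idx_val = order[train_end:val_end]; idx_test = order[val_end:]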

    imp = SimpleImputer(strategy="median")
    X_train = imp.fit_transform(X.iloc[idx_train])
    X_val   = imp.transform(X.iloc[idx_val])
    X_test  = imp.transform(X.iloc[idx_test])

    y_train = y.iloc[idx_train].values
    y_val   = y.iloc[idx_val].values
    y_test  = y.iloc[idx_test].values

    model = MultiOutputRegressor(RandomForestRegressor(
        n_estimators=200, random_state=42, n_jobs=-1
    ))
    model.fit(X_train, y_train)
    y_pred_val = model.predict(X_val)
    y_pred_tst = model.predict(X_test)

    def _m(y_true, y_pred):
        out = {}
        for i, col in enumerate(ycols):
            out[col] = {
                "R2": float(r2_score(y_true[:,i], y_pred[:,i])),
                "MAE": float(mean_absolute_error(y_true[:,i], y_pred[:,i]))
            }
        return out

    return {
        "n_train": int(len(idx_train)), "n_val": int(len(idx_val)), "n_test": int(len(idx_test)),
        "val": _m(y_val, y_pred_val),
        "test": _m(y_test, y_pred_tst),
        "num_features": int(X.shape[1])
    }

def unsupervised_quick_eval(df, ycols=TARGETS, sample_max=4000):
    X = df.drop(columns=ycols, errors="ignore").select_dtypes(include=[np.number])

    imp = SimpleImputer(strategy="median")
    X_imp = imp.fit_transform(X)
    scaler = StandardScaler()
    X_std = scaler.fit_transform(X_imp)

    n = X_std.shape[0]
    idx = np.arange(n)
    if n > sample_max:
        rng = np.random.RandomState(42)
        idx = rng.choice(n, size=sample_max, replace=False)
    Xs = X_std[idx]

    pca = PCA(n_components=min(30, Xs.shape[1]))
    Xp = pca.fit_transform(Xs)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    comps95 = int(np.searchsorted(cumvar, 0.95) + 1)

    Xp10 = Xp[:, :min(10, Xp.shape[1])]
    sil = []
    for k in range(2,7):
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        labels = km.fit_predict(Xp10)
        s = silhouette_score(Xp10, labels) if len(set(labels)) > 1 else np.nan
        sil.append((k, float(s)))
    return {
        "pca_cumvar_first10": [float(v) for v in cumvar[:10]],
        "n_components_95pct": comps95,
        "silhouette": sil
    }

# ------------------ Per-version assessment ------------------
def assess_version(name, path_in):
    print(f"\n==== {name} ====")
    df, dim = read_features_multiheader(path_in)

    # Targets
    for t in TARGETS:
        if t not in df.columns:
            raise RuntimeError(f"[{name}] Alvo ausente: {t}")

    # Leakage
    leak = leakage_scan(df.columns)

    # X/Y
    y = df[TARGETS].apply(pd.to_numeric, errors="coerce")
    X = df.drop(columns=TARGETS)

    # Missingness / constants / collinearity
    miss = missingness(X)
    const, near = constant_columns(X)
    pairs = high_corr_pairs(X, thr=0.97)

    # Save CSVs (prefixed with the version name)
    write_csv(name, "missingness.csv", miss.reset_index().rename(columns={"index":"col",0:"pct_missing"}))
    write_csv(name, "constants.csv", pd.DataFrame({"constant_cols": const}))
    write_csv(name, "near_constants.csv", pd.DataFrame(near, columns=["col","unique_ratio"]))
    write_csv(name, "high_corr_pairs.csv", pd.DataFrame(pairs, columns=["col_i","col_j","abs_corr"]))
    write_csv(name, "leakage_flags.csv", pd.DataFrame({"leaky_cols": leak}))

    # Time axis
    ts, ts_name = to_datetime_guess(df)
    if ts is not None:
        dif = ts.diff().dropna()
        hours = dif.dt.total_seconds()/3600.0
        pct_1h = (np.isclose(hours, 1.0)).mean()*100 if len(hours)>0 else np.nan
        gaps = int((hours>1.0).sum())
        tdiag = {"has_ts": True, "rows": int(len(ts)), "start": str(ts.iloc[0]),
                 "end": str(ts.iloc[-1]), "pct_step_1h": round(pct_1h,2), "num_gaps_gt_1h": gaps,
                 "ts_name": ts_name}
    else:
        tdiag = {"has_ts": False}

    # Supervised
    sup = supervised_quick_eval(df, ts, TARGETS)

    # Unsupervised
    unsup = unsupervised_quick_eval(df, TARGETS)

    # Markdown report
    nrows, nfeat = df.shape[0], X.select_dtypes(include=[np.number]).shape[1]
    miss_med = float(miss.median()) if len(miss)>0 else 0.0
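    ok_super = (sup is not None) and (sup["n_train"]>=1000) and (nfeat>=10)
    # searchsorted returns len(cumvar) when 95% is never reached, so
    # n_components_95pct can be 31 with 30 fitted components. Compare against 30:
    # the earlier bound max(30, nfeat) let that case pass.
    ok_unsup = unsup["n_components_95pct"] <= 30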

    md = []
    md.append(f"# Verificação de Adequação – {name}\n")
    md.append(f"- Linhas: **{nrows}**  |  Features numéricas: **{nfeat}**")
    md.append(f"- Alvos presentes: `{', '.join(TARGETS)}`")
    md.append(f"- Mediana de faltas nas features: **{100*miss_med:.2f}%**")
    md.append(f"- Colunas constantes: **{len(const)}**  | quase-constantes: **{len(near)}**")
    md.append(f"- Pares com |corr| ≥ 0.97: **{len(pairs)}**")
    if tdiag["has_ts"]:
        md.append(f"- Timestamp: **{tdiag['ts_name']}** | início: **{tdiag['start']}** | fim: **{tdiag['end']}**")
        md.append(f"- Passo de 1h (≈): **{tdiag['pct_step_1h']}%** | gaps >1h: **{tdiag['num_gaps_gt_1h']}**")
    else:
        md.append(f"- Timestamp: **não encontrado**")
    md.append("\n## Vazamento (features proibidas detectadas)")
    md.append(f"- Encontradas: **{len(leak)}**" + ("" if len(leak)==0 else f"\n  - " + "\n  - ".join(leak)))
    md.append("\n## Baseline – Supervisionado (RandomForest multi-alvo)")
    if sup is None:
        md.append("- **Dataset pequeno/insuficiente** para métrica robusta.")
    else:
        md.append(f"- Split: train={sup['n_train']}, val={sup['n_val']}, test={sup['n_test']}")
        for split in ["val","test"]:
            md.append(f"- {split.upper()}:")
            for tgt, m in sup[split].items():
                md.append(f"  - `{tgt}` → R²={m['R2']:.3f} | MAE={m['MAE']:.3g}")
    md.append("\n## Não-supervisionado (PCA + Silhouette)")
    md.append(f"- Componentes p/ 95% variância: **{unsup['n_components_95pct']}**")
    md.append(f"- Variância acumulada (10 comps): " + ", ".join([f"{v:.2f}" for v in unsup["pca_cumvar_first10"]]))
    md.append(f"- Silhouette k=2..6: " + ", ".join([f"k={k}:{s:.3f}" for k,s in unsup["silhouette"]]))
    md.append("\n## Juízo de adequação (heurístico)")
    md.append(f"- **Supervisionado**: {'OK' if ok_super else 'ATENÇÃO'} (n_train≥1000 e ≥10 features num.)")
    md.append(f"- **Não-supervisionado**: {'OK' if ok_unsup else 'ATENÇÃO'} (PCA ≤ 30 comps p/ 95% var.)")

    write_md(name, "relatorio.md", "\n".join(md))

    return {
        "name": name, "nrows": nrows, "nfeat_num": nfeat,
        "miss_median": miss_med, "sup": sup, "unsup": unsup, "temporal": tdiag
    }

# ------------------ Run for GOLD and SILVER2 ------------------
summaries = {}
for ver, path in PATHS.items():
    summaries[ver] = assess_version(ver, path)

print("\n==== RESUMO FINAL ====")
for ver, s in summaries.items():
    print(f"{ver}: linhas={s['nrows']}, feats_num={s['nfeat_num']}, miss_mediana={100*s['miss_median']:.2f}%")
    if s["sup"] is not None:
        print(f"  Superv (VAL) R2: " + ", ".join([f"{k}={v['R2']:.3f}" for k,v in s['sup']['val'].items()]))
        print(f"  Superv (TST) R2: " + ", ".join([f"{k}={v['R2']:.3f}" for k,v in s['sup']['test'].items()]))
    sils = ", ".join([f"k={k}:{sc:.2f}" for k,sc in s["unsup"]["silhouette"]])
    print(f"  Silhouette: {sils}")
    if s["temporal"].get("has_ts", False):
        print(f"  TS ok: passo≈1h {s['temporal']['pct_step_1h']}% | gaps>1h {s['temporal']['num_gaps_gt_1h']}")
print("\nSaídas gravadas em:", OUT_DIR)



==== GOLD ====
  - salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos\gold_missingness.csv
  - salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos\gold_constants.csv
  - salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos\gold_near_constants.csv
  - salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos\gold_high_corr_pairs.csv
  - salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos\gold_leakage_flags.csv
  - salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos\gold_relatorio.md

==== SILVER2 ====
  - salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos\silver2_missingness.csv
  - salvo: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\diagnosticos\silver2_constants.csv
  - salvo: C:\Users\wilso\MBA_EMPREEN
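The baseline above scores a single 70/10/20 chronological split. A backtest with an expanding training window, as suggested in the introduction, could be set up along the lines of the sketch below. It is a minimal illustration: `expanding_window_splits` and its parameters are hypothetical names, and `gap` stands for the forecast horizon H in rows (hours), so that rows closer than H to the test block stay out of training.

```python
def expanding_window_splits(n_rows, n_folds=3, test_size=None, gap=0):
    """Yield (train_range, test_range) pairs over rows sorted by time.

    Each fold trains on everything before its test block (minus `gap` rows)
    and tests on the next `test_size` rows, so no future row reaches training.
    """
    if test_size is None:
        test_size = n_rows // (n_folds + 1)
    if test_size < 1:
        raise ValueError("test_size must be at least 1")
    for fold in range(n_folds):
        test_start = n_rows - (n_folds - fold) * test_size
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("not enough rows for the requested folds")
        yield range(0, train_end), range(test_start, test_start + test_size)

# Example: 10 hourly rows, 3 folds of 2 rows each, horizon of 1 hour.
splits = list(expanding_window_splits(10, n_folds=3, test_size=2, gap=1))
for train_idx, test_idx in splits:
    print(list(train_idx), "->", list(test_idx))
```

The ranges can be passed to `DataFrame.iloc` after the frame has been sorted by timestamp; averaging the per-fold metrics gives a less split-dependent picture than one fixed test block.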

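One detail of the PCA diagnostic is worth checking by hand: `np.searchsorted(cumvar, 0.95) + 1` returns one more than the number of fitted components when the cumulative variance never reaches 95%. The sketch below shows that case and a variant that reports it explicitly; `n_components_for` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def n_components_for(cumvar, target=0.95):
    """Smallest number of components whose cumulative variance >= target.

    Returns None when the fitted components never reach the target.
    """
    cumvar = np.asarray(cumvar, dtype=float)
    if cumvar.size == 0 or cumvar[-1] < target:
        return None
    return int(np.searchsorted(cumvar, target) + 1)

reached     = [0.50, 0.80, 0.96]   # 95% reached at the 3rd component
not_reached = [0.40, 0.60, 0.70]   # 3 components fitted, 95% never reached

print(n_components_for(reached))
print(n_components_for(not_reached))
print(int(np.searchsorted(not_reached, 0.95) + 1))  # raw formula on the same data
```

Returning `None` (or any explicit sentinel) keeps "needs more than the fitted components" distinct from a genuine component count in the report.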
In [2]:
# ===================== DATA FREEZE FOR MODELING =====================
import os, shutil, json, hashlib, pandas as pd
from datetime import datetime

BASE = r"C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO"
CUR  = os.path.join(BASE, "data", "curated")
OUTS = os.path.join(BASE, "outputs", "diagnosticos")

# Version tag (e.g. v1_YYYYmmdd_HHMM)
TAG = "v1_" + datetime.now().strftime("%Y%m%d_%H%M")
FREEZE_DIR = os.path.join(BASE, "outputs", "freeze", TAG)
os.makedirs(FREEZE_DIR, exist_ok=True)

# Files to freeze (add more if needed)
FILES = [
    os.path.join(CUR, "A1_ML_DL_rotulado_v4_gold.csv"),
    os.path.join(CUR, "A1_ML_DL_features_v4_gold.csv"),
    os.path.join(CUR, "A1_ML_DL_rotulado_v4_silver2.csv"),
    os.path.join(CUR, "A1_ML_DL_features_v4_silver2.csv"),
    os.path.join(CUR, "A1_ML_DL_rotulado_v4.csv"),
    os.path.join(CUR, "A1_ML_DL_features_v4.csv"),
]

# Main diagnostics to copy (Markdown reports)
DIAG_MD = [
    os.path.join(OUTS, "gold_relatorio.md"),
    os.path.join(OUTS, "silver2_relatorio.md"),
]

def sha256_of(path, chunk=1<<20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            b = f.read(chunk)
            if not b: break
            h.update(b)
    return h.hexdigest()

# 1) Copy files (sources that do not exist are skipped)
copied = []
for p in FILES + DIAG_MD:
    if os.path.exists(p):
        dst = os.path.join(FREEZE_DIR, os.path.basename(p))
        shutil.copy2(p, dst)
        copied.append(dst)

# 2) Generate checksums
ck_path = os.path.join(FREEZE_DIR, "checksums_sha256.txt")
with open(ck_path, "w", encoding="utf-8") as f:
    for p in copied:
        f.write(f"{sha256_of(p)}  {os.path.basename(p)}\n")

# 3) Manifest with row/column counts for the main CSVs
manifest = {"tag": TAG, "generated_at": datetime.now().isoformat(), "files": []}
for p in copied:
    item = {"file": os.path.basename(p), "path": p}
    if p.lower().endswith(".csv"):
        try:
            df = pd.read_csv(p, header=[0,1], engine="python")
            item["rows"] = int(df.shape[0])
            item["cols"] = int(len(df.columns))
            # record column names and dimensions (first 10 columns as a sample)
            cols = [str(c[0]) for c in df.columns]
            dims = [str(c[1]) for c in df.columns]
            item["sample_columns"] = cols[:10]
            item["sample_dims"]    = dims[:10]
        except Exception as exc:
            # Keep the failure visible in the manifest instead of dropping it.
            item["read_error"] = repr(exc)
    manifest["files"].append(item)

with open(os.path.join(FREEZE_DIR, "manifest.json"), "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2, ensure_ascii=False)

print("✅ Freeze concluído em:", FREEZE_DIR)

# 4) (Optional) Create a ZIP of the folder
# import shutil
# shutil.make_archive(FREEZE_DIR, "zip", FREEZE_DIR)
# print("ZIP gerado em:", FREEZE_DIR + ".zip")
# ========================================================================== 


✅ Freeze concluído em: C:\Users\wilso\MBA_EMPREENDEDORISMO\3AGD\A1_LOCAL_REFAZIMENTO\outputs\freeze\v1_20250818_0635
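Before modeling from a frozen folder, the checksum file can be replayed to confirm nothing changed since the freeze. The sketch below assumes the layout written by the freeze cell (one `<sha256>  <filename>` line per file in `checksums_sha256.txt`); `verify_checksums` is a hypothetical helper name.

```python
import hashlib
import os

def sha256_of(path, chunk=1 << 20):
    """SHA-256 of a file, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify_checksums(freeze_dir, checksum_name="checksums_sha256.txt"):
    """Return the names of files that are missing or whose hash differs."""
    mismatches = []
    with open(os.path.join(freeze_dir, checksum_name), encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Two spaces separate the hash from the file name.
            expected, name = line.split("  ", 1)
            path = os.path.join(freeze_dir, name)
            if not os.path.exists(path) or sha256_of(path) != expected:
                mismatches.append(name)
    return mismatches

# Usage (FREEZE_DIR as defined in the freeze cell):
# bad = verify_checksums(FREEZE_DIR)
# print("all files intact" if not bad else f"changed or missing: {bad}")
```

An empty list means every frozen file still matches its recorded hash.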
