### 강수 기반 파생 특성

핵심 메커니즘: 강수 → 우수 유입 → 토사/유기물 동반 유입 → SS·TOC 급증

#### (1) 강수 변화율·집중도

- ΔRN_15m = RN_15m(t) − RN_15m(t−15m)

- ΔRN_60m, ΔRN_12H

- RN_15m / RN_60m (단기 집중도)

#### (2) 누적 강수 메모리 (Antecedent Rainfall)

- AR_3H = Σ RN_15m (최근 3시간)

- AR_6H, AR_12H, AR_24H

- log(1 + AR_x)

#### (3) 무강수 지속시간

- dry_duration = 마지막 강수 이후 경과 시간 (hour)

- rain_start / rain_end 플래그

- post_rain_6H 플래그(종료 후 잔류 효과)

#### API 지수(감쇠 누적)

- API(RN_15m, k, N) = Σ RN_15m(t−i)·exp(−k·i)

In [141]:
import numpy as np
import pandas as pd

In [142]:
def _add_rain_features(X, station_ids, rain_cols, rule, antecedent_hours, wet_thr_mm, eps, add_api, api_k, api_hours):
    """강수 관련 특성 생성"""
    new_cols = {}
    
    # Delta (변화율)
    for sid in station_ids:
        for rc in rain_cols[:3]:  # RN_15m, RN_60m, RN_12H만
            col = f"{rc}_{sid}"
            if col in X.columns:
                new_cols[f"d_{col}"] = X[col].diff()
    
    # Intensity ratio (단기 집중도)
    for sid in station_ids:
        rn15 = f"RN_15m_{sid}"
        rn60 = f"RN_60m_{sid}"
        if rn15 in X.columns and rn60 in X.columns:
            new_cols[f"RN_15m_div_RN_60m_{sid}"] = X[rn15] / (X[rn60] + eps)
    
    # Antecedent rainfall (누적 강수)
    steps_per_hour = int(pd.Timedelta("1h") / pd.Timedelta(rule))
    for sid in station_ids:
        col = f"RN_15m_{sid}"
        if col in X.columns:
            for h in antecedent_hours:
                win = h * steps_per_hour
                ar_col = X[col].rolling(win, min_periods=max(2, win // 10)).sum()
                new_cols[f"AR_{h}H_{sid}"] = ar_col
                new_cols[f"log1p_AR_{h}H_{sid}"] = np.log1p(ar_col)
    
    # Dry duration + first flush
    for sid in station_ids:
        col = f"RN_15m_{sid}"
        if col in X.columns:
            wet = (X[col].fillna(0) >= wet_thr_mm)
            last_wet_ts = pd.Series(X.index.where(wet), index=X.index).ffill()
            dry_timedelta = pd.to_timedelta(X.index - last_wet_ts)
            dry_min = dry_timedelta / pd.Timedelta(minutes=1)
            dry_hr  = dry_timedelta / pd.Timedelta(hours=1)

            new_cols[f"dry_duration_min_{sid}"] = pd.Series(dry_min, index=X.index).fillna(0.0)
            new_cols[f"dry_duration_hr_{sid}"]  = pd.Series(dry_hr,  index=X.index).fillna(0.0)
            new_cols[f"is_wet_{sid}"] = wet.astype(np.int8)
            
            # First flush 힌트
            rain_start = wet & (~wet.shift(1, fill_value=False))
            rain_end = (~wet) & (wet.shift(1, fill_value=False))
            new_cols[f"rain_start_{sid}"] = rain_start.astype(np.int8)
            new_cols[f"rain_end_{sid}"] = rain_end.astype(np.int8)
            
            # 종료 후 6시간 잔류
            post_win = 6 * steps_per_hour
            new_cols[f"post_rain_6H_{sid}"] = (
                pd.Series(rain_end.values, index=X.index)
                .rolling(post_win, min_periods=1).max()
                .fillna(0).astype(np.int8)
            )
    
    # API (감쇠 누적) - 선택적
    if add_api:
        api_steps = api_hours * steps_per_hour
        weights = np.exp(-api_k * np.arange(1, api_steps + 1, dtype=np.float32))
        for sid in station_ids:
            col = f"RN_15m_{sid}"
            if col in X.columns:
                rain = X[col].to_numpy(dtype=np.float32)
                api = np.full_like(rain, np.nan, dtype=np.float32)
                for t in range(len(rain)):
                    start = max(0, t - api_steps)
                    seg = rain[start:t]
                    if seg.size == 0:
                        api[t] = 0.0
                    else:
                        w = weights[-seg.size:]
                        api[t] = float(np.sum(seg[::-1] * w))
                new_cols[f"API_RN_15m_k{api_k}_H{api_hours}_{sid}"] = api
    
    # 한 번에 concat
    X = pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)
    
    return X

In [143]:
def _add_station_agg_rain_features(X, station_ids=("368","541","569")):
    rn15_cols = [f"RN_15m_{sid}" for sid in station_ids if f"RN_15m_{sid}" in X.columns]
    if rn15_cols:
        X["RN_15m_max_all"]  = X[rn15_cols].max(axis=1)
        X["RN_15m_mean_all"] = X[rn15_cols].mean(axis=1)

    ar12_cols = [f"AR_12H_{sid}" for sid in station_ids if f"AR_12H_{sid}" in X.columns]
    if ar12_cols:
        X["AR_12H_sum_all"]  = X[ar12_cols].sum(axis=1)
        X["AR_12H_mean_all"] = X[ar12_cols].mean(axis=1)

    return X

### 기상 결합 특성 (TA, HM, TD 활용)

핵심 메커니즘: 증발·응결·생물활성 → 유기물 상태 변화

#### (4) 열·습 복합지표

- HeatIndex (TA + HM 기반)

- VaporPressureDeficit (VPD ≈ f(TA, HM))

- DewPointDepression = TA − TD

#### (5) 기상 안정성

- rolling_std(TA, 3H / 6H)

- rolling_std(HM)

In [144]:
def _add_weather_features(X, station_ids, weather_cols, rule, eps):
    """기상 관련 특성 생성"""
    new_cols = {}
    
    for sid in station_ids:
        ta_col = f"TA_{sid}"
        td_col = f"TD_{sid}"
        hm_col = f"HM_{sid}"
        
        # Dew point depression
        if ta_col in X.columns and td_col in X.columns:
            new_cols[f"TA_minus_TD_{sid}"] = X[ta_col] - X[td_col]
        
        # VPD (Vapor Pressure Deficit)
        if ta_col in X.columns and hm_col in X.columns:
            T = X[ta_col]
            RH = X[hm_col].clip(0, 100)
            e_s = 0.6108 * np.exp((17.27 * T) / (T + 237.3))
            new_cols[f"VPD_kPa_{sid}"] = e_s * (1 - RH / 100.0)
    
    # 한 번에 concat
    X = pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)
    
    return X

### 공정 내부 변수 조합 (PH, FLUX, TN, TP)

핵심 메커니즘: “같은 TOC라도 상태가 다르다”

#### (6) 부하 관련 특성 (아주 중요)

- TOC_proxy_load = FLUX × PH

- SS_proxy_load = FLUX × (TN + TP)

#### (7) 영양염 비율

- TN/TP

- log(TN + TP)

- PH × TN, PH × TP

#### (8) 공정 상태 플래그

- PH_zone = {산성 / 중성 / 염기성}

- TN_high_flag (상위 20%)

- TP_spike_flag (z-score > 2)

In [145]:
def _add_process_features(X, process_cols, ph_thresholds, eps):
    """공정 관련 특성 생성"""
    new_cols = {}
    
    # Proxy loads
    if all(c in X.columns for c in ["FLUX_VU", "TN_VU", "TP_VU"]):
        new_cols["load_proxy_NP"] = X["FLUX_VU"] * (X["TN_VU"] + X["TP_VU"])
    
    if all(c in X.columns for c in ["PH_VU", "FLUX_VU"]):
        new_cols["PHxFLUX"] = X["PH_VU"] * X["FLUX_VU"]
    
    # TN/TP ratio
    if all(c in X.columns for c in ["TN_VU", "TP_VU"]):
        new_cols["TN_div_TP"] = X["TN_VU"] / (X["TP_VU"] + eps)
        new_cols["log1p_TN_TP"] = np.log1p(X["TN_VU"] + X["TP_VU"])
    
    # pH zone flags
    if "PH_VU" in X.columns:
        new_cols["pH_acid"] = (X["PH_VU"] < ph_thresholds[0]).astype(np.int8)
        new_cols["pH_neutral"] = ((X["PH_VU"] >= ph_thresholds[0]) & (X["PH_VU"] <= ph_thresholds[1])).astype(np.int8)
        new_cols["pH_basic"] = (X["PH_VU"] > ph_thresholds[1]).astype(np.int8)
    
    # 한 번에 concat
    X = pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)
    
    return X

### 시계열 메모리 특성 (TOC·SS 예측에 결정적)

핵심 메커니즘: TOC·SS는 관성(inertia)이 큼

#### (9) Lag & Rolling

- lag_10m / 30m / 1H: PH, FLUX, TN, TP

- rolling_mean / max / std (30m, 1H, 3H)

In [146]:
def _add_temporal_features(
    X, 
    station_ids, 
    weather_cols, 
    process_cols, 
    rule="5min",
    roll_windows=("30min", "1h", "3h"),  # 시간 단위로 변경
    lags=("10min", "30min", "1h"),       # 시간 단위로 변경
):
    new_cols = {}
    
    # 대상 컬럼
    roll_targets = list(process_cols) + [f"{wc}_{sid}" for sid in station_ids for wc in weather_cols]
    lag_targets = (
        list(process_cols) + 
        [f"{col}_{sid}" for sid in station_ids for col in ["RN_15m", "RN_60m", "RN_12H"] + list(weather_cols)]
    )
    
    # Rolling
    for w in roll_windows:
        win_steps = int(pd.Timedelta(w) / pd.Timedelta(rule))
        minp = max(2, int(win_steps * 0.1))
        for col in roll_targets:
            if col in X.columns:
                r = X[col].rolling(win_steps, min_periods=minp)
                new_cols[f"{col}_roll_mean_{w}"] = r.mean()
                new_cols[f"{col}_roll_std_{w}"] = r.std(ddof=0)
                new_cols[f"{col}_roll_max_{w}"] = r.max()
    
    # Lags
    for lag in lags:
        lag_steps = int(pd.Timedelta(lag) / pd.Timedelta(rule))
        for col in lag_targets:
            if col in X.columns:
                new_cols[f"{col}_lag_{lag}"] = X[col].shift(lag_steps)
    
    # Delta
    for col in lag_targets:
        if col in X.columns:
            new_cols[f"d_{col}"] = X[col].diff()
    
    # Abs delta
    if "FLUX_VU" in X.columns:
        new_cols["abs_d_FLUX_VU"] = X["FLUX_VU"].diff().abs()
    
    X = pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)
    return X

#### (10) 변화율

- ΔPH, ΔFLUX, ΔTN, ΔTP

- |ΔFLUX| (급변 여부)

### 상호작용 특성 (트리 계열에서 특히 강력)

- RN_15m × FLUX

- RN_60m × SS(t−1)

- (TN/TP) × PH

- dry_duration × RN_15m

In [147]:
def _add_interaction_features(X, station_ids):
    """상호작용 특성 생성"""
    new_cols = {}
    
    for sid in station_ids:
        rn15 = f"RN_15m_{sid}"
        rn60 = f"RN_60m_{sid}"
        dry_hr = f"dry_duration_hr_{sid}"
        
        if rn15 in X.columns and "FLUX_VU" in X.columns:
            new_cols[f"RN15xFLUX_{sid}"] = X[rn15] * X["FLUX_VU"]
        
        if rn60 in X.columns and "FLUX_VU" in X.columns:
            new_cols[f"RN60xFLUX_{sid}"] = X[rn60] * X["FLUX_VU"]
        
        if dry_hr in X.columns and rn15 in X.columns:
            new_cols[f"dryHr_x_RN15_{sid}"] = X[dry_hr] * X[rn15]
    
    # 한 번에 concat
    X = pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)
    
    return X

In [148]:
def _add_time_features(X):
    """시간 관련 특성 생성"""
    new_cols = {}
    
    new_cols["hour"] = X.index.hour.astype(np.int16)
    new_cols["dow"] = X.index.dayofweek.astype(np.int8)
    new_cols["is_weekend"] = (X.index.dayofweek >= 5).astype(np.int8)
    
    # Cyclical encoding
    hour_values = X.index.hour
    new_cols["hour_sin"] = np.sin(2 * np.pi * hour_values / 24.0)
    new_cols["hour_cos"] = np.cos(2 * np.pi * hour_values / 24.0)
    
    # 한 번에 concat
    X = pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)
    
    return X

In [149]:
def resample_5min(
    df,
    time_col = None,
    rule = "5min",
    sum_cols=("RN_15m_368", "RN_60m_368", "RN_12H_368", "RN_DAY_368", "RN_15m_541", "RN_60m_541", "RN_12H_541", "RN_DAY_541", "RN_15m_569", "RN_60m_569", "RN_12H_569", "RN_DAY_569", "FLUX_VU"),
    mean_cols=("TA_368", "HM_368", "TD_368", "TA_541", "HM_541", "TD_541", "TA_569", "HM_569", "TD_569", "PH_VU", "TN_VU", "TP_VU"),
    # 타깃(TOC/SS)도 같이 리샘플링해 학습셋 만들 때 쓰고 싶으면 여기에 넣어도 됨(기본 제외)
    extra_mean_cols=(),
    interp_limit: int = 12,  # 5분 기준 12칸=1시간까지 보간 허용
):
    """
    1분(or irregular) -> 5분 리샘플링.
    - rain_cols: sum
    - mean_cols: mean
    - 센서값은 time interpolation, 강수는 0 채움
    """
    x = df.copy()

    # datetime index 만들기
    if time_col is not None:
        x[time_col] = pd.to_datetime(x[time_col])
        x = x.set_index(time_col)
    if not isinstance(x.index, pd.DatetimeIndex):
        raise ValueError("df must have a DatetimeIndex or provide time_col.")
    x = x.sort_index()

    # 숫자화
    for c in set(sum_cols) | set(mean_cols) | set(extra_mean_cols):
        if c in x.columns:
            x[c] = pd.to_numeric(x[c], errors="coerce")

    # 집계 dict 만들기
    agg = {}
    for c in sum_cols:
        if c in x.columns:
            agg[c] = "sum"
    for c in list(mean_cols) + list(extra_mean_cols):
        if c in x.columns:
            agg[c] = "mean"

    if not agg:
        raise ValueError("No columns found to resample based on provided col lists.")

    # 리샘플
    out = x.resample(rule).agg(agg)

    # 결측 처리
    # 1) 강수: 없으면 0 (무강수)
    for c in sum_cols:
        if c in out.columns:
            out[c] = out[c].fillna(0.0)

    # 2) 센서/상태: 시간 보간(너무 긴 공백은 남김)
    for c in list(mean_cols) + list(extra_mean_cols):
        if c in out.columns:
            out[c] = out[c].interpolate(method="time", limit=interp_limit)

    return out

In [150]:
def make_modelA_features(
    df,
    time_col = None,
    do_resample = True,
    rule = "5min",
    station_ids=("368", "541", "569"),
    rain_cols=("RN_15m", "RN_60m", "RN_12H", "RN_DAY"),
    weather_cols=("TA", "HM", "TD"),
    process_cols=("PH_VU", "FLUX_VU", "TN_VU", "TP_VU"),
    roll_windows=("30min", "1h", "3h"),
    lags=("10min", "30min", "1h"),
    antecedent_hours=(3, 6, 12, 24),
    wet_thr_mm = 0.1,    # 5분 누적 기준
    ph_thresholds=(6.5, 8.5),
    spike_z=2.0,
    spike_window="24h",
    eps=1e-6,
    add_api=False,
    api_k=0.01,
    api_hours=24,
):
    """
    Model A(TOC + SS) 특성 엔지니어링.
    - Assumes 1-min time index OR a datetime column given by time_col.
    - Returns X (features only).
    """

    base_cols = []
    for sid in station_ids:
        base_cols.extend([f"{wc}_{sid}" for wc in weather_cols])
        base_cols.extend([f"{rc}_{sid}" for rc in rain_cols])
    base_cols.extend(process_cols)

    # 2. 리샘플링
    if do_resample:
        sum_cols = tuple(f"{rc}_{sid}" for sid in station_ids for rc in rain_cols) + ("FLUX_VU",)
        mean_cols = tuple(f"{wc}_{sid}" for sid in station_ids for wc in weather_cols) + ("PH_VU", "TN_VU", "TP_VU")
        df = resample_5min(df, time_col=time_col, rule=rule, sum_cols=sum_cols, mean_cols=mean_cols)
        time_col = None

    # 3. 인덱스 설정
    x = df.copy()
    if time_col is not None:
        x[time_col] = pd.to_datetime(x[time_col])
        x = x.set_index(time_col)
    if not isinstance(x.index, pd.DatetimeIndex):
        raise ValueError("df must have a DatetimeIndex or provide time_col.")
    x = x.sort_index()

    # 4. 컬럼 검증
    missing = [c for c in base_cols if c not in x.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    X = x[base_cols].copy()

    # 5. 강수 특성 생성
    X = _add_rain_features(X, station_ids, rain_cols, rule, antecedent_hours, wet_thr_mm, eps, add_api, api_k, api_hours)
    X = _add_station_agg_rain_features(x, station_ids)

    # 6. 기상 특성 생성
    X = _add_weather_features(X, station_ids, weather_cols, rule, eps)
    
    # 7. 공정 특성 생성
    X = _add_process_features(X, process_cols, ph_thresholds, eps)
    
    # 8. 시계열 특성 (rolling & lag)
    X = _add_temporal_features(X, station_ids, weather_cols, process_cols, rule, roll_windows, lags)
    
    # 9. 상호작용 특성
    X = _add_interaction_features(X, station_ids)
    
    # 10. 시간 특성
    X = _add_time_features(X)
    
    # 11. 정리
    X = X.replace([np.inf, -np.inf], np.nan)

    return X

In [151]:
tms = pd.read_csv("../../data/processed/TMS_cleaned.csv")
aws = pd.read_csv("../../data/processed/AWS_cleaned.csv")

tms['SYS_TIME'] = pd.to_datetime(tms['SYS_TIME'])
aws['SYS_TIME'] = pd.to_datetime(aws['SYS_TIME'])

tms = tms.sort_values('SYS_TIME')
aws = aws.sort_values('SYS_TIME')

df = pd.merge_asof(tms, aws, on='SYS_TIME', direction='backward', tolerance=pd.Timedelta('1min'))

In [152]:
X = make_modelA_features(df, time_col="SYS_TIME", do_resample=True, rule="5min")

y_toc = resample_5min(df, time_col="SYS_TIME", extra_mean_cols=("TOC_VU","SS_VU"))["TOC_VU"].shift(-1)
y_ss  = resample_5min(df, time_col="SYS_TIME", extra_mean_cols=("TOC_VU","SS_VU"))["SS_VU"].shift(-1)

data = X.join(pd.DataFrame({"y_toc": y_toc, "y_ss": y_ss})).dropna()

In [153]:
data

Unnamed: 0_level_0,RN_15m_368,RN_60m_368,RN_12H_368,RN_DAY_368,RN_15m_541,RN_60m_541,RN_12H_541,RN_DAY_541,RN_15m_569,RN_60m_569,...,RN60xFLUX_541,RN15xFLUX_569,RN60xFLUX_569,hour,dow,is_weekend,hour_sin,hour_cos,y_toc,y_ss
SYS_TIME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2024-08-26 16:05:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,16,0,0,-0.866025,-0.500000,4.06,0.80
2024-08-26 16:10:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,16,0,0,-0.866025,-0.500000,3.10,1.30
2024-08-26 16:15:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,16,0,0,-0.866025,-0.500000,3.10,1.56
2024-08-26 16:20:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,16,0,0,-0.866025,-0.500000,3.10,1.28
2024-08-26 16:25:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,16,0,0,-0.866025,-0.500000,3.10,1.38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-09-29 04:55:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4,0,0,0.866025,0.500000,3.76,0.60
2025-09-29 05:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5,0,0,0.965926,0.258819,3.60,0.60
2025-09-29 05:05:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5,0,0,0.965926,0.258819,3.60,0.60
2025-09-29 05:10:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5,0,0,0.965926,0.258819,3.60,0.60


In [154]:
data.to_csv("../../data/processed/modelA_dataset.csv", index=True)