# 📊 Time Series Model Validation - CVC Stores

## 🎯 Executive Objective
This notebook performs **robust validation (backtesting)** of multiple sales-forecasting algorithms for CVC stores. The process replays real past scenarios so that the chosen model is shown to perform consistently over time, not just on a single test period.

## 🛠️ Methodology: Walk-Forward Validation (Strict Mode)
Unlike a traditional single train/test split, we use a **walk-forward** (sliding window) strategy:
1.  The model trains on data up to a cutoff date (e.g., Dec/2024).
2.  It forecasts the following month (e.g., Jan/2025).
3.  The window advances one month: the model retrains with the actual Jan/2025 data and forecasts Feb/2025.
4.  This repeats for 12 months (folds), producing error metrics (RMSE, SMAPE) for each month.
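
The steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's `ModelTrainer`: the function names, the placeholder "repeat the last value" model, and the sales figures are all hypothetical, and the sketch assumes the training set simply grows by one month per fold (whether the real trainer also drops the oldest month is not shown here).

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def smape(actual, predicted):
    """Symmetric MAPE in percent; pairs where both values are zero contribute 0."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        total += 0.0 if denom == 0 else 2.0 * abs(a - p) / denom
    return 100.0 * total / len(actual)

def walk_forward(values, first_cutoff, n_folds, forecast_fn):
    """Return a list of (fold, prediction, actual) for one-step-ahead folds.

    values[:cutoff] is the training slice; values[cutoff] is the month to predict.
    """
    results = []
    for fold in range(n_folds):
        cutoff = first_cutoff + fold
        train = values[:cutoff]          # only data before the cutoff is visible
        prediction = forecast_fn(train)  # "retrain" and forecast the next month
        results.append((fold, prediction, values[cutoff]))
    return results

def naive_last_value(train):
    """Placeholder model (assumption): repeat the last observed value."""
    return train[-1]

# Hypothetical monthly sales, for illustration only.
sales = [100.0, 110.0, 120.0, 115.0, 130.0, 140.0]
folds = walk_forward(sales, first_cutoff=3, n_folds=3, forecast_fn=naive_last_value)

for fold, pred, actual in folds:
    print(f"fold={fold} pred={pred} actual={actual} "
          f"rmse={rmse([actual], [pred]):.2f} smape={smape([actual], [pred]):.2f}")
```

Reporting the metrics per fold, rather than only their average, is what makes it possible to see whether a model degrades in specific months.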

> **Note:** "Strict" mode ensures that **no future data** is visible to the model during training (no data leakage), faithfully simulating production.
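
One common source of leakage is preprocessing: if a scaler is fitted on the full history, statistics from the future influence the training data. The sketch below shows the leak-free pattern of fitting on the pre-cutoff slice and then transforming everything. It assumes a simple min-max scaler and uses hypothetical helper names and values; the internals of the notebook's `ProjectPipeline` are not shown here.

```python
def fit_min_max(train_values):
    """Learn scaling parameters from the training slice only."""
    lo, hi = min(train_values), max(train_values)
    return lo, (hi - lo) or 1.0  # fall back to 1.0 for a constant series

def transform(values, params):
    """Apply previously fitted parameters to any slice of the series."""
    lo, span = params
    return [(v - lo) / span for v in values]

# Hypothetical series: the first three points are "past", the last two are "future".
series = [10.0, 20.0, 30.0, 50.0, 90.0]
cutoff = 3

params = fit_min_max(series[:cutoff])  # the future never reaches the fit step
scaled = transform(series, params)     # the full series is transformed afterwards
print(scaled)
```

Scaled values above 1 after the cutoff are expected: they show that the scaler never saw those points when it was fitted.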

---

## 🤖 Model Strategy
The pipeline automatically evaluates two classes of algorithms through the **Darts** library:

### 1. Classic Machine Learning (Regressors)
* **Linear Regression:** Simple baseline that captures linear trends.
* **Random Forest:** Captures non-linearities and complex interactions.
* **LightGBM / XGBoost / CatBoost:** *Gradient boosting* models, widely regarded as state of the art for tabular data and for time series with covariates.
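
These regressors do not consume a time series directly: the series is first reshaped into a table in which each row holds the previous `lags` values and the target is the next value. The sketch below illustrates that idea in plain Python; `make_lagged_dataset` is a hypothetical helper, not a Darts function, and it ignores covariates and multi-step horizons.

```python
def make_lagged_dataset(values, n_lags):
    """Turn a series into (features, targets) rows using the previous n_lags values."""
    rows, targets = [], []
    for t in range(n_lags, len(values)):
        rows.append(values[t - n_lags:t])  # the n_lags values before time t
        targets.append(values[t])          # the value to predict at time t
    return rows, targets

# Hypothetical series, for illustration only.
series = [1, 2, 3, 4, 5, 6]
X, y = make_lagged_dataset(series, n_lags=3)

for features, target in zip(X, y):
    print(features, "->", target)
```

A series shorter than `n_lags + 1` yields no rows at all, which is one reason very short series have to be filtered out before training.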

### 2. Deep Learning (SOTA - State of the Art)
* **TFT (Temporal Fusion Transformer):** Attention-based model that learns the importance of each variable over time.
* **N-BEATS:** Neural network built from stacked blocks; its interpretable variant uses trend and seasonality blocks, while this notebook configures the generic architecture (`generic_architecture=True`).
* **Transformer:** Classic *attention* architecture adapted to time series.
* **BlockRNN (LSTM):** Recurrent network for capturing long-term dependencies.
* **TCN (Temporal Convolutional Network):** Causal convolutions that capture local and global patterns.
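
To make "causal convolution" concrete, the sketch below implements a single causal, optionally dilated, 1-D convolution in plain Python. It is only an illustration of the building block, not the Darts `TCNModel` implementation; the function name, kernel weights, and input values are hypothetical.

```python
def causal_conv1d(values, kernel, dilation=1):
    """Causal 1-D convolution.

    output[t] = sum_k kernel[k] * values[t - k * dilation]
    Positions before the start of the series are treated as zeros, so
    output[t] never depends on values after time t.
    """
    out = []
    for t in range(len(values)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * values[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
local = causal_conv1d(x, kernel=[0.5, 0.5], dilation=1)  # looks 1 step back
wider = causal_conv1d(x, kernel=[1.0, 1.0], dilation=2)  # looks 2 steps back
print(local)
print(wider)
```

Increasing the dilation widens how far back each output can see without adding weights, which is how stacked layers can cover both local and longer-range patterns.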

---

## 🏛️ Architecture and Governance (Databricks Unity Catalog)
This notebook implements a hybrid architecture to comply with Unity Catalog:

| Component | Storage Location | Purpose |
| :--- | :--- | :--- |
| **Experiments** | `Workspace/Users/...` | Stores metrics, plots, and run logs (avoids the UC path error). |
| **Model Registry** | **Unity Catalog** (`ds_dev.cvc_val`) | The final model (`.pkl`) is versioned and officially governed in the catalog. |
| **Signature** | **Enforced** | Every model has an input/output contract (`long` -> `double`) that is validated to prevent type errors at serving time. |
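
As a rough illustration of what an enforced `long` -> `double` contract protects against at serving time, the sketch below checks types before and after a prediction call. It uses plain Python only: `SimpleSignature` and `predict` are hypothetical names, the constant forecast is a placeholder, and this is not the MLflow signature API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimpleSignature:
    """Hypothetical stand-in for a model signature: one input type, one output type."""
    input_type: type
    output_type: type

    def _check(self, values, expected, label):
        for v in values:
            # Exact type match, so that e.g. bool is not accepted as int.
            if type(v) is not expected:
                raise TypeError(
                    f"{label}: expected {expected.__name__}, got {type(v).__name__}"
                )
        return values

    def check_inputs(self, values):
        return self._check(values, self.input_type, "input")

    def check_outputs(self, values):
        return self._check(values, self.output_type, "output")

# `long` is mapped to int and `double` to float in this sketch (assumption).
signature = SimpleSignature(input_type=int, output_type=float)

def predict(inputs):
    """Validate inputs, produce a placeholder forecast, validate outputs."""
    signature.check_inputs(inputs)
    forecast = [0.0 for _ in inputs]  # placeholder values, for illustration only
    return signature.check_outputs(forecast)

print(predict([1, 2, 3]))
```

Failing fast on a mistyped request is usually preferable to letting the error surface deep inside the model code.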

## 📥 Input Data
* **Target:** `bip_vhistorico_targuet_loja` (historical sales).
* **Future Covariates:** `bip_vhistorico_feriados_loja` (national/regional holiday calendar).
* **Global Covariates:** `bip_vhistorico_suporte_canal_loja` (macroeconomic indicators and campaigns).

In [0]:
# --- GLOBAL OPTIMIZATION SETTINGS (BEST PRACTICES) ---
# Enable automatic write optimization and compaction for Delta tables
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")


In [0]:
# --- IMPORTS (REFACTORED) ---
%load_ext autoreload
%autoreload 2

import sys
import pickle
import os
sys.path.append(os.getcwd())

from src.validation.config import Config
from src.validation.data import DataIngestion 
from src.validation.pipeline import ProjectPipeline
from src.validation.trainer import ModelTrainer
from darts import TimeSeries
from darts.dataprocessing.pipeline import Pipeline
from darts.dataprocessing.transformers import (
    Scaler,
    StaticCovariatesTransformer,
    MissingValuesFiller
)
from darts.utils.timeseries_generation import datetime_attribute_timeseries
from darts.models import (
    TFTModel,
    NBEATSModel,
    TransformerModel,
    LinearRegressionModel,
    LightGBMModel,
    XGBModel,
    CatBoostModel,
    RandomForest,
    BlockRNNModel,
    RNNModel,
    TCNModel
)
from darts.metrics import mape, mse, rmse, r2_score, smape
from pytorch_lightning.callbacks import EarlyStopping

# Standard libraries
import pandas as pd
import numpy as np
import mlflow

# Ingestion imports
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup
import pyspark.sql.functions as F

In [0]:
if spark is None:
    raise RuntimeError("Spark Session not available.")

config = Config()
config.spark_session = spark

In [0]:
# --- PIPELINE EXECUTION (OPTIMIZED WITH FEATURE STORE) ---
print(f"🚀 Starting Pipeline v{config.VERSION} (Walk-Forward Strict Mode)")

ingestion = DataIngestion(spark, config)

# 1. Unified fetch (Feature Store + Spark ETL)
df_spark_wide = ingestion.create_training_set()  # Returns a Spark DataFrame
df_support_global = ingestion.get_global_support()  # Returns a pandas DataFrame (small table)

# 2. Build the Darts objects (the toPandas conversion happens here)
raw_series, raw_covs = ingestion.build_darts_objects(df_spark_wide, df_support_global)

# --- From here on, the original training code is unchanged ---
# 3. TRAIN SPLIT
train_cutoff_date = pd.Timestamp(config.TRAIN_END_DATE) - pd.Timedelta(days=1)
print(f"✂️ Cutoff date for static training: {train_cutoff_date.date()}")

print("🛠️ Fitting Pipeline (Scalers)...")
project_pipeline = ProjectPipeline()

# Fit the scalers on the training slice only, so no post-cutoff data leaks in.
# Note: Darts' drop_after() also drops the split point itself.
train_for_fit = [s.drop_after(train_cutoff_date) for s in raw_series]
cov_for_fit = [s.drop_after(train_cutoff_date) for s in raw_covs]
project_pipeline.fit(train_for_fit, cov_for_fit)

print("🔄 Transforming ALL series...")
series_scaled_full, cov_scaled_full = project_pipeline.transform(raw_series, raw_covs)

# Save the pipeline
pipeline_path = f"{config.PATH_SCALERS}/project_pipeline_v{config.VERSION}.pkl"
with open(pipeline_path, 'wb') as f:
    pickle.dump(project_pipeline, f)
print(f"💾 Pipeline saved: {pipeline_path}")

# Filter out short series and build the final sets
print("🔍 Filtering short series...")
min_len = config.LAGS + config.FORECAST_HORIZON + 1
valid_indices = [i for i, ts in enumerate(train_for_fit) if len(ts) >= min_len]

train_series_static = [series_scaled_full[i].drop_after(train_cutoff_date) for i in valid_indices]
train_cov_static = [cov_scaled_full[i].drop_after(train_cutoff_date) for i in valid_indices]
full_series_valid = [series_scaled_full[i] for i in valid_indices]
full_cov_valid = [cov_scaled_full[i] for i in valid_indices]

print("🔄 Preparing original targets for validation...")
val_series_original = project_pipeline.inverse_transform(full_series_valid, partial=True)

# 4. MODEL CONFIGURATION
lag = config.LAGS
lag_covariantes = config.LAGS_FUTURE
forecast = config.FORECAST_HORIZON
lag_2 = lag + config.FORECAST_HORIZON  # Extended lag for deep learning models
dynamic_kernel = 3  # Safe kernel size
# Stop training once train_loss has not improved by at least 0.001 for 5 consecutive checks
EARLY_STOPPER = EarlyStopping(monitor="train_loss", patience=5, min_delta=0.001, mode='min')

models_dict = {
    # --- STATISTICAL / CLASSIC ML MODELS ---
    "LinearRegression": LinearRegressionModel(
        lags=lag,
        lags_future_covariates=lag_covariantes,
        output_chunk_length=forecast,
        multi_models=True
    ),
    "RandomForest": RandomForest(
        lags=lag,
        lags_future_covariates=lag_covariantes,
        output_chunk_length=forecast,
        multi_models=False,  # sklearn RF limitation
        random_state=42
    ),
    "LightGBM": LightGBMModel(
        lags=lag,
        lags_future_covariates=lag_covariantes,
        output_chunk_length=forecast,
        multi_models=True,
        random_state=42
    ),
    "XGBoost": XGBModel(
        lags=lag,
        lags_future_covariates=lag_covariantes,
        output_chunk_length=forecast,
        multi_models=True,
        random_state=42
    ),
    "CatBoost": CatBoostModel(
        lags=lag,
        lags_future_covariates=lag_covariantes,
        output_chunk_length=forecast,
        multi_models=True,
        random_state=42
    )
}

# --- DEEP LEARNING MODELS (added only if N_EPOCHS > 0) ---
if config.N_EPOCHS > 0:
    pl_trainer_kwargs = {"accelerator": "cpu", "callbacks": [EARLY_STOPPER]}
    models_dict.update({
        "TFT": TFTModel(
            input_chunk_length=lag_2,
            output_chunk_length=forecast,
            hidden_size=128,
            lstm_layers=2,
            num_attention_heads=4,
            dropout=0.2,
            batch_size=4,
            n_epochs=config.N_EPOCHS,
            add_relative_index=True,
            random_state=42,
            pl_trainer_kwargs=pl_trainer_kwargs
        ),
        "NBEATS": NBEATSModel(
            input_chunk_length=lag_2,
            output_chunk_length=forecast,
            generic_architecture=True,
            num_stacks=3,
            num_blocks=3,
            num_layers=4,
            layer_widths=256,
            batch_size=4,
            n_epochs=config.N_EPOCHS,
            random_state=42,
            pl_trainer_kwargs=pl_trainer_kwargs
        ),
        "Transformer": TransformerModel(
            input_chunk_length=lag_2,
            output_chunk_length=forecast,
            d_model=128,
            nhead=4,
            num_encoder_layers=3,
            num_decoder_layers=3,
            dim_feedforward=256,
            dropout=0.2,
            batch_size=4,
            n_epochs=config.N_EPOCHS,
            random_state=42,
            pl_trainer_kwargs=pl_trainer_kwargs
        ),
        "BlockRNN": BlockRNNModel(
            model='LSTM',
            input_chunk_length=lag_2,
            output_chunk_length=forecast,
            hidden_dim=128,
            n_rnn_layers=2,
            dropout=0.2,
            batch_size=4,
            n_epochs=config.N_EPOCHS,
            random_state=42,
            pl_trainer_kwargs=pl_trainer_kwargs
        ),
        "TCN": TCNModel(
            input_chunk_length=lag_2,
            output_chunk_length=forecast,
            kernel_size=dynamic_kernel,
            num_filters=lag_2,
            num_layers=None,
            dilation_base=2,
            dropout=0.2,
            batch_size=4,
            n_epochs=config.N_EPOCHS,
            random_state=42,
            pl_trainer_kwargs=pl_trainer_kwargs
        )
    })

trainer = ModelTrainer(config, models_dict)
trainer.train_evaluate_walkforward(
    train_series_static=train_series_static,
    train_covs_static=train_cov_static,
    full_series_scaled=full_series_valid,
    full_covariates_scaled=full_cov_valid,
    val_series_original=val_series_original,
    target_pipeline=project_pipeline
)
print("✅ Process finished.")