# 🤖 Predictive Modeling - E-commerce

This notebook implements predictive models to forecast e-commerce sales.

**Objectives:**
1. Predict **Revenue** using regression (Random Forest)
2. Forecast future sales using **time series** models (ARIMA/SARIMA and Prophet)

**Data:** Processed files from `data/processed/`


## 1. Setup and Data Loading


In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from glob import glob
import warnings
warnings.filterwarnings("ignore")

# Settings
%matplotlib inline
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

print("Libraries loaded successfully!")


Libraries loaded successfully!


In [2]:
# Load the processed data
processed_path = Path("../../data/processed/ecommerce_clean.parquet")

# Note: if the file is missing, df is never defined and every later cell fails
if not processed_path.exists():
    print("ERROR: File not found!")
    print("Run first: notebooks/01_exploracao/ecommerce_eda.ipynb")
else:
    df = pd.read_parquet(processed_path)
    print(f"Dataset loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print(f"Period: {df['Order Date'].min()} to {df['Order Date'].max()}")
    print("\nAvailable columns:")
    print(df.columns.tolist())


ERROR: File not found!
Run first: notebooks/01_exploracao/ecommerce_eda.ipynb


## 2. Feature Engineering for Modeling


In [3]:
# Rename columns for consistency
df = df.rename(columns={
    "Order Date": "OrderDate",
    "Quantity Ordered": "Quantity",
    "Price Each": "Price"
})

# Add temporal features if they do not exist yet
if "Year" not in df.columns:
    df["Year"] = df["OrderDate"].dt.year
    df["Month_num"] = df["OrderDate"].dt.month
    df["Day"] = df["OrderDate"].dt.day
    df["Hour"] = df["OrderDate"].dt.hour
    df["Dow"] = df["OrderDate"].dt.dayofweek  # 0=Monday

print("Temporal features created!")
print(df[["OrderDate", "Year", "Month_num", "Day", "Hour", "Dow", "Revenue"]].head())


NameError: name 'df' is not defined

---

# PART 1: REGRESSION (Random Forest)

**Objective:** Predict Revenue using features such as Product, City, State, Hour, etc.


## 3. Preparation: Monthly Aggregation by Product and City


In [None]:
# Aggregate by month × product × city to reduce noise
agg = (df
       .assign(YearMonth=df["OrderDate"].dt.to_period("M").dt.to_timestamp())
       .groupby(["YearMonth", "Product", "City", "State"], as_index=False)
       .agg(
           Quantity=("Quantity", "sum"),
           Revenue=("Revenue", "sum"),
           AvgPrice=("Price", "mean")
       )
)

# Add temporal features derived from YearMonth
agg["YM_year"] = agg["YearMonth"].dt.year
agg["YM_month"] = agg["YearMonth"].dt.month

print(f"Aggregated dataset: {agg.shape[0]:,} rows")
print(f"Period: {agg['YearMonth'].min()} to {agg['YearMonth'].max()}")
print("\nFirst rows:")
print(agg.head(10))


## 4. Temporal Split (Train/Test)


In [None]:
# Temporal split: 80% train, 20% test
cutoff = agg["YearMonth"].quantile(0.8)
print(f"Cutoff: {cutoff}")
print(f"Train: before {cutoff}")
print(f"Test: from {cutoff} onward")

train = agg[agg["YearMonth"] < cutoff].copy()
test = agg[agg["YearMonth"] >= cutoff].copy()

print(f"\nTrain: {train.shape[0]:,} rows")
print(f"Test: {test.shape[0]:,} rows")

# Prepare X and y
features_cols = ["Product", "City", "State", "YM_year", "YM_month", "AvgPrice"]
target = "Revenue"

X_train = train[features_cols]
y_train = train[target]
X_test = test[features_cols]
y_test = test[target]

print(f"\nX_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
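
Because `quantile` interpolates over the row-level timestamps, the cutoff computed above can fall between two month starts. One alternative is to split on distinct months instead. The following is a minimal sketch under that assumption; `split_by_month` and the demo frame are hypothetical names used only for illustration, not part of the notebook.

```python
import pandas as pd

def split_by_month(frame, date_col="YearMonth", train_frac=0.8):
    """Send the first `train_frac` of distinct months to train, the rest to test."""
    months = sorted(pd.Timestamp(m) for m in frame[date_col].unique())
    if len(months) < 2:
        raise ValueError("Need at least two distinct months to split.")
    # Keep at least one month on each side of the cutoff
    n_train = min(max(1, int(len(months) * train_frac)), len(months) - 1)
    cutoff = months[n_train]
    train = frame[frame[date_col] < cutoff].copy()
    test = frame[frame[date_col] >= cutoff].copy()
    return train, test, cutoff

# Illustrative data: 5 months with 2 rows each (not the notebook's dataset)
demo = pd.DataFrame({
    "YearMonth": pd.date_range("2019-01-01", periods=5, freq="MS").repeat(2),
    "Revenue": list(range(10)),
})
train_demo, test_demo, cutoff_demo = split_by_month(demo)
print(len(train_demo), len(test_demo), cutoff_demo.date())
```

With this variant the cutoff is always an actual month present in the data, which makes the printed train/test boundaries easier to read.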


## 5. Pipeline: Preprocessing + Random Forest


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Preprocessing: one-hot encode the categorical columns, pass numeric ones through
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Product", "City", "State"]),
    ("num", "passthrough", ["YM_year", "YM_month", "AvgPrice"])
])

# Random Forest model
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1,
    verbose=0
)

# Full pipeline
pipe = Pipeline([
    ("prep", preprocessor),
    ("model", rf)
])

print("Pipeline created!")
print("Training the Random Forest model...")


In [None]:
# Train the model
import time
start = time.time()
pipe.fit(X_train, y_train)
elapsed = time.time() - start

print(f"✅ Model trained in {elapsed:.2f} seconds!")


## 6. Evaluating the Regression Model


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predictions
pred_train = pipe.predict(X_train)
pred_test = pipe.predict(X_test)

# RMSE is computed with np.sqrt because the `squared` argument of
# mean_squared_error was deprecated in scikit-learn 1.4 and later removed

# Training metrics
mae_train = mean_absolute_error(y_train, pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))
r2_train = r2_score(y_train, pred_train)

# Test metrics
mae_test = mean_absolute_error(y_test, pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))
r2_test = r2_score(y_test, pred_test)

print("="*70)
print("RESULTS OF THE REGRESSION MODEL (Random Forest)")
print("="*70)
print("\n📊 TRAIN:")
print(f"  MAE:  ${mae_train:,.2f}")
print(f"  RMSE: ${rmse_train:,.2f}")
print(f"  R²:   {r2_train:.3f}")

print("\n📊 TEST:")
print(f"  MAE:  ${mae_test:,.2f}")
print(f"  RMSE: ${rmse_test:,.2f}")
print(f"  R²:   {r2_test:.3f}")
print("="*70)


In [None]:
# Plot predictions against actual values
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Scatter plot
axes[0].scatter(y_test, pred_test, alpha=0.5, s=50)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Revenue ($)', fontsize=12)
axes[0].set_ylabel('Predicted Revenue ($)', fontsize=12)
axes[0].set_title('Predicted vs Actual (Test)', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals
residuals = y_test - pred_test
axes[1].scatter(pred_test, residuals, alpha=0.5, s=50)
axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1].set_xlabel('Predicted Revenue ($)', fontsize=12)
axes[1].set_ylabel('Residuals ($)', fontsize=12)
axes[1].set_title('Residual Analysis', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"MAPE (Mean Absolute Percentage Error): {np.mean(np.abs(residuals / y_test)) * 100:.2f}%")
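
The MAPE expression above divides by the actual values, so a single zero-revenue row would produce `inf`. A small helper can make that case explicit. This is only a sketch: `mape` is a hypothetical name, and skipping zero targets is an assumption about how such rows should be handled, not something the notebook specifies.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error (in %), ignoring rows where y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # assumption: zero targets are skipped, not imputed
    if not mask.any():
        return float("nan")
    pct_errors = np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])
    return float(pct_errors.mean() * 100)

# Illustrative values: the third row has a zero target and is skipped
example_mape = mape([100, 200, 0], [110, 180, 5])
print(f"{example_mape:.2f}%")
```

The same helper could be reused for the SARIMA and Prophet metrics so that all three models are scored identically.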


## 7. Future Forecast (Next 3 Months)


In [None]:
# Build the forecast grid for the next 3 months
last_date = agg["YearMonth"].max()
future_months = pd.period_range(last_date + pd.DateOffset(months=1), periods=3, freq="M").to_timestamp()

print(f"Last date in the data: {last_date}")
print(f"Forecasting for: {future_months.tolist()}")

# Grid: every combination of Product × City × State × future month
grid = (agg[["Product", "City", "State"]].drop_duplicates()
        .assign(key=1)
        .merge(pd.DataFrame({"YearMonth": future_months, "key": [1]*len(future_months)}), on="key")
        .drop("key", axis=1))

grid["YM_year"] = grid["YearMonth"].dt.year
grid["YM_month"] = grid["YearMonth"].dt.month

# Use the last known average price per group as a proxy
last_price = agg.groupby(["Product", "City", "State"])["AvgPrice"].last()
grid = grid.merge(last_price.rename("AvgPrice"), on=["Product", "City", "State"], how="left")

print(f"\nForecast grid: {grid.shape[0]:,} combinations")


In [None]:
# Make the forecasts
future_pred = pipe.predict(grid[features_cols])
grid["PredRevenue"] = future_pred

# Top 10 forecasts per month
print("\n" + "="*70)
print("TOP 10 REVENUE FORECASTS PER FUTURE MONTH")
print("="*70)

for month in future_months:
    month_data = grid[grid["YearMonth"] == month].nlargest(10, "PredRevenue")
    print(f"\n📅 {month.strftime('%Y-%m')}:")
    print(month_data[["Product", "City", "State", "PredRevenue"]].to_string(index=False))
    # This total covers only the 10 rows listed above, not the whole month
    print(f"   Forecast total (top 10): ${month_data['PredRevenue'].sum():,.2f}")


In [None]:
# Total forecast revenue per month
monthly_forecast = grid.groupby("YearMonth")["PredRevenue"].sum()

plt.figure(figsize=(12, 6))
plt.bar(range(len(monthly_forecast)), monthly_forecast.values, color="steelblue", alpha=0.7)
plt.xticks(range(len(monthly_forecast)), [d.strftime("%Y-%m") for d in monthly_forecast.index], rotation=0)
plt.title("Total Revenue Forecast - Next 3 Months (Random Forest)", 
          fontsize=14, fontweight="bold")
plt.xlabel("Month")
plt.ylabel("Forecast Revenue ($)")
plt.grid(True, alpha=0.3, axis='y')

for i, v in enumerate(monthly_forecast.values):
    plt.text(i, v, f'${v:,.0f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()


---

# PART 2: TIME SERIES

**Objective:** Forecast total monthly revenue using ARIMA/SARIMA and Prophet


## 8. Preparation: Monthly Time Series


In [None]:
# Build the total monthly time series
monthly = (df.assign(YearMonth=df["OrderDate"].dt.to_period("M").dt.to_timestamp())
             .groupby("YearMonth", as_index=True)["Revenue"].sum()
             .sort_index())

print(f"Monthly time series: {len(monthly)} months")
print(f"Period: {monthly.index.min()} to {monthly.index.max()}")
print("\nData:")
print(monthly)


In [None]:
# Plot the time series
plt.figure(figsize=(14, 6))
plt.plot(monthly.index, monthly.values, marker='o', linewidth=2, markersize=8, color='darkblue')
plt.title("Total Monthly Revenue - Historical Series", fontsize=14, fontweight='bold')
plt.xlabel("Month")
plt.ylabel("Revenue ($)")
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


In [None]:
# Temporal split for the time series
cutoff_ts = monthly.index[int(len(monthly) * 0.8)]
y_train_ts = monthly[monthly.index < cutoff_ts]
y_test_ts = monthly[monthly.index >= cutoff_ts]

print(f"Train: {len(y_train_ts)} months (before {cutoff_ts})")
print(f"Test: {len(y_test_ts)} months (from {cutoff_ts} onward)")
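
Before fitting SARIMA or Prophet on this split, a naive baseline can give the later error metrics some context. The sketch below repeats the last training value over the test horizon; `naive_forecast` and the demo series are hypothetical and use made-up numbers, not the notebook's data.

```python
import numpy as np
import pandas as pd

def naive_forecast(train, horizon):
    """Repeat the last observed training value for every forecast step."""
    return np.repeat(float(train.iloc[-1]), horizon)

# Illustrative monthly series (made-up numbers)
demo_series = pd.Series(
    [100.0, 120.0, 90.0, 110.0, 130.0],
    index=pd.date_range("2019-01-01", periods=5, freq="MS"),
)
demo_train, demo_test = demo_series.iloc[:4], demo_series.iloc[4:]

baseline = naive_forecast(demo_train, len(demo_test))
baseline_mae = float(np.mean(np.abs(demo_test.values - baseline)))
print(baseline, baseline_mae)
```

A model whose test MAE is not clearly below this kind of baseline is probably not adding much on a series this short.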


## 9. SARIMA Model


In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
import itertools

print("Searching for the best SARIMA model...")
print("(This may take a few minutes...)\n")

# Simplified grid search: p and q in {0, 1}, d fixed at 1
p = q = range(0, 2)
pdq = list(itertools.product(p, [1], q))
# Seasonal differencing with period 12 consumes the first 12 observations,
# so the training series should span well over one year
seasonal_pdq = [(0, 1, 0, 12)]  # Simplified to speed things up

best = None
best_aic = float("inf")

for order in pdq:
    for sorder in seasonal_pdq:
        try:
            model = SARIMAX(y_train_ts, 
                          order=order, 
                          seasonal_order=sorder,
                          enforce_stationarity=False,
                          enforce_invertibility=False)
            result = model.fit(disp=False, maxiter=200)

            if result.aic < best_aic:
                best_aic = result.aic
                best = (order, sorder)
                print(f"✓ New best: {order}×{sorder} | AIC={best_aic:.2f}")
        except Exception:
            # Skip configurations that fail to fit
            continue

if best is None:
    raise RuntimeError("No SARIMA configuration could be fitted.")

print(f"\n✅ Best SARIMA model: {best[0]} × {best[1]}")
print(f"   AIC: {best_aic:.2f}")


In [None]:
# Fit the final model and forecast
model_sarima = SARIMAX(y_train_ts,
                      order=best[0],
                      seasonal_order=best[1],
                      enforce_stationarity=False,
                      enforce_invertibility=False)
result_sarima = model_sarima.fit(disp=False, maxiter=200)

# Forecast over the test set
pred_sarima_test = result_sarima.get_forecast(steps=len(y_test_ts)).predicted_mean

# Metrics (np.sqrt replaces the removed `squared=False` argument)
mae_sarima = mean_absolute_error(y_test_ts, pred_sarima_test)
rmse_sarima = np.sqrt(mean_squared_error(y_test_ts, pred_sarima_test))
# Compare by position so a mismatched forecast index cannot introduce NaNs
mape_sarima = np.mean(np.abs((y_test_ts.values - pred_sarima_test.values) / y_test_ts.values)) * 100

print("="*70)
print("SARIMA RESULTS")
print("="*70)
print(f"MAE:  ${mae_sarima:,.2f}")
print(f"RMSE: ${rmse_sarima:,.2f}")
print(f"MAPE: {mape_sarima:.2f}%")
print("="*70)


In [None]:
# Plot the SARIMA forecasts
plt.figure(figsize=(14, 6))
plt.plot(y_train_ts.index, y_train_ts.values, label='Train', marker='o', linewidth=2)
plt.plot(y_test_ts.index, y_test_ts.values, label='Test (Actual)', marker='o', linewidth=2, color='green')
plt.plot(y_test_ts.index, pred_sarima_test.values, label='SARIMA (Forecast)', 
         marker='s', linewidth=2, linestyle='--', color='red')
plt.title("SARIMA: Forecast vs Actual", fontsize=14, fontweight='bold')
plt.xlabel("Month")
plt.ylabel("Revenue ($)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


## 10. Future Forecast with SARIMA (6 Months)


In [None]:
# Refit on all the data and forecast 6 months ahead
model_full = SARIMAX(monthly,
                    order=best[0],
                    seasonal_order=best[1],
                    enforce_stationarity=False,
                    enforce_invertibility=False)
result_full = model_full.fit(disp=False, maxiter=200)

# Future forecast
n_future = 6
forecast = result_full.get_forecast(steps=n_future)
pred_future = forecast.predicted_mean
conf_int = forecast.conf_int(alpha=0.2)  # 80% CI

print("="*70)
print("SARIMA FORECAST - NEXT 6 MONTHS")
print("="*70)
for i, (date, value) in enumerate(pred_future.items()):
    lower = conf_int.iloc[i, 0]
    upper = conf_int.iloc[i, 1]
    print(f"{date.strftime('%Y-%m')}: ${value:,.2f}  (CI: ${lower:,.2f} - ${upper:,.2f})")
print("="*70)


In [None]:
# Plot the future forecast with its confidence interval
plt.figure(figsize=(16, 6))

# History
plt.plot(monthly.index, monthly.values, label='History', marker='o', linewidth=2, color='blue')

# Forecast
future_dates = pred_future.index
plt.plot(future_dates, pred_future.values, label='SARIMA Forecast', 
         marker='s', linewidth=2, linestyle='--', color='red', markersize=8)

# Confidence interval
plt.fill_between(future_dates, conf_int.iloc[:, 0], conf_int.iloc[:, 1], 
                 alpha=0.3, color='red', label='80% CI')

plt.title("SARIMA: Revenue Forecast - Next 6 Months", fontsize=14, fontweight='bold')
plt.xlabel("Month")
plt.ylabel("Revenue ($)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


## 11. Prophet Model


In [None]:
try:
    from prophet import Prophet

    # Prepare the data for Prophet (it requires 'ds' and 'y' columns)
    df_prophet = monthly.reset_index().rename(columns={"YearMonth": "ds", "Revenue": "y"})

    # Train/test split using the same cutoff as SARIMA
    df_prophet_train = df_prophet[df_prophet["ds"] < cutoff_ts]
    df_prophet_test = df_prophet[df_prophet["ds"] >= cutoff_ts]

    print("Training the Prophet model...")
    model_prophet = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=False,
        daily_seasonality=False,
        seasonality_mode='multiplicative'
    )
    model_prophet.fit(df_prophet_train)

    # Forecast over the test set
    future_test = model_prophet.make_future_dataframe(periods=len(df_prophet_test), freq='MS')
    forecast_test = model_prophet.predict(future_test)

    # Keep only the forecasts for the test period
    pred_prophet_test = forecast_test.tail(len(df_prophet_test))["yhat"].values

    # Metrics (np.sqrt replaces the removed `squared=False` argument)
    mae_prophet = mean_absolute_error(y_test_ts, pred_prophet_test)
    rmse_prophet = np.sqrt(mean_squared_error(y_test_ts, pred_prophet_test))
    mape_prophet = np.mean(np.abs((y_test_ts.values - pred_prophet_test) / y_test_ts.values)) * 100

    print("="*70)
    print("PROPHET RESULTS")
    print("="*70)
    print(f"MAE:  ${mae_prophet:,.2f}")
    print(f"RMSE: ${rmse_prophet:,.2f}")
    print(f"MAPE: {mape_prophet:.2f}%")
    print("="*70)

    prophet_available = True

except ImportError:
    print("⚠️ Prophet is not installed!")
    print("To install: pip install prophet")
    prophet_available = False


In [None]:
if prophet_available:
    # Future forecast (6 months)
    model_prophet_full = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=False,
        daily_seasonality=False,
        seasonality_mode='multiplicative'
    )
    model_prophet_full.fit(df_prophet)

    future_prophet = model_prophet_full.make_future_dataframe(periods=6, freq='MS')
    forecast_prophet = model_prophet_full.predict(future_prophet)

    # Show the future forecasts
    print("="*70)
    print("PROPHET FORECAST - NEXT 6 MONTHS")
    print("="*70)
    future_only = forecast_prophet.tail(6)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
    for _, row in future_only.iterrows():
        print(f"{row['ds'].strftime('%Y-%m')}: ${row['yhat']:,.2f}  " + 
              f"(interval: ${row['yhat_lower']:,.2f} - ${row['yhat_upper']:,.2f})")
    print("="*70)


In [None]:
if prophet_available:
    # Plot the components
    fig = model_prophet_full.plot_components(forecast_prophet)
    plt.tight_layout()
    plt.show()


## 12. Final Model Comparison


In [None]:
# Compare the models on the test set
print("="*70)
print("MODEL COMPARISON - TEST SET")
print("="*70)

# Note: Random Forest errors are per product × city × month row, while
# SARIMA and Prophet errors are on monthly totals, so MAE and RMSE are on
# different scales; MAPE is the most comparable column.
rf_mape = np.mean(np.abs((y_test - pred_test) / y_test)) * 100
rows = [
    ("Random Forest (Aggregated)", mae_test, rmse_test, rf_mape),
    ("SARIMA", mae_sarima, rmse_sarima, mape_sarima),
]
# Append Prophet only when it ran, so every column has the same length
if prophet_available:
    rows.append(("Prophet", mae_prophet, rmse_prophet, mape_prophet))

comparison = pd.DataFrame({
    "Model": [name for name, _, _, _ in rows],
    "MAE": [f"${mae:,.2f}" for _, mae, _, _ in rows],
    "RMSE": [f"${rmse:,.2f}" for _, _, rmse, _ in rows],
    "MAPE": [f"{mape:.2f}%" for _, _, _, mape in rows],
})

print(comparison.to_string(index=False))
print("="*70)


## 13. Conclusions and Recommendations

### 📊 Summary of Results

**1. Random Forest (Regression)**
- ✅ Captures complex relationships between products, cities, and time
- ✅ Allows granular forecasts (per product × city)
- ✅ Useful for understanding revenue drivers
- ⚠️ Requires more features and may not capture long-term trends (tree ensembles do not extrapolate beyond the training range)

**2. SARIMA**
- ✅ Models seasonality and trend directly through differencing and seasonal terms
- ✅ Statistical confidence intervals
- ✅ Good for short- and medium-term forecasts
- ⚠️ Assumes linear patterns

**3. Prophet**
- ✅ Automatically detects multiple seasonalities
- ✅ Robust to missing data
- ✅ Easy to interpret (separate components)
- ⚠️ Can be optimistic on long-horizon forecasts

### 🎯 Recommendations

1. **For total monthly forecasts:** Use **SARIMA** or **Prophet**
2. **For forecasts per product/city:** Use **Random Forest**
3. **Best practice:** Combine both (ensemble)
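
One simple way to combine forecasts is a weighted average in which each model's weight is inversely proportional to its test MAE. The sketch below illustrates that idea only; `inverse_error_weights`, `blend`, and all numbers are hypothetical, and the notebook does not prescribe any particular ensembling scheme.

```python
import numpy as np

def inverse_error_weights(errors):
    """Weights proportional to 1/error, normalized to sum to 1."""
    inv = 1.0 / np.asarray(errors, dtype=float)
    return inv / inv.sum()

def blend(forecasts, weights):
    """Weighted average across models; `forecasts` has shape (n_models, n_steps)."""
    return np.average(np.asarray(forecasts, dtype=float), axis=0, weights=weights)

# Made-up test MAEs and two-month forecasts for two models
weights = inverse_error_weights([100.0, 300.0])
blended = blend([[1000.0, 2000.0], [2000.0, 4000.0]], weights)
print(weights, blended)
```

Both models must forecast the same quantity (for example, total monthly revenue) for the blend to be meaningful, so Random Forest predictions would first need to be summed per month.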

### 📁 Next Steps

1. Experiment with **XGBoost** or **LightGBM**
2. Build an **ensemble** of the models
3. Add external features (holidays, promotions)
4. Implement **automatic retraining**
5. Build a **monitoring dashboard**
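
As an illustration of the external-features step, a per-month holiday count could be derived and joined onto the monthly data by `YearMonth`. This is a rough sketch: `holidays_per_month` is a hypothetical helper, and the dates below are placeholders rather than a real holiday calendar.

```python
import pandas as pd

def holidays_per_month(month_starts, holiday_dates):
    """Count how many of the given holiday dates fall in each month."""
    holiday_keys = pd.Series(pd.to_datetime(holiday_dates)).dt.strftime("%Y-%m")
    counts = holiday_keys.value_counts()
    month_keys = pd.Series(pd.to_datetime(month_starts)).dt.strftime("%Y-%m")
    # Months without any holiday are missing from `counts`, hence fillna(0)
    return month_keys.map(counts).fillna(0).astype(int).tolist()

# Placeholder dates for illustration only
example_holidays = ["2019-11-28", "2019-12-24", "2019-12-25"]
example_months = pd.date_range("2019-11-01", periods=4, freq="MS")
holiday_counts = holidays_per_month(example_months, example_holidays)
print(holiday_counts)
```

The resulting column could be added to the Random Forest features or passed to a time series model as an exogenous regressor, provided the calendar is also known for the forecast horizon.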
