# 🧠 TEMPLATE: FEATURE ENGINEERING NOTEBOOK

## Objective

Create, transform, and select features that improve model performance,
generalization, and interpretability, based on insights from the EDA.

Principles:
- Every feature must have an analytical justification
- Avoid leakage
- Be reproducible
- Make deployment easy

## 📚 1. Imports and Configuration

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    MinMaxScaler
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from typing import List
from src.config import DATA_PATH_PROCESSED, NUMERIC_COLS, CATEGORICAL_COLS, TARGET_COL, ID_COL

pd.set_option("display.max_columns", None)


In [None]:
df = pd.read_csv(DATA_PATH_PROCESSED)

df.head()

## 🧹 2. Separating Features and Target

In [None]:
X = df.drop(columns=[TARGET_COL, ID_COL], errors="ignore")
y = df[TARGET_COL]

## 🔍 3. Safety Validations (Leakage Check)

### 📌 Checklist

- Does the feature contain future information?
- Is the feature derived from the target?
- Does the feature exist only after the event?
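
A rough automated screen can complement this checklist. The sketch below flags numeric features whose absolute correlation with the target is suspiciously high. The function name `flag_leakage_candidates`, the 0.95 threshold, and the toy columns are assumptions made for illustration; a flagged column still needs manual review, and a leaky feature can also pass this screen.

```python
import numpy as np
import pandas as pd


def flag_leakage_candidates(X, y, threshold=0.95):
    """Return numeric columns whose |Pearson correlation| with y exceeds threshold."""
    numeric = X.select_dtypes(include="number")
    target = pd.Series(np.asarray(y, dtype=float), index=X.index)
    corr = numeric.corrwith(target).abs()
    return corr[corr > threshold].sort_values(ascending=False)


# Toy data (hypothetical): "refund_issued" is only recorded after the event,
# so here it is an exact copy of the target.
rng = np.random.default_rng(0)
toy_y = pd.Series(rng.integers(0, 2, size=200))
toy_X = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=200),
    "refund_issued": toy_y.to_numpy(),
})

suspects = flag_leakage_candidates(toy_X, toy_y)
print(suspects)
```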

## 🔧 4. Feature Engineering: Numeric Features

### 4.1 Missing Value Handling

In [None]:
from sklearn.impute import SimpleImputer

numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

### 📌 Checklist

- Is missingness informative?
- Should a missing-value flag be created?
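
If missingness turns out to be informative, one option is `SimpleImputer(add_indicator=True)`, which appends a 0/1 column for each feature that had missing values at fit time. The sketch below uses a made-up two-column frame; the column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "monthly_charges": [20.0, np.nan, 50.0, 80.0],
    "tenure": [1.0, 12.0, 24.0, 36.0],
})

# add_indicator=True appends one 0/1 column for every feature that
# contained missing values when the imputer was fitted.
imputer_with_flags = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer_with_flags.fit_transform(toy)

# Columns: monthly_charges (imputed), tenure, missing flag for monthly_charges
print(imputed)
```

Because the flag is produced inside the transformer, it is recreated identically at inference time, which fits the reproducibility principle above.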

### 4.2 Transformations

**Possible transformations:**
- Log
- Binning
- Normalization
- Standardization

In [None]:
# "feature_exemplo" is a placeholder column name; log1p(x) = log(1 + x) requires x > -1
df["log_feature"] = np.log1p(df["feature_exemplo"])
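
To keep the transformation reproducible at deploy time, the same log step can live inside a pipeline instead of being applied to `df` by hand. This is a minimal sketch using `FunctionTransformer`; the column name `total_charges` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

log_pipeline = Pipeline(
    steps=[
        # log1p compresses the right tail of skewed, non-negative data
        ("log", FunctionTransformer(np.log1p)),
        ("scaler", StandardScaler()),
    ]
)

toy = pd.DataFrame({"total_charges": [0.0, 10.0, 100.0, 1000.0]})
log_scaled = log_pipeline.fit_transform(toy)
print(log_scaled.ravel())
```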

### 4.3 Binning

In [None]:
df["tenure_bucket"] = pd.cut(
    df["tenure"],
    bins=[0, 12, 24, 48, 60, np.inf],
    labels=["0-12", "12-24", "24-48", "48-60", "60+"],
    include_lowest=True  # otherwise tenure == 0 falls outside (0, 12] and becomes NaN
)


## 🧩 5. Feature Engineering: Categorical Features

### 5.1 Cardinality Reduction

In [None]:
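def group_rare_categories(series, min_freq=0.05):
    """Replace categories whose relative frequency is below min_freq with "Other"."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < min_freq].index
    # Frequencies come from `series` itself, so compute them on training data only
    return series.where(~series.isin(rare), "Other")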
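
`group_rare_categories` derives the frequencies from the series it receives, so calling it separately on training and new data can produce different groupings. One way around this is to split the logic into a fit step and an apply step. The two helper names below are hypothetical, and the data is a toy example.

```python
import pandas as pd


def fit_frequent_categories(train_series, min_freq=0.05):
    """Learn which categories are frequent enough, using training data only."""
    freq = train_series.value_counts(normalize=True)
    return set(freq[freq >= min_freq].index)


def apply_frequent_categories(series, frequent, other_label="Other"):
    """Map every value outside `frequent` (unseen or missing too) to other_label."""
    return series.where(series.isin(frequent), other_label)


train = pd.Series(["A"] * 50 + ["B"] * 47 + ["C"] * 3)
test = pd.Series(["A", "C", "D"])  # "D" never appeared in training

frequent = fit_frequent_categories(train)
grouped = apply_frequent_categories(test, frequent).tolist()
print(grouped)  # ['A', 'Other', 'Other']
```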


### 5.2 Encoding

In [None]:
categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ]
)

### 📌 Checklist

- High cardinality?
- Is the encoding appropriate for the model?
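
When one-hot encoding would create too many columns, integer codes are one alternative that tree-based models often tolerate. The sketch below uses `OrdinalEncoder` with a reserved code for unseen categories; the `plan` column and its values are invented, and whether this encoding suits a given model remains a judgment call.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"plan": ["basic", "premium", "basic", "standard"]})
new = pd.DataFrame({"plan": ["premium", "enterprise"]})  # "enterprise" is unseen

# Unseen categories get the reserved code -1 instead of raising an error.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(train)

codes = ordinal.transform(new).ravel()
print(ordinal.categories_)
print(codes)
```

The codes follow the sorted category order, so they carry no real ranking; linear models would read them as ordered magnitudes.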

## 🔗 6. Derived / Composite Features

**Examples:**
- Count of active services
- Boolean flags
- Ratios between variables

In [None]:
df["num_active_services"] = (
    (df[["service1", "service2", "service3"]] == "Yes")
    .sum(axis=1)
)
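
The other two examples, ratios and boolean flags, can be sketched the same way. The column names and the 12-month cutoff below are placeholders; the point is guarding the ratio against a zero denominator.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_charges": [0.0, 240.0, 1200.0],
    "tenure": [0, 12, 24],
})

# Replace a zero denominator with NaN so the ratio becomes NaN instead of inf;
# a downstream imputer can then handle it like any other missing value.
toy["avg_monthly_charge"] = toy["total_charges"] / toy["tenure"].replace(0, np.nan)

# Boolean flag stored as 0/1
toy["is_new_customer"] = (toy["tenure"] < 12).astype(int)
print(toy)
```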

## 🔬 7. Initial Feature Selection

In [None]:
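from sklearn.feature_selection import VarianceThreshold

# VarianceThreshold needs numeric input, so restrict it to the numeric columns.
# The threshold applies to raw (unscaled) variance, so it depends on each column's units.
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X[NUMERIC_COLS])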


### 📌 Checklist

- Near-constant features?
- Redundant features?
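
For the redundancy question, a pairwise correlation scan is a common first pass. The helper name `find_redundant_features` and the 0.95 cutoff are assumptions, and the toy frame contains a deliberately rescaled copy of one column.

```python
import numpy as np
import pandas as pd


def find_redundant_features(X, threshold=0.95):
    """List numeric columns highly correlated with a column that comes before them."""
    corr = X.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


rng = np.random.default_rng(42)
base = rng.normal(size=100)
toy = pd.DataFrame({
    "monthly_charges": base,
    "monthly_charges_eur": base * 0.9,  # rescaled copy: redundant
    "tenure": rng.normal(size=100),
})

redundant = find_redundant_features(toy)
print(redundant)
```

Pearson correlation only captures linear relationships, so this scan can miss features that are redundant in a nonlinear way.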

## ⚙️ 8. Final ColumnTransformer

In [None]:
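# Columns not listed in NUMERIC_COLS or CATEGORICAL_COLS are discarded by remainder="drop",
# so engineered features must be present in X and added to those lists to be used.
preprocessor = ColumnTransformer(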
    transformers=[
        ("num", numeric_pipeline, NUMERIC_COLS),
        ("cat", categorical_pipeline, CATEGORICAL_COLS)
    ],
    remainder="drop"
)


## 🧪 9. Preprocessing Validation

In [None]:
# Fitting on the full X is only a smoke test; for modeling, fit on the training split only
X_transformed = preprocessor.fit_transform(X)

X_transformed.shape

## 💾 10. Saving for Production

In [None]:
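from pathlib import Path
from joblib import dump

# Create the output directory first; dump() fails if it does not exist
Path("models").mkdir(parents=True, exist_ok=True)
dump(preprocessor, "models/preprocessor.joblib")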
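
A quick round-trip check confirms that the saved artifact reproduces the fitted transformation. This sketch uses a stand-in `StandardScaler` and a temporary directory so it runs in isolation; the file name is illustrative.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0]])
fitted = StandardScaler().fit(data)

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = Path(tmp_dir) / "preprocessor.joblib"
    dump(fitted, artifact_path)
    restored = load(artifact_path)

# The restored object should reproduce the original transformation.
round_trip_ok = np.allclose(fitted.transform(data), restored.transform(data))
print(round_trip_ok)
```

Since joblib artifacts are pickles, they should be loaded with the same scikit-learn version that produced them, and only from trusted sources.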


## 🧠 11. Decision Log

**Decisions made:**
- Scaling: StandardScaler
- Encoding: OneHotEncoder
- Features created:
    - tenure_bucket
    - num_active_services

## 🚀 12. Next Steps

**Next steps:**
- Modeling
- Cross-validation
- Hyperparameter tuning
- Drift monitoring
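
As a bridge to the modeling and cross-validation steps, the sketch below wraps a preprocessor and an estimator in a single `Pipeline`, so the preprocessing is refit on each training fold and never sees the validation fold. The `LogisticRegression` classifier, the synthetic data, and the column names are assumptions chosen only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 120
toy_X = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=n).astype(float),
    "contract": rng.choice(["monthly", "yearly"], size=n),
})
toy_y = (toy_X["tenure"] < 24).astype(int)  # synthetic target, for illustration only

model = Pipeline(
    steps=[
        ("preprocessor", ColumnTransformer([
            ("num", StandardScaler(), ["tenure"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract"]),
        ])),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

# cross_val_score clones and refits the whole pipeline on each training fold.
scores = cross_val_score(model, toy_X, toy_y, cv=5)
print(scores.mean())
```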