# &#128640; Growth Equestre | Path #1 - Lead Scoring with Two Models

**Objective:** train **2 candidate models** for lead-qualification propensity, apply **GridSearchCV + fine tuning**, and define a **professional-grade tie-break criterion**.

## &#127919; Deliverables of this notebook
- Champion model (`best_model`)
- Runner-up model (`runner_up_model`)
- Comparison report with metrics
- Tie-break criterion documented for auditing
- Artifacts exported to `data/ml/artifacts/`


## &#129517; Modeling Strategy

### Candidate models
1. **Logistic Regression** (explainable baseline)
2. **Random Forest** (non-linear, robust to feature interactions)

### Primary metric
- **ROC-AUC (validation)**

### Tie-break criterion
If the ROC-AUC gap between the two best models is <= `0.005`, break the tie by:
1. higher **PR-AUC**
2. lower **Brier Score**
3. lower **inference latency**


In [2]:
import importlib.util
import subprocess
import sys

pkg_map = {
    "pandas": "pandas",
    "numpy": "numpy",
    "sklearn": "scikit-learn",   # import name -> pip package name
    "joblib": "joblib",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "jinja2": "jinja2",
}

missing = [pip_name for mod_name, pip_name in pkg_map.items()
           if importlib.util.find_spec(mod_name) is None]

if missing:
    print("Instalando:", ", ".join(missing))
    # Upgrade pip only when something actually has to be installed.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "pip"])
    subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])

print("OK")


Instalando: scikit-learn, joblib, matplotlib, seaborn
OK


## &#128451;&#65039; Dataset Extraction (Postgres -> CSV)

This cell builds a tabular training dataset with profile features plus funnel-event features.


In [3]:
%%bash
set -euo pipefail

mkdir -p data/ml

QUERY=$(cat <<'SQL'
\copy (
  WITH event_agg AS (
    SELECT
      lead_id,
      COUNT(*)::int AS n_events,
      COUNT(*) FILTER (WHERE event_type = 'page_view')::int AS n_page_view,
      COUNT(*) FILTER (WHERE event_type = 'hook_complete')::int AS n_hook_complete,
      COUNT(*) FILTER (WHERE event_type IN ('cta_click', 'whatsapp_click'))::int AS n_cta_click,
      EXTRACT(EPOCH FROM (now() - MAX(ts))) / 3600.0 AS recency_last_event_hours
    FROM events
    GROUP BY lead_id
  )
  SELECT
    l.id AS lead_id,
    COALESCE(l.uf, '') AS uf,
    COALESCE(l.cidade, '') AS cidade,
    COALESCE(l.segmento_interesse, '') AS segmento_interesse,
    COALESCE(l.orcamento_faixa, '') AS orcamento_faixa,
    COALESCE(l.prazo_compra, '') AS prazo_compra,
    COALESCE(l.status, 'CURIOSO') AS status,
    COALESCE(e.n_events, 0) AS n_events,
    COALESCE(e.n_page_view, 0) AS n_page_view,
    COALESCE(e.n_hook_complete, 0) AS n_hook_complete,
    COALESCE(e.n_cta_click, 0) AS n_cta_click,
    COALESCE(e.recency_last_event_hours, 9999) AS recency_last_event_hours,
    CASE
      WHEN UPPER(COALESCE(l.status, '')) IN ('QUALIFICADO', 'ENVIADO') THEN 1
      ELSE 0
    END AS label_qualified
  FROM leads l
  LEFT JOIN event_agg e ON e.lead_id = l.id
) TO STDOUT WITH CSV HEADER
SQL
)

docker compose exec -T db psql -U app -d appdb -c "$QUERY" > data/ml/lead_scoring_dataset.csv

echo "Dataset written to: data/ml/lead_scoring_dataset.csv"
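
The `event_agg` CTE above collapses the event log into one row per lead. As a rough illustration of the same aggregation outside the database, here is a pure-Python sketch over a few made-up events. The `toy_events` rows, the fixed `now` timestamp, and the `aggregate_events` helper are assumptions for illustration only and are not part of the project.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical toy events: (lead_id, event_type, timestamp).
toy_events = [
    ("lead-a", "page_view", datetime(2026, 2, 1, 10, 0, tzinfo=timezone.utc)),
    ("lead-a", "hook_complete", datetime(2026, 2, 1, 11, 0, tzinfo=timezone.utc)),
    ("lead-a", "whatsapp_click", datetime(2026, 2, 1, 12, 0, tzinfo=timezone.utc)),
    ("lead-b", "page_view", datetime(2026, 1, 31, 12, 0, tzinfo=timezone.utc)),
]

# Both event types count as a CTA click, as in the SQL FILTER clause.
CTA_TYPES = {"cta_click", "whatsapp_click"}


def aggregate_events(events, now):
    """Mirror the CTE: per-lead counts plus hours since the latest event."""
    agg = defaultdict(lambda: {
        "n_events": 0, "n_page_view": 0, "n_hook_complete": 0,
        "n_cta_click": 0, "last_ts": None,
    })
    for lead_id, event_type, ts in events:
        row = agg[lead_id]
        row["n_events"] += 1
        row["n_page_view"] += event_type == "page_view"
        row["n_hook_complete"] += event_type == "hook_complete"
        row["n_cta_click"] += event_type in CTA_TYPES
        if row["last_ts"] is None or ts > row["last_ts"]:
            row["last_ts"] = ts

    result = {}
    for lead_id, row in agg.items():
        last_ts = row.pop("last_ts")
        row["recency_last_event_hours"] = (now - last_ts).total_seconds() / 3600.0
        result[lead_id] = row
    return result


now = datetime(2026, 2, 2, 12, 0, tzinfo=timezone.utc)
features = aggregate_events(toy_events, now)
print(features["lead-a"])
print(features["lead-b"])
```

Leads without any event never show up in this result. The SQL covers them through the `LEFT JOIN` and the `COALESCE` defaults (`0` for counts, `9999` for recency).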


In [4]:
# Standard library: IO and training-flow control.
from pathlib import Path
import json
import time

# Model persistence and tabular data handling.
import joblib
import numpy as np
import pandas as pd

# Core scikit-learn building blocks for the pipeline and evaluation.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Global settings for reproducibility and artifact locations.
RANDOM_STATE = 42
TARGET_COL = "label_qualified"
DATA_PATH = Path("data/ml/lead_scoring_dataset.csv")
ARTIFACT_DIR = Path("data/ml/artifacts")
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

# Makes wide DataFrames easier to inspect in the notebook.
pd.set_option("display.max_columns", 200)


In [5]:
# 1) Make sure the dataset exists.
if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"File not found: {DATA_PATH}. Run the SQL extraction cell first."
    )

# 2) Load the training CSV.
df = pd.read_csv(DATA_PATH)

# 3) Validate the minimum schema the pipeline expects.
required_cols = {
    "uf", "cidade", "segmento_interesse", "orcamento_faixa", "prazo_compra",
    "n_events", "n_page_view", "n_hook_complete", "n_cta_click", "recency_last_event_hours",
}
missing = required_cols.difference(df.columns)
if missing:
    raise ValueError(f"Required columns missing from the dataset: {sorted(missing)}")

# 4) If the label is not provided, derive it from the final commercial status.
if TARGET_COL not in df.columns:
    if "status" not in df.columns:
        raise ValueError("Dataset has neither label_qualified nor status to derive the target from.")
    df[TARGET_COL] = df["status"].astype(str).str.upper().isin(["QUALIFICADO", "ENVIADO"]).astype(int)

# 5) Supervised training needs at least two classes in the target.
if df[TARGET_COL].nunique() < 2:
    raise ValueError(
        "Target has a single class. Generate more demo data (QUALIFICADO/ENVIADO and CURIOSO/AQUECENDO)."
    )

# 6) Coerce numeric columns to guarantee compatibility with the pipeline.
for c in ["n_events", "n_page_view", "n_hook_complete", "n_cta_click", "recency_last_event_hours"]:
    df[c] = pd.to_numeric(df[c], errors="coerce")

# 7) Final cleanup: drop rows without a target.
before = len(df)
df = df.dropna(subset=[TARGET_COL]).copy()
after = len(df)

# 8) Quick diagnostic of the dataset after cleanup.
print(f"Linhas totais: {after} (removidas {before - after} sem target)")
print("Distribuicao de classe:")
print(df[TARGET_COL].value_counts(normalize=True).rename("ratio").round(4))

df.head(3)


Linhas totais: 458 (removidas 0 sem target)
Distribuicao de classe:
label_qualified
0    0.8013
1    0.1987
Name: ratio, dtype: float64


| | lead_id | uf | cidade | segmento_interesse | orcamento_faixa | prazo_compra | status | n_events | n_page_view | n_hook_complete | n_cta_click | recency_last_event_hours | label_qualified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a484c6af-f02a-4509-a2b1-53a7f1647db4 | GO | Rio Verde | EQUIPAMENTOS | 60k+ | 90d | CURIOSO | 2 | 1 | 1 | 0 | 1932.080053 | 0 |
| 1 | 80e25dfa-ffd0-4995-8d2f-4baf846035c5 | SP | Sao Jose dos Campos | EQUIPAMENTOS | 60k+ | 7d | ENVIADO | 8 | 5 | 1 | 2 | 693.746719 | 1 |
| 2 | 447641a0-3e48-4e17-aa07-053b00bbfdbf | SP | Campinas | EVENTOS | 20k-60k | 90d | CURIOSO | 1 | 1 | 0 | 0 | 1806.696719 | 0 |


## &#129504; Feature Preparation and Split

- **Train:** 70%
- **Validation:** 15%
- **Test:** 15%


In [6]:
# Feature groups, defined once to simplify maintenance.
numeric_features = [
    "n_events",
    "n_page_view",
    "n_hook_complete",
    "n_cta_click",
    "recency_last_event_hours",
]

categorical_features = [
    "uf",
    "cidade",
    "segmento_interesse",
    "orcamento_faixa",
    "prazo_compra",
]

# Select X and y in the format scikit-learn expects.
feature_cols = numeric_features + categorical_features
X = df[feature_cols].copy()
y = df[TARGET_COL].astype(int)

# 70/15/15 split that preserves the class ratio (stratify).
X_train, X_temp, y_train, y_temp = train_test_split(
    X,
    y,
    test_size=0.30,
    stratify=y,
    random_state=RANDOM_STATE,
)

X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp,
    y_temp,
    test_size=0.50,
    stratify=y_temp,
    random_state=RANDOM_STATE,
)

print("Shapes:")
print(f"  Train: {X_train.shape}, Valid: {X_valid.shape}, Test: {X_test.shape}")

# Adjust the fold count dynamically to avoid errors on small datasets.
class_min_count = y_train.value_counts().min()
cv_splits = max(2, min(5, int(class_min_count)))
print(f"CV folds ajustado para: {cv_splits}")


Shapes:
  Train: (320, 10), Valid: (69, 10), Test: (69, 10)
CV folds ajustado para: 5
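
The shapes printed above follow from simple arithmetic on the 458 rows. The sketch below reproduces them, assuming the held-out fraction of each split is rounded up (which is consistent with the recorded shapes), and repeats the fold clamp used in the cell. `split_sizes` and `choose_cv_splits` are hypothetical helper names introduced only for this illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


def choose_cv_splits(min_class_count, max_splits=5):
    """Clamp the fold count to the range [2, max_splits]."""
    return max(2, min(max_splits, int(min_class_count)))


# First split: 70% train, 30% temporary hold-out.
n_train, n_temp = split_sizes(458, 0.30)
# Second split: the hold-out is divided in half into validation and test.
n_valid, n_test = split_sizes(n_temp, 0.50)
print(n_train, n_valid, n_test)

# The fold count only drops below 5 when the minority class is very small.
print(choose_cv_splits(64), choose_cv_splits(3), choose_cv_splits(1))
```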


In [7]:
# Preprocessing pipeline for logistic regression.
# Includes scaling to stabilize coefficients on numeric features.
preprocess_logit = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]
            ),
            numeric_features,
        ),
        (
            "cat",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            categorical_features,
        ),
    ]
)

# Preprocessing pipeline for Random Forest.
# No scaling needed, but imputation + one-hot encoding are kept.
preprocess_rf = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))]),
            numeric_features,
        ),
        (
            "cat",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            categorical_features,
        ),
    ]
)

# Chain preprocessing + estimator for the hyperparameter search.
pipe_logit = Pipeline(
    steps=[
        ("prep", preprocess_logit),
        (
            "model",
            LogisticRegression(
                max_iter=2500,
                solver="liblinear",
                random_state=RANDOM_STATE,
            ),
        ),
    ]
)

pipe_rf = Pipeline(
    steps=[
        ("prep", preprocess_rf),
        (
            "model",
            RandomForestClassifier(
                random_state=RANDOM_STATE,
                n_jobs=-1,
            ),
        ),
    ]
)

# Base grid for the first GridSearchCV round.
param_grid_logit = {
    "model__C": [0.1, 0.5, 1.0, 2.0, 5.0],
    "model__penalty": ["l1", "l2"],
    "model__class_weight": [None, "balanced"],
}

param_grid_rf = {
    "model__n_estimators": [200, 400, 700],
    "model__max_depth": [None, 8, 16],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
    "model__class_weight": [None, "balanced", "balanced_subsample"],
}

# Stratified CV for a fair comparison across hyperparameter combinations.
cv = StratifiedKFold(n_splits=cv_splits, shuffle=True, random_state=RANDOM_STATE)


## &#128269; GridSearchCV (Base Training)

Runs the hyperparameter search for both models, scored by **ROC-AUC**.


In [8]:
def run_grid_search(name, pipeline, param_grid, X_train, y_train):
    # Wrapper that avoids repetition and standardizes logging.
    print(f"\n>>> Treinando {name} com GridSearchCV...")
    gs = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        scoring="roc_auc",  # primary metric of the project
        cv=cv,
        n_jobs=-1,
        verbose=1,
        refit=True,  # refit on the best hyperparameter set
    )
    gs.fit(X_train, y_train)
    print(f"Melhor ROC-AUC CV ({name}): {gs.best_score_:.4f}")
    print("Melhores params:", gs.best_params_)
    return gs


# Base round (broad search) for both candidates.
gs_logit = run_grid_search("LogisticRegression", pipe_logit, param_grid_logit, X_train, y_train)
gs_rf = run_grid_search("RandomForest", pipe_rf, param_grid_rf, X_train, y_train)



>>> Treinando LogisticRegression com GridSearchCV...
Fitting 5 folds for each of 20 candidates, totalling 100 fits




Melhor ROC-AUC CV (LogisticRegression): 0.9774
Melhores params: {'model__C': 0.1, 'model__class_weight': None, 'model__penalty': 'l1'}

>>> Treinando RandomForest com GridSearchCV...
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
Melhor ROC-AUC CV (RandomForest): 0.9805
Melhores params: {'model__class_weight': None, 'model__max_depth': None, 'model__min_samples_leaf': 1, 'model__min_samples_split': 5, 'model__n_estimators': 200}
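
The candidate and fit counts in the log above are the product of the grid sizes times the number of folds. A small sketch, copying the two base grids from the notebook and using a hypothetical `n_candidates` helper, makes the cost of each search explicit:

```python
import math

# Base grids copied from the notebook.
param_grid_logit = {
    "model__C": [0.1, 0.5, 1.0, 2.0, 5.0],
    "model__penalty": ["l1", "l2"],
    "model__class_weight": [None, "balanced"],
}
param_grid_rf = {
    "model__n_estimators": [200, 400, 700],
    "model__max_depth": [None, 8, 16],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
    "model__class_weight": [None, "balanced", "balanced_subsample"],
}
cv_splits = 5


def n_candidates(param_grid):
    """Number of combinations an exhaustive grid search evaluates."""
    return math.prod(len(values) for values in param_grid.values())


for name, grid in [("logit", param_grid_logit), ("rf", param_grid_rf)]:
    candidates = n_candidates(grid)
    print(f"{name}: {candidates} candidates, {candidates * cv_splits} fits")
```

The Random Forest grid is roughly twelve times larger than the logistic one, which is why adding one more value to any of its five axes has a noticeable effect on run time.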


## &#127912; Fine Tuning (2nd round)

Narrows the search space around the best parameters found in the base round.


In [9]:
def build_fine_grid_logit(best_params):
    # Refine C around the best value from the base round.
    c = float(best_params["model__C"])
    c_candidates = sorted({max(1e-4, round(v, 5)) for v in [c * 0.5, c * 0.75, c, c * 1.25, c * 1.5]})
    return {
        "model__C": c_candidates,
        "model__penalty": [best_params["model__penalty"]],
        # dict.fromkeys drops the duplicate when the base winner is already "balanced".
        "model__class_weight": list(dict.fromkeys([best_params["model__class_weight"], "balanced"])),
    }


def build_fine_grid_rf(best_params):
    # Refine the forest around the best point from the base round.
    n_estimators = int(best_params["model__n_estimators"])
    max_depth = best_params["model__max_depth"]
    min_split = int(best_params["model__min_samples_split"])
    min_leaf = int(best_params["model__min_samples_leaf"])

    n_estimators_candidates = sorted({max(100, n_estimators - 150), n_estimators, n_estimators + 150})

    # An unlimited depth (None) has no neighbours, so it is kept as the only option.
    depth_candidates = [max_depth]
    if isinstance(max_depth, int):
        depth_candidates = sorted({max(3, max_depth - 4), max_depth, max_depth + 4})

    return {
        "model__n_estimators": n_estimators_candidates,
        "model__max_depth": depth_candidates,
        "model__min_samples_split": sorted({max(2, min_split - 1), min_split, min_split + 1}),
        "model__min_samples_leaf": sorted({max(1, min_leaf - 1), min_leaf, min_leaf + 1}),
        # Same de-duplication as above, for a base winner of "balanced_subsample".
        "model__class_weight": list(dict.fromkeys([best_params["model__class_weight"], "balanced_subsample"])),
    }


# Refined grids for the 2nd round.
fine_grid_logit = build_fine_grid_logit(gs_logit.best_params_)
fine_grid_rf = build_fine_grid_rf(gs_rf.best_params_)

# Fine-tuning round.
fine_logit = run_grid_search("LogisticRegression (fine)", pipe_logit, fine_grid_logit, X_train, y_train)
fine_rf = run_grid_search("RandomForest (fine)", pipe_rf, fine_grid_rf, X_train, y_train)



>>> Treinando LogisticRegression (fine) com GridSearchCV...
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Melhor ROC-AUC CV (LogisticRegression (fine)): 0.9774
Melhores params: {'model__C': 0.05, 'model__class_weight': None, 'model__penalty': 'l1'}

>>> Treinando RandomForest (fine) com GridSearchCV...
Fitting 5 folds for each of 36 candidates, totalling 180 fits




Melhor ROC-AUC CV (RandomForest (fine)): 0.9805
Melhores params: {'model__class_weight': None, 'model__max_depth': None, 'model__min_samples_leaf': 1, 'model__min_samples_split': 5, 'model__n_estimators': 200}
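
To see where the 10 and 36 fine-tuning candidates come from, the sketch below applies the same neighbourhood rules to the base-round winners (`C = 0.1`; `n_estimators = 200`, `min_samples_split = 5`, `min_samples_leaf = 1`, `max_depth = None`). `refine_c` and `neighbours` are hypothetical helpers that only restate the arithmetic of the grid builders; they are not functions from the notebook.

```python
def refine_c(c):
    """Five multiplicative neighbours of C, rounded and de-duplicated."""
    return sorted({max(1e-4, round(v, 5)) for v in [c * 0.5, c * 0.75, c, c * 1.25, c * 1.5]})


def neighbours(value, step, floor):
    """Value minus/plus one step, never below the floor, without duplicates."""
    return sorted({max(floor, value - step), value, value + step})


c_values = refine_c(0.1)
n_estimators = neighbours(200, step=150, floor=100)
min_split = neighbours(5, step=1, floor=2)
min_leaf = neighbours(1, step=1, floor=1)

# Logistic grid: C values x 1 penalty x 2 class_weight options.
logit_candidates = len(c_values) * 1 * 2
# Forest grid: max_depth=None contributes a single option, class_weight two.
rf_candidates = len(n_estimators) * 1 * len(min_split) * len(min_leaf) * 2

print(c_values)
print(n_estimators, min_split, min_leaf)
print(logit_candidates, rf_candidates)
```

Because `min_samples_leaf = 1` is already at its floor, that axis contributes only two values instead of three, which is why the forest grid has 36 candidates rather than 54.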


## &#128202; Evaluation and Tie-Break Criterion

Implemented rule:
- 1st: higher **validation ROC-AUC**
- technical tie (ROC-AUC gap <= 0.005): higher **validation PR-AUC**
- if still tied (PR-AUC gap <= 0.003): lower **validation Brier Score**
- if still tied (Brier gap <= 0.002): lower **latency (ms/record)**
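
Brier Score sits in this chain because two models can rank leads identically and still differ in how trustworthy their probabilities are. As a minimal sketch, assuming the usual definition (mean squared difference between predicted probability and the 0/1 outcome), with made-up labels and scores:

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


# Hypothetical labels and two hypothetical sets of scores.
y_true = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]  # probabilities close to the outcomes
hedging = [0.6, 0.4, 0.4, 0.6]    # same ranking, probabilities near 0.5

print(f"confident: {brier_score(y_true, confident):.3f}")
print(f"hedging:   {brier_score(y_true, hedging):.3f}")
```

Both score lists place every positive above every negative, so a ranking metric cannot separate them, while the Brier Score favours the first one. Lower is better.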


In [10]:
def evaluate_estimator(name, estimator, X_valid, y_valid, X_test, y_test):
    # Time inference and collect probabilities on the validation set.
    t0 = time.perf_counter()
    p_valid = estimator.predict_proba(X_valid)[:, 1]
    latency_valid = (time.perf_counter() - t0) * 1000 / max(1, len(X_valid))

    # Repeat the measurement on the test set to compare stability.
    t1 = time.perf_counter()
    p_test = estimator.predict_proba(X_test)[:, 1]
    latency_test = (time.perf_counter() - t1) * 1000 / max(1, len(X_test))

    # Default 0.5 threshold for metrics based on the predicted class.
    yhat_valid = (p_valid >= 0.5).astype(int)
    yhat_test = (p_test >= 0.5).astype(int)

    return {
        "model": name,
        "val_roc_auc": roc_auc_score(y_valid, p_valid),
        "val_pr_auc": average_precision_score(y_valid, p_valid),
        "val_brier": brier_score_loss(y_valid, p_valid),
        "val_f1": f1_score(y_valid, yhat_valid, zero_division=0),
        "val_precision": precision_score(y_valid, yhat_valid, zero_division=0),
        "val_recall": recall_score(y_valid, yhat_valid, zero_division=0),
        "val_latency_ms": latency_valid,
        "test_roc_auc": roc_auc_score(y_test, p_test),
        "test_pr_auc": average_precision_score(y_test, p_test),
        "test_brier": brier_score_loss(y_test, p_test),
        "test_f1": f1_score(y_test, yhat_test, zero_division=0),
        "test_precision": precision_score(y_test, yhat_test, zero_division=0),
        "test_recall": recall_score(y_test, yhat_test, zero_division=0),
        "test_latency_ms": latency_test,
    }


def select_winner(results_df, eps_auc=0.005, eps_pr=0.003, eps_brier=0.002):
    # Rank by the primary metrics: `a` leads on ROC-AUC, `b` is the challenger.
    ranked = results_df.sort_values(["val_roc_auc", "val_pr_auc"], ascending=[False, False]).reset_index(drop=True)
    a = ranked.iloc[0]
    b = ranked.iloc[1]

    reasons = []

    # Rule 1: ROC-AUC.
    if (a["val_roc_auc"] - b["val_roc_auc"]) > eps_auc:
        reasons.append("Higher ROC-AUC, no technical tie.")
        return a["model"], reasons

    # Rule 2: PR-AUC (higher is better). Either model may win the tie-break.
    reasons.append("Technical tie on ROC-AUC; breaking the tie by PR-AUC.")
    pr_gap = a["val_pr_auc"] - b["val_pr_auc"]
    if abs(pr_gap) > eps_pr:
        reasons.append("PR-AUC decided the winner.")
        return (a if pr_gap > 0 else b)["model"], reasons

    # Rule 3: Brier Score (lower is better). Either model may win the tie-break.
    reasons.append("Technical tie on PR-AUC; breaking the tie by Brier Score.")
    brier_gap = b["val_brier"] - a["val_brier"]
    if abs(brier_gap) > eps_brier:
        reasons.append("Brier Score decided the winner (lower is better).")
        return (a if brier_gap > 0 else b)["model"], reasons

    # Rule 4: mean inference latency (lower is better).
    reasons.append("Technical tie on Brier Score; breaking the tie by latency.")
    reasons.append("Inference latency decided the winner.")
    winner = a if a["val_latency_ms"] <= b["val_latency_ms"] else b
    return winner["model"], reasons


# Evaluate only the best estimators from the fine-tuning round.
results = [
    evaluate_estimator("logit_fine", fine_logit.best_estimator_, X_valid, y_valid, X_test, y_test),
    evaluate_estimator("rf_fine", fine_rf.best_estimator_, X_valid, y_valid, X_test, y_test),
]

results_df = pd.DataFrame(results).sort_values("val_roc_auc", ascending=False).reset_index(drop=True)
winner_name, winner_reasons = select_winner(results_df)

results_df


| | model | val_roc_auc | val_pr_auc | val_brier | val_f1 | val_precision | val_recall | val_latency_ms | test_roc_auc | test_pr_auc | test_brier | test_f1 | test_precision | test_recall | test_latency_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | logit_fine | 0.967532 | 0.818432 | 0.069498 | 0.695652 | 0.888889 | 0.571429 | 0.059996 | 1.0 | 1.0 | 0.038144 | 1.0 | 1.0 | 1.0 | 0.057081 |
| 1 | rf_fine | 0.95974 | 0.844241 | 0.073611 | 0.695652 | 0.888889 | 0.571429 | 0.430552 | 1.0 | 1.0 | 0.016229 | 1.0 | 1.0 | 1.0 | 0.426639 |
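
Plugging the validation figures from the table above into the first rule shows which step of the chain actually decides this run. The sketch below only checks the ROC-AUC gap against the `0.005` threshold; the dictionary layout and variable names are illustrative assumptions.

```python
# Validation metrics copied from the results table above.
val_metrics = {
    "logit_fine": {"roc_auc": 0.967532, "pr_auc": 0.818432, "brier": 0.069498},
    "rf_fine": {"roc_auc": 0.959740, "pr_auc": 0.844241, "brier": 0.073611},
}
EPS_AUC = 0.005  # technical-tie threshold for ROC-AUC

# Rank the two models by validation ROC-AUC, best first.
ranked = sorted(val_metrics, key=lambda m: val_metrics[m]["roc_auc"], reverse=True)
leader, challenger = ranked

gap = val_metrics[leader]["roc_auc"] - val_metrics[challenger]["roc_auc"]
technical_tie = gap <= EPS_AUC

print(f"{leader} leads {challenger} by {gap:.6f} ROC-AUC; technical tie: {technical_tie}")
```

The gap is larger than the threshold, so the decision is made on ROC-AUC alone and the later criteria are never consulted, even though `rf_fine` has the higher validation PR-AUC.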


In [None]:
# Table with visual highlighting to make the decision easier to read in the pitch.
display(
    results_df.style
    .format({
        "val_roc_auc": "{:.4f}",
        "val_pr_auc": "{:.4f}",
        "val_brier": "{:.4f}",
        "val_f1": "{:.4f}",
        "val_precision": "{:.4f}",
        "val_recall": "{:.4f}",
        "val_latency_ms": "{:.4f}",
        "test_roc_auc": "{:.4f}",
        "test_pr_auc": "{:.4f}",
        "test_brier": "{:.4f}",
        "test_f1": "{:.4f}",
        "test_precision": "{:.4f}",
        "test_recall": "{:.4f}",
        "test_latency_ms": "{:.4f}",
    })
    # Green: higher is better.
    .background_gradient(subset=["val_roc_auc", "val_pr_auc", "test_roc_auc", "test_pr_auc"], cmap="Greens")
    # Reversed orange-red: lower is better (Brier and latency).
    .background_gradient(subset=["val_brier", "test_brier", "val_latency_ms", "test_latency_ms"], cmap="OrRd_r")
)

print(f"[WINNER] Winning model: {winner_name}")
for r in winner_reasons:
    print(f"  - {r}")


## &#128190; Artifact Persistence

Saves the champion, the runner-up, and the selection report for integration with the scoring API.


In [11]:
# Map names to the final trained estimators.
model_map = {
    "logit_fine": fine_logit.best_estimator_,
    "rf_fine": fine_rf.best_estimator_,
}

# Define champion and runner-up for operational use (champion/challenger).
best_model = model_map[winner_name]
runner_up_name = [m for m in model_map.keys() if m != winner_name][0]
runner_up_model = model_map[runner_up_name]

# Persist the models with joblib for inference in the scoring_service.
joblib.dump(best_model, ARTIFACT_DIR / "lead_scoring_best_model.joblib")
joblib.dump(runner_up_model, ARTIFACT_DIR / "lead_scoring_runner_up_model.joblib")

# Report for auditing and reproducing the model decision.
report = {
    "winner": winner_name,
    "runner_up": runner_up_name,
    "selection_reasons": winner_reasons,
    "metrics": results_df.to_dict(orient="records"),
    "random_state": RANDOM_STATE,
    "target_col": TARGET_COL,
}

with open(ARTIFACT_DIR / "model_selection_report.json", "w", encoding="utf-8") as f:
    json.dump(report, f, ensure_ascii=False, indent=2)

print("[OK] Artefatos salvos com sucesso:")
print(f" - {ARTIFACT_DIR / 'lead_scoring_best_model.joblib'}")
print(f" - {ARTIFACT_DIR / 'lead_scoring_runner_up_model.joblib'}")
print(f" - {ARTIFACT_DIR / 'model_selection_report.json'}")


[OK] Artefatos salvos com sucesso:
 - data\ml\artifacts\lead_scoring_best_model.joblib
 - data\ml\artifacts\lead_scoring_runner_up_model.joblib
 - data\ml\artifacts\model_selection_report.json


In [12]:
%%bash
set -euo pipefail

echo "Resumo dos artefatos"
ls -lh data/ml/artifacts


Resumo dos artefatos
total 792K
-rwxrwxrwx 1 luizandre luizandre 5.9K Feb 15 16:55 lead_scoring_best_model.joblib
-rwxrwxrwx 1 luizandre luizandre 779K Feb 15 16:55 lead_scoring_runner_up_model.joblib
-rwxrwxrwx 1 luizandre luizandre 1.3K Feb 15 16:55 model_selection_report.json


## &#128268; Next Integration Steps

1. Copy `lead_scoring_best_model.joblib` to the scoring service
2. Update the `/score` endpoint to consume the champion model
3. Keep `runner_up_model` as a technical fallback
4. Version the selection JSON for auditing in the pitch
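
One possible shape for the champion-plus-fallback idea is sketched below. It is not the scoring service's actual code: `score_with_fallback` and the `ConstantModel` stand-in are hypothetical, and in the real service both models would presumably be loaded with `joblib.load` from the exported `.joblib` files and fed a DataFrame with the training feature columns.

```python
class ConstantModel:
    """Hypothetical stand-in exposing predict_proba like a fitted pipeline."""

    def __init__(self, p, fail=False):
        self.p = p
        self.fail = fail

    def predict_proba(self, rows):
        if self.fail:
            raise RuntimeError("model unavailable")
        # One [P(class 0), P(class 1)] pair per input row.
        return [[1.0 - self.p, self.p] for _ in rows]


def score_with_fallback(rows, champion, runner_up):
    """Return (model_used, positive-class probabilities).

    Tries the champion first and falls back to the runner-up on any error.
    """
    try:
        proba = champion.predict_proba(rows)
        used = "champion"
    except Exception:
        proba = runner_up.predict_proba(rows)
        used = "runner_up"
    return used, [float(p[1]) for p in proba]


rows = [{"uf": "SP"}, {"uf": "GO"}]  # placeholder payload, not the real schema

ok_used, ok_scores = score_with_fallback(rows, ConstantModel(0.8), ConstantModel(0.3))
fb_used, fb_scores = score_with_fallback(rows, ConstantModel(0.8, fail=True), ConstantModel(0.3))
print(ok_used, ok_scores)
print(fb_used, fb_scores)
```

Returning the name of the model that produced the score keeps fallback events visible to whoever consumes the `/score` response or its logs.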


## &#9203; CLI Execution (without the notebook)

To run outside Jupyter, use the script:

```bash
python tools/ml/train_lead_scoring.py --input-csv data/ml/lead_scoring_dataset.csv --output-dir data/ml/artifacts
```

In [14]:
from pathlib import Path
import subprocess
import sys

# Locate the repository root automatically.
cwd = Path.cwd()
repo_root = cwd
for p in [cwd, *cwd.parents]:
    if (p / "tools" / "ml" / "train_lead_scoring.py").exists():
        repo_root = p
        break

cmd = [
    sys.executable,
    str(repo_root / "tools" / "ml" / "train_lead_scoring.py"),
    "--input-csv",
    str(repo_root / "data" / "ml" / "lead_scoring_dataset.csv"),
    "--output-dir",
    str(repo_root / "data" / "ml" / "artifacts"),
]

print("Executando:", " ".join(cmd))
subprocess.check_call(cmd, cwd=repo_root)
print("Treino via CLI concluido com sucesso.")


Executando: c:\Users\USER\Documents\Repositorios\growth_equestre_hackathon_2026_backup\.venv\Scripts\python.exe c:\Users\USER\Documents\Repositorios\growth_equestre_hackathon_2026_backup\tools\ml\train_lead_scoring.py --input-csv c:\Users\USER\Documents\Repositorios\growth_equestre_hackathon_2026_backup\data\ml\lead_scoring_dataset.csv --output-dir c:\Users\USER\Documents\Repositorios\growth_equestre_hackathon_2026_backup\data\ml\artifacts
Treino via CLI concluido com sucesso.
