# 7. Model Optimization – Class Imbalance and Hyperparameter Tuning

In this notebook we:

- Address the class imbalance in the `churn` target
- Apply class weights
- Use a validation split for model selection (`TrainValidationSplit`, a lighter alternative to k-fold cross-validation)
- Tune hyperparameters

Goal: improve the model's robustness and generalization.


In [0]:
from pyspark.sql.functions import col, when
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit




In [0]:
spark.conf.set(
    "SPARKML_TEMP_DFS_PATH",
    "/Volumes/workspace/crisp-dm/ml_volume/sparkml_temp"
)

In [0]:
df = spark.table("`crisp-dm`.gold_churn")

display(df)


In [0]:
df.groupBy("churn").count().display()


In [0]:
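# Row count per class, collected to the driver (one row per churn value).
counts = df.groupBy("churn").count().collect()

total = sum(row["count"] for row in counts)

# Inverse-frequency weights: the rarer the class, the larger its per-row weight.
class_weights = {
    row["churn"]: total / row["count"]
    for row in counts
}

# Attach the weight as a column so LogisticRegression can read it via weightCol.
df_weighted = df.withColumn(
    "classWeight",
    when(col("churn") == 1, class_weights[1])
    .otherwise(class_weights[0])
)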
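The cell above gives each class a per-row weight of `total / count`. As a minimal pure-Python sketch of what that formula does, the example below uses hypothetical counts (900 non-churners and 100 churners, made up for illustration and not taken from `gold_churn`). After weighting, both classes contribute the same total weight to the loss.

```python
# Hypothetical class counts, for illustration only (not from gold_churn).
counts = {0: 900, 1: 100}

total = sum(counts.values())

# Same formula as the notebook cell: inverse class frequency.
class_weights = {label: total / n for label, n in counts.items()}

# Total weight contributed by each class = number of rows * per-row weight.
weighted_totals = {
    label: counts[label] * class_weights[label] for label in counts
}

print(class_weights)
print(weighted_totals)
```

With these assumed counts, a churner row weighs nine times as much as a non-churner row, which is the ratio between the two class sizes.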


In [0]:
df_model = df_weighted.select(
    "recency",
    "frequency",
    "monetary",
    "churn",
    "classWeight"
)


In [0]:
assembler = VectorAssembler(
    inputCols=["recency", "frequency", "monetary"],
    outputCol="features"
)

df_vector = assembler.transform(df_model)


In [0]:
train_df, test_df = df_vector.randomSplit([0.7, 0.3], seed=42)


In [0]:
lr = LogisticRegression(
    featuresCol="features",
    labelCol="churn",
    weightCol="classWeight"
)


In [0]:
paramGrid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.0, 0.01, 0.1])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build()
)
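`ParamGridBuilder` expands the two value lists into their Cartesian product, so every `regParam` is paired with every `elasticNetParam`. The plain-Python sketch below reproduces that expansion with `itertools.product` to make the number of candidate models explicit. The dictionary keys are illustrative labels, not Spark `Param` objects.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder cell.
reg_params = [0.0, 0.01, 0.1]
elastic_net_params = [0.0, 0.5, 1.0]

# Every (regParam, elasticNetParam) pair is one candidate model.
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

print(len(grid))
for params in grid:
    print(params)
```

Note that when `regParam` is `0.0` there is no penalty at all, so the three candidates that differ only in `elasticNetParam` describe the same unregularized model.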


In [0]:
evaluator = BinaryClassificationEvaluator(
    labelCol="churn",
    metricName="areaUnderROC"
)
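`areaUnderROC` can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The sketch below computes it that way in plain Python on hypothetical labels and scores. `area_under_roc` is a helper written for this illustration, not a Spark function, and Spark derives the metric from the ROC curve rather than by enumerating pairs, so small numerical differences are possible.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one row of each class")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and predicted churn probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

auc_example = area_under_roc(labels, scores)
print(auc_example)
```

In this toy data eight of the nine churner/non-churner pairs are ordered correctly, so the value is 8/9. Because the metric depends only on ranking, it is not affected by where the classification threshold is placed.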


In [0]:
tvs = TrainValidationSplit(
    estimator=lr,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    trainRatio=0.8
)
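`CrossValidator` is imported at the top of the notebook, but the cell above uses a single train/validation split. To show what k-fold validation would change, the sketch below builds the folds on row indices in plain Python. `k_fold_indices` is a hypothetical helper for illustration only; in Spark the equivalent role is played by `CrossValidator` and its `numFolds` parameter, whose internal fold assignment may differ from this sketch.

```python
import random


def k_fold_indices(n_rows, num_folds, seed=42):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into num_folds roughly equal folds.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


splits = list(k_fold_indices(n_rows=10, num_folds=5))
for train_idx, valid_idx in splits:
    print(sorted(valid_idx), len(train_idx))
```

The trade-off is cost: with the nine-candidate grid defined earlier, a single split fits each candidate once, while five folds would fit each candidate five times, in exchange for a validation score averaged over all rows.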


In [0]:
# Fit the TrainValidationSplit so paramGrid is actually searched (fitting lr alone skips tuning).
tvs_model = tvs.fit(train_df)
best_model = tvs_model.bestModel


In [0]:
predictions = best_model.transform(test_df)

display(predictions)


In [0]:
auc = evaluator.evaluate(predictions)

print("Final AUC:", auc)


## Optimization Conclusion

After applying class weighting and validation-based tuning:

- The model showed an improvement in AUC
- Using `classWeight` reduced the bias toward the majority class
- Selecting hyperparameters on a held-out validation split made the result more robust

The final model is ready to be applied in production.
