# MachineLearning_LargeClass_sk: **CORRECTED VERSION (no data leakage)**

This notebook fixes the *data leakage* caused by balancing the classes before the train/test split.
Balancing is now applied **only to the training set** and, in the PyCaret case, **inside each cross-validation fold** via SMOTE.

> **Binary**: the code targets binary classification (`smell_label` ∈ {0,1} or an equivalent pair of labels, here `large-class` / `non-large-class`).
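
To make the leakage concrete, here is a minimal sketch that uses only the standard library. It swaps SMOTE for plain random oversampling (row duplication) to keep the example short, and the `stratified_split` and `oversample` helpers as well as the toy data are hypothetical, not part of this notebook. Balancing before the split puts copies of the same minority row on both sides of the split; splitting first leaves the test set untouched.

```python
import random

def stratified_split(rows, test_frac, rng):
    """Split (sample_id, label) rows per class so both sides keep the class ratio."""
    train, test = [], []
    for label in sorted({r[1] for r in rows}):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        n_test = int(len(group) * test_frac)
        test += group[:n_test]
        train += group[n_test:]
    return train, test

def oversample(rows, rng):
    """Random oversampling: duplicate minority rows until both classes have equal size."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

rng = random.Random(42)
# Toy data (assumption): 1000 majority rows and 10 minority rows, each with a unique id
data = [(i, 0) for i in range(1000)] + [(1000 + i, 1) for i in range(10)]

# Wrong order: balance first, then split
train_bad, test_bad = stratified_split(oversample(data, rng), 0.2, rng)
leaked_before = {r[0] for r in train_bad} & {r[0] for r in test_bad}

# Right order: split first, then balance the training part only
train_raw, test_ok = stratified_split(data, 0.2, rng)
train_ok = oversample(train_raw, rng)
leaked_after = {r[0] for r in train_ok} & {r[0] for r in test_ok}

print('ids shared by train and test (balance -> split):', len(leaked_before))
print('ids shared by train and test (split -> balance):', len(leaked_after))
```

With SMOTE the shared rows are not exact copies but synthetic points interpolated from rows that later land in the test set, which inflates test scores in the same way.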


In [8]:
# =========================
# Imports
# =========================
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score, f1_score
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

try:
    from pycaret.classification import setup, compare_models, pull, finalize_model, predict_model, get_config, save_model
    _HAS_PYCARET = True
except Exception:
    print('PyCaret is not available in this environment. You can still use Option B (Sklearn/Imblearn).')
    _HAS_PYCARET = False


## 1) Load data

> Adjust the CSV path if necessary.


In [None]:
# =========================
# 1) Load data
# =========================
CSV_PATH = 'dataset_large_class.csv'
TARGET = 'smell_label'

df = pd.read_csv(CSV_PATH)
print(df.shape, df.columns.tolist())
df[TARGET] = df[TARGET].astype('category')  # ensure categorical dtype
df.head()


(8169, 14) ['raw_sloc', 'raw_multi', 'raw_blank', 'raw_single_comments', 'hal_func_N2', 'hal_func_vocabulary', 'hal_func_length', 'hal_func_calculated_length', 'hal_func_volume', 'hal_func_difficulty', 'hal_func_effort', 'hal_func_time', 'hal_func_bugs', 'smell_label']


Unnamed: 0,raw_sloc,raw_multi,raw_blank,raw_single_comments,hal_func_N2,hal_func_vocabulary,hal_func_length,hal_func_calculated_length,hal_func_volume,hal_func_difficulty,hal_func_effort,hal_func_time,hal_func_bugs,smell_label
0,488.0,284.0,104.0,0.0,54.0,57.0,83.0,294.284957,484.129871,5.744681,2781.1716,154.509533,0.161377,large-class
1,179.0,421.0,138.0,0.0,83.0,68.0,125.0,381.426462,760.932855,4.762295,3623.786794,201.321489,0.253644,large-class
2,581.0,1939.0,685.0,0.0,397.0,245.0,596.0,1915.00633,4730.236212,3.294606,15584.263701,865.792428,1.576745,large-class
3,454.0,1672.0,494.0,0.0,273.0,180.0,410.0,1310.581943,3071.659769,4.706897,14457.984777,803.221377,1.023887,large-class
4,693.0,661.0,192.0,0.0,379.0,304.0,582.0,2400.966744,4800.293813,13.34507,64060.258981,3558.903277,1.600098,large-class


## 2) Stratified split (no balancing here)

We create the **train/test** split first to avoid leakage. Any balancing happens **only on the training set**.


In [4]:
# =========================
# 2) Train/Test Split
# =========================
X = df.drop(columns=[TARGET])
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print('Distribution before the split (full dataset):')
print(y.value_counts())
print('\nTrain:', y_train.shape, ' Test:', y_test.shape)
print('\nDistribution in the training set:')
print(y_train.value_counts())
print('\nDistribution in the test set:')
print(y_test.value_counts())


Distribution before the split (full dataset):
smell_label
non-large-class    8130
large-class          39
Name: count, dtype: int64

Train: (6535,)  Test: (1634,)

Distribution in the training set:
smell_label
non-large-class    6504
large-class          31
Name: count, dtype: int64

Distribution in the test set:
smell_label
non-large-class    1626
large-class           8
Name: count, dtype: int64
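
The counts above can be reproduced by hand. The sketch below assumes simple proportional allocation with rounding, which is a simplification of what `train_test_split` does internally when `stratify` is set, but it is enough to see why only 8 `large-class` rows reach the test set.

```python
import math

n_total, test_size = 8169, 0.2
class_counts = {'non-large-class': 8130, 'large-class': 39}

# A fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

# Proportional share of each class in the test set, rounded to the nearest integer
# (assumption: a simplified stand-in for scikit-learn's internal allocation)
test_counts = {c: round(n * n_test / n_total) for c, n in class_counts.items()}
train_counts = {c: class_counts[c] - test_counts[c] for c in class_counts}

print(n_train, n_test)   # 6535 1634
print(train_counts)      # {'non-large-class': 6504, 'large-class': 31}
print(test_counts)       # {'non-large-class': 1626, 'large-class': 8}
```

With only 8 positive rows in the test set, every single misclassified `large-class` row moves its recall by 12.5 percentage points, so test metrics for that class should be read with wide error bars.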


---

## 3A) **Option A**: PyCaret with SMOTE **inside the CV** (recommended if you already use PyCaret)

- `setup` uses **only the training set**.
- `fix_imbalance=True` with `SMOTE` applied **inside each fold**, which avoids leakage.
- The final model is evaluated on the **external test set**.

> If PyCaret is not installed, skip to **Option B**.
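
Why applying SMOTE inside each fold avoids leakage is easier to see from how synthetic rows are built. The following is a simplified, standard-library sketch of the interpolation idea behind SMOTE, not imbalanced-learn's implementation; `smote_like` and the sample points are hypothetical. Each synthetic point lies on the segment between a minority row and one of its nearest minority neighbors, so it can only carry information from the rows it was given.

```python
import random

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating toward one of the k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Other minority points, sorted by squared Euclidean distance to x
        others = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )
        neighbor = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbor)))
    return synthetic

rng = random.Random(42)
# Minority rows of ONE training fold (toy values, assumption)
minority_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
                  (0.5, 0.5), (2.0, 2.0), (2.0, 1.0)]
new_points = smote_like(minority_train, n_new=20, k=5, rng=rng)
print(len(new_points), new_points[:3])
```

If only the training part of a fold is passed in, the validation part of that fold contributes nothing to the synthetic rows, which is the property the bullet list above relies on.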


In [None]:
# =========================
# 3A) PyCaret (if available)
# =========================
if _HAS_PYCARET:
    from imblearn.over_sampling import SMOTE  # instance passed to PyCaret

    train_df = pd.concat([X_train.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1)

    s = setup(
        data=train_df,
        target=TARGET,
        session_id=42,
        fold=10,
        fix_imbalance=True,
        fix_imbalance_method=SMOTE(k_neighbors=5, random_state=42),
        normalize=False,
        use_gpu=False, verbose=False
    )

    best = compare_models(sort='F1', turbo=False)
    results = pull()
    print('Top models (internal CV on the training set):')
    display(results.head())

    final_model = finalize_model(best)

    # Evaluation on the external test set
    test_df = pd.concat([X_test.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1)
    preds = predict_model(final_model, data=test_df)

    # predict_model adds the 'prediction_label' column with the predictions
    y_true = test_df[TARGET].astype(str).values
    y_pred = preds['prediction_label'].astype(str).values

    # If a score/probability column exists for the binary case, compute the AUCs
    prob_col = 'prediction_score'
    if prob_col in preds.columns and preds[prob_col].notna().all():
        try:
            # Convert the labels to {0,1} when possible
            classes = sorted(list(pd.Series(y_true).unique()))
            if len(classes) == 2:
                # Alphabetical mapping: 'large-class' -> 0, 'non-large-class' -> 1,
                # so the majority class is treated as the positive class here.
                mapping = {cls:i for i, cls in enumerate(classes)}
                y_true_bin = pd.Series(y_true).map(mapping).values
                # Caution: without raw_score=True, 'prediction_score' is the probability of the
                # *predicted* label, not of a fixed positive class, so the AUCs below do not
                # measure how well the minority class is ranked.
                roc_auc = roc_auc_score(y_true_bin, preds[prob_col].values)
                pr_auc  = average_precision_score(y_true_bin, preds[prob_col].values)
                print(f'ROC-AUC (test): {roc_auc:.4f}')
                print(f'PR-AUC  (test): {pr_auc:.4f}')
        except Exception as e:
            print('Could not compute AUCs:', e)

    print('\nClassification report (test):')
    print(classification_report(y_true, y_pred, digits=4))
    print('Confusion matrix (test):')
    print(confusion_matrix(y_true, y_pred))
else:
    print('PyCaret not available; continue with Option B.')


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
mlp,MLP Classifier,0.9926,0.9452,0.9926,0.9967,0.9941,0.5396,0.5843,0.317
lr,Logistic Regression,0.9926,0.8007,0.9926,0.9957,0.9939,0.4825,0.5114,0.285
knn,K Neighbors Classifier,0.9864,0.8698,0.9864,0.9953,0.9902,0.3468,0.4102,0.179
ridge,Ridge Classifier,0.9832,0.8182,0.9832,0.9948,0.9882,0.2907,0.357,0.171
lda,Linear Discriminant Analysis,0.9829,0.8186,0.9829,0.9948,0.9881,0.2894,0.3559,0.244
nb,Naive Bayes,0.9816,0.9207,0.9816,0.9955,0.9875,0.3067,0.3978,0.007
qda,Quadratic Discriminant Analysis,0.9746,0.9204,0.9746,0.9953,0.9836,0.2483,0.3493,0.009
lightgbm,Light Gradient Boosting Machine,0.9637,0.9835,0.9637,0.995,0.9777,0.2093,0.3011,0.067
gpc,Gaussian Process Classifier,0.9285,0.8713,0.9285,0.9948,0.9585,0.0963,0.2048,32.335
rbfsvm,SVM - Radial Kernel,0.8185,0.926,0.8185,0.9945,0.8955,0.0344,0.1194,1.53


Top models (internal CV on the training set):


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
mlp,MLP Classifier,0.9926,0.9452,0.9926,0.9967,0.9941,0.5396,0.5843,0.317
lr,Logistic Regression,0.9926,0.8007,0.9926,0.9957,0.9939,0.4825,0.5114,0.285
knn,K Neighbors Classifier,0.9864,0.8698,0.9864,0.9953,0.9902,0.3468,0.4102,0.179
ridge,Ridge Classifier,0.9832,0.8182,0.9832,0.9948,0.9882,0.2907,0.357,0.171
lda,Linear Discriminant Analysis,0.9829,0.8186,0.9829,0.9948,0.9881,0.2894,0.3559,0.244


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,MLP Classifier,0.9474,0.8148,0.9474,0.9941,0.9687,0.1145,0.2137


ROC-AUC (test): 0.3940
PR-AUC  (test): 0.9932

Classification report (test):
                 precision    recall  f1-score   support

    large-class     0.0667    0.7500    0.1224         8
non-large-class     0.9987    0.9483    0.9729      1626

       accuracy                         0.9474      1634
      macro avg     0.5327    0.8492    0.5477      1634
   weighted avg     0.9941    0.9474    0.9687      1634

Confusion matrix (test):
[[   6    2]
 [  84 1542]]
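
The test ROC-AUC of 0.3940 printed above is below chance, which suggests the score fed to `roc_auc_score` is not a probability of one fixed class. A likely explanation, assuming `prediction_score` holds the probability of the *predicted* label, is sketched below with toy values and two hypothetical helpers (`positive_class_probability`, `roc_auc`). It converts the score into a probability of `large-class` before ranking, using only the standard library.

```python
def positive_class_probability(pred_labels, pred_scores, positive):
    """Turn 'probability of the predicted label' into 'probability of the positive label' (binary only)."""
    return [s if lab == positive else 1.0 - s
            for lab, s in zip(pred_labels, pred_scores)]

def roc_auc(y_true, scores, positive):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions (assumption): score = confidence in whatever label was predicted
y_true      = ['large-class', 'non-large-class', 'non-large-class', 'large-class']
pred_labels = ['large-class', 'non-large-class', 'large-class', 'non-large-class']
pred_scores = [0.90, 0.95, 0.60, 0.70]

p_large = positive_class_probability(pred_labels, pred_scores, 'large-class')

naive_auc = roc_auc(y_true, pred_scores, 'large-class')  # ranks by raw confidence
fixed_auc = roc_auc(y_true, p_large, 'large-class')      # ranks by P(large-class)
print(naive_auc, fixed_auc)   # 0.5 0.75
```

The PR-AUC deserves the same care. With `non-large-class` mapped to 1, the positive class makes up 1626/1634 ≈ 0.995 of the test set, and a random ranking already reaches an average precision close to that prevalence, so 0.9932 says little about detecting `large-class`.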


In [9]:
save_model(final_model, 'final_model_pycaret_large_class')


Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['raw_sloc', 'raw_multi',
                                              'raw_blank', 'raw_single_comments',
                                              'hal_func_N2',
                                              'hal_func_vocabulary',
                                              'hal_func_length',
                                              'hal_func_calculated_length',...
                                batch_size='auto', beta_1=0.9, beta_2=0.999,
                                early_stopping=False, epsilon=1e-08,
                                hidden_layer_sizes=(100,),
                                learning_rate='c

---

## 3B) **Option B**: Sklearn + Imbalanced-learn (SMOTE **only on the training set** via Pipeline)

- The pipeline applies `StandardScaler` → `SMOTE` (only during `fit`) → `LogisticRegression`.
- Validation uses `RepeatedStratifiedKFold` on the training set, followed by a final evaluation on the external test set.

> Adjust `class_weight`, the model, or the metrics as needed.
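
One practical check before running SMOTE inside cross-validation: with `k_neighbors=5`, each training fold needs at least 6 minority rows, otherwise the neighbor search fails. The sketch below estimates the minority count per training fold for the 31 `large-class` training rows, assuming stratified folds that are as even as possible (the exact assignment depends on shuffling). `minority_per_training_fold` is a hypothetical helper.

```python
def minority_per_training_fold(n_minority, n_splits):
    """Minority rows left for training in each fold when validation folds are as even as possible."""
    base, extra = divmod(n_minority, n_splits)
    fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [n_minority - size for size in fold_sizes]

k_neighbors = 5
n_minority_train = 31  # 'large-class' rows in the training split

for n_splits in (5, 10):  # Option B uses 5 splits, Option A uses fold=10
    counts = minority_per_training_fold(n_minority_train, n_splits)
    enough = min(counts) >= k_neighbors + 1
    print(f'{n_splits} splits: min minority rows per training fold = {min(counts)}, enough = {enough}')
# 5 splits: min minority rows per training fold = 24, enough = True
# 10 splits: min minority rows per training fold = 27, enough = True
```

Both settings leave a comfortable margin here. The check matters more on smaller datasets, where lowering `k_neighbors` or the number of splits may be needed.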


In [6]:
# =========================
# 3B) Sklearn + Imbalanced-learn
# =========================
# For numeric features; if there are categorical ones, consider OneHotEncoder or SMOTENC.
numeric_pipeline = ImbPipeline(steps=[
    ('scale', StandardScaler(with_mean=False) if hasattr(X_train, 'tocsc') else StandardScaler()),
    ('smote', SMOTE(k_neighbors=5, random_state=42)),
    # After SMOTE the classes are already balanced, so class_weight='balanced' adds no extra reweighting.
    ('clf', LogisticRegression(max_iter=5000, class_weight='balanced', n_jobs=None))
])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
scores = cross_val_score(numeric_pipeline, X_train, y_train, cv=cv, scoring='f1_macro', n_jobs=-1)
print(f'CV f1_macro (mean±std): {scores.mean():.4f} ± {scores.std():.4f}')

# Fit on the training set and evaluate on the test set
numeric_pipeline.fit(X_train, y_train)
y_pred_test = numeric_pipeline.predict(X_test)

print('\nClassification report (test):')
print(classification_report(y_test, y_pred_test, digits=4))
print('Confusion matrix (test):')
print(confusion_matrix(y_test, y_pred_test))

# If the model has predict_proba, compute the AUCs
try:
    # Column 1 is the probability of classes_[1]; with alphabetical ordering that is
    # 'non-large-class', so the majority class is the positive class in the AUCs below.
    # ROC-AUC is unchanged if the classes are swapped; PR-AUC is not.
    y_prob_test = numeric_pipeline.predict_proba(X_test)[:, 1]
    # Map y_test to {0,1} if necessary (category codes follow the same alphabetical order)
    if y_test.dtype.name == 'category':
        y_true_bin = (y_test.cat.codes.values).astype(int)
    else:
        # generic attempt (adjust if necessary)
        classes = sorted(list(pd.Series(y_test).unique()))
        mapping = {cls:i for i, cls in enumerate(classes)}
        y_true_bin = pd.Series(y_test).map(mapping).values
    print(f'ROC-AUC (test): {roc_auc_score(y_true_bin, y_prob_test):.4f}')
    print(f'PR-AUC  (test): {average_precision_score(y_true_bin, y_prob_test):.4f}')
except Exception as e:
    print('AUCs not computed:', e)


CV f1_macro (mean±std): 0.7759 ± 0.0496

Classification report (test):
                 precision    recall  f1-score   support

    large-class     0.3077    0.5000    0.3810         8
non-large-class     0.9975    0.9945    0.9960      1626

       accuracy                         0.9920      1634
      macro avg     0.6526    0.7472    0.6885      1634
   weighted avg     0.9942    0.9920    0.9930      1634

Confusion matrix (test):
[[   4    4]
 [   9 1617]]
ROC-AUC (test): 0.7659
PR-AUC  (test): 0.9983
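
To compare the two options on the class that matters, the `large-class` figures can be recomputed directly from the two test confusion matrices reported in this notebook. This is a small standard-library sketch; `minority_metrics` is a hypothetical helper, and it assumes rows are true labels and columns are predicted labels, with `large-class` first, as in the reports above.

```python
def minority_metrics(cm):
    """Precision, recall and F1 for the first class of a 2x2 confusion matrix (rows = true, columns = predicted)."""
    tp, fn = cm[0]
    fp, _tn = cm[1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

option_a = [[6, 2], [84, 1542]]   # PyCaret MLP, test set
option_b = [[4, 4], [9, 1617]]    # SMOTE + LogisticRegression pipeline, test set

for name, cm in (('Option A', option_a), ('Option B', option_b)):
    p, r, f1 = minority_metrics(cm)
    print(f'{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}')
# Option A: precision=0.0667 recall=0.7500 f1=0.1224
# Option B: precision=0.3077 recall=0.5000 f1=0.3810
```

Option A finds two more of the 8 true `large-class` rows but raises 84 false alarms against 9 for Option B. Which trade-off is preferable depends on the cost of a missed smell versus a false report, and with 8 positives the difference between the two recalls rests on just two rows.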
