# Oversampling with SMOTE for multiclass classification

In this notebook, **oversampling** with SMOTE was applied to class 2, the smallest class: the analysis showed that it had only 70 instances.
Four classification models were then trained and evaluated:
- MLP (neural network)
- KNN
- SVM
- Balanced Random Forest

The goal is to reduce overfitting and improve the models' generalization.
A StandardScaler is applied for the MLP, KNN, and SVM models, while the Balanced Random Forest does not need feature scaling. The models are evaluated on:
- the training set (oversampled)
- the test set (not oversampled)
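
To make the oversampling step more concrete, the following is a minimal sketch of the interpolation idea behind SMOTE, written in plain Python. It is not the `imblearn` implementation used in this notebook: the function name `smote_sketch`, its parameters, and the toy points are assumptions made for illustration only.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (the base itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features (illustrative values, not from the dataset).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(minority) + len(new_points))  # 10 samples after oversampling
```

Each synthetic point lies on a segment between two real minority samples, so the new data stays inside the region already covered by the class. This is also why oversampling is applied to the training set only: the test set must contain real observations.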

In [None]:
import pandas as pd
import numpy as np
import sys
sys.path.append('../Scripts')
from utility import evaluate_and_save_model_multiclass
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler

X_train = pd.read_csv("../data/splitted_category/X_train.csv")
X_test = pd.read_csv("../data/splitted_category/X_test.csv")
y_train = pd.read_csv("../data/splitted_category/y_train.csv").values.ravel()
y_test = pd.read_csv("../data/splitted_category/y_test.csv").values.ravel()

# Bring class 2 up to 190 samples; the other classes are left unchanged
smote = SMOTE(sampling_strategy={2: 190}, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Fit the scaler on the (resampled) training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

models = {
    "MLP": MLPClassifier(hidden_layer_sizes=(64, 32), activation='tanh',
                         alpha=0.01, learning_rate_init=0.01, solver='adam',
                         max_iter=300, early_stopping=True,
                         validation_fraction=0.2, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=9, weights='distance'),
    "SVM": SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42),
    "Balanced random forest": BalancedRandomForestClassifier(
        n_estimators=100, max_depth=15, min_samples_leaf=1,
        min_samples_split=2, max_features='log2', random_state=42),
}

use_scaled = ["MLP", "KNN", "SVM"]
for name, model in models.items():
    print(f"\n=== {name} ===")

    # Scale-sensitive models use the standardized features;
    # the Balanced Random Forest uses the unscaled ones.
    if name in use_scaled:
        X_fit, X_eval = X_train_scaled, X_test_scaled
    else:
        X_fit, X_eval = X_train_resampled, X_test

    model.fit(X_fit, y_train_resampled)
    y_pred_train = model.predict(X_fit)
    y_pred_test = model.predict(X_eval)
    evaluate_and_save_model_multiclass(
        model,
        name,
        y_train_resampled,
        y_pred_train,
        y_test,
        y_pred_test,
        "../results/classification_category/oversampling",
        f"../models/{name}_oversampling_category.joblib",
        # Metadata note, kept verbatim: "only on minority class 2"
        {"oversampling":" solo sulle classi minoritarie 2"}
    )



=== MLP ===

=== KNN ===

=== SVM ===

=== Balanced random forest ===
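
The cell above prints only the model headers, and the internals of `evaluate_and_save_model_multiclass` are not shown in this notebook. As a rough illustration of how the gap between training and test performance could be measured per class, here is a small plain-Python sketch. The helper names `per_class_recall` and `macro_recall` and the toy labels are hypothetical; they are not taken from the `utility` module and are not results from this notebook.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {class: recall} computed from two label sequences."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {c: hits[c] / totals[c] for c in sorted(totals)}


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls, so small classes count as much as large ones."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy labels (illustrative only, not results from the notebook).
y_true_train = [0, 0, 2, 2, 2, 3, 3, 3]
y_pred_train = [0, 0, 2, 2, 2, 3, 3, 3]
y_true_test = [0, 0, 2, 2, 3, 3, 3, 3]
y_pred_test = [0, 3, 3, 2, 3, 3, 3, 0]

# A large positive gap suggests the model fits the training set better than it generalizes.
gap = macro_recall(y_true_train, y_pred_train) - macro_recall(y_true_test, y_pred_test)
print(per_class_recall(y_true_test, y_pred_test))
print(round(gap, 3))
```

A macro-averaged metric is a reasonable choice here because the classes are imbalanced: plain accuracy would be dominated by the larger classes and could hide poor recall on class 2.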


In [2]:
import pandas as pd
print("Class distribution in the training set:")
print(pd.Series(y_train).value_counts().sort_index())


Class distribution in the training set:
0    166
1    131
2     70
3    396
4    382
5    389
6    314
Name: count, dtype: int64
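
The effect of `sampling_strategy={2: 190}` on this distribution can be checked with simple arithmetic. The sketch below uses the counts printed above together with the target passed to SMOTE; the variable names are illustrative.

```python
from collections import Counter

# Training-set class counts as printed above.
counts = Counter({0: 166, 1: 131, 2: 70, 3: 396, 4: 382, 5: 389, 6: 314})
target = {2: 190}  # same dictionary passed to SMOTE(sampling_strategy=...)

# Number of synthetic samples SMOTE has to generate for each targeted class.
synthetic_needed = {c: n - counts[c] for c, n in target.items()}
after = {**counts, **target}

print(synthetic_needed)                             # {2: 120}
print(sum(counts.values()), sum(after.values()))    # 1848 1968

# Ratio between the largest and the smallest class, before and after oversampling.
ratio_before = max(counts.values()) / min(counts.values())
ratio_after = max(after.values()) / min(after.values())
print(round(ratio_before, 2), round(ratio_after, 2))  # 5.66 3.02
```

After oversampling, class 2 is no longer the smallest class: class 1, with 131 samples, takes that place, and the training set is still imbalanced relative to classes 3 to 6.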
