# Notebook 4: Model Training

This notebook trains and evaluates several classification models:
- Logistic Regression
- Random Forest
- Gradient Boosting
- SVM
- K-Nearest Neighbors
`GridSearchCV` is also used to tune the hyperparameters of each model.
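
As a minimal sketch of what the hyperparameter search does, the example below runs `GridSearchCV` over a single parameter on synthetic data. The dataset from `make_classification`, the names `X_demo`, `y_demo`, and `search`, and the `max_iter=1000` setting are assumptions for illustration only; they are not part of this notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data (illustrative, not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Fit one model per value of C on each of 5 cross-validation folds.
# For classifiers, the default score is accuracy.
search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}, cv=5)
search.fit(X_demo, y_demo)

# The candidate with the highest mean cross-validated score is refit on all the data.
print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the refit model, which is the object the training loop later in this notebook uses for prediction.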

In [1]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
# Load the preprocessed data
X_train = pd.read_csv('../data/processed/X_train_transformed.csv')
X_test = pd.read_csv('../data/processed/X_test_transformed.csv')
y_train = pd.read_csv('../data/processed/y_train.csv', header=None).squeeze()
y_test = pd.read_csv('../data/processed/y_test.csv', header=None).squeeze()

In [None]:
# Define the models and their hyperparameter grids
models = {
    'LogisticRegression': LogisticRegression(),
    'RandomForest': RandomForestClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
    'SVM': SVC(probability=True),
    'KNeighbors': KNeighborsClassifier()
}

params = {
    'LogisticRegression': {'C': [0.1, 1, 10]},
    'RandomForest': {'n_estimators': [50, 100], 'max_depth': [5, 10]},
    'GradientBoosting': {'learning_rate': [0.01, 0.1], 'n_estimators': [50, 100]},
    'SVM': {'C': [0.1, 1], 'kernel': ['linear', 'rbf']},
    'KNeighbors': {'n_neighbors': [3, 5, 7]}
}
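
To get a rough sense of the search cost, the sketch below counts the candidate combinations in each grid with `ParameterGrid`. It repeats the `params` dictionary so it runs on its own, and it assumes the 5-fold cross-validation used in the training cell; `cv_folds` and `counts` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as defined above, repeated so this sketch is self-contained.
params = {
    'LogisticRegression': {'C': [0.1, 1, 10]},
    'RandomForest': {'n_estimators': [50, 100], 'max_depth': [5, 10]},
    'GradientBoosting': {'learning_rate': [0.01, 0.1], 'n_estimators': [50, 100]},
    'SVM': {'C': [0.1, 1], 'kernel': ['linear', 'rbf']},
    'KNeighbors': {'n_neighbors': [3, 5, 7]},
}

cv_folds = 5  # assumed to match cv=5 in the training loop

# len(ParameterGrid(grid)) is the number of hyperparameter combinations.
counts = {name: len(ParameterGrid(grid)) for name, grid in params.items()}
for name, n_candidates in counts.items():
    print(f'{name}: {n_candidates} candidates, {n_candidates * cv_folds} CV fits')
```

Each grid is small (three or four candidates), so the whole search amounts to 18 candidates and 90 cross-validation fits, plus one refit per model.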


In [None]:

# Train each model with a grid search and evaluate it on the test set
best_models = {}
results = {}
for name, model in models.items():
    print(f'Training {name}...')
    clf = GridSearchCV(model, params[name], cv=5)
    clf.fit(X_train, y_train)
    best_models[name] = clf.best_estimator_
    y_pred = clf.best_estimator_.predict(X_test)
    # Positive-class probabilities, when the model exposes them (needed for ROC-AUC)
    y_proba = clf.best_estimator_.predict_proba(X_test)[:, 1] if hasattr(clf.best_estimator_, 'predict_proba') else None

    metrics = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred)
    }
    if y_proba is not None:
        metrics['ROC-AUC'] = roc_auc_score(y_test, y_proba)

    results[name] = metrics

Training LogisticRegression...
Training RandomForest...
Training GradientBoosting...
Training SVM...
Training KNeighbors...
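
The loop above keeps only `best_estimator_`, so the winning hyperparameters and the cross-validation scores are not shown. One possible way to inspect them is sketched below on synthetic data; `X_demo`, `y_demo`, `search`, and `cv_table` are hypothetical names, and the data is not the notebook's dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for X_train / y_train (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}, cv=5)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate; keep the columns that matter for comparison.
cv_table = pd.DataFrame(search.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
print(cv_table.sort_values('rank_test_score'))
print(search.best_params_)
```

In the notebook's loop, the same attributes are available on `clf` after `clf.fit(X_train, y_train)`.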


In [None]:
# Results
results_df = pd.DataFrame(results).T
print(results_df)

                    Accuracy  Precision    Recall  F1-Score   ROC-AUC
LogisticRegression  0.770732   0.822430  0.758621  0.789238  0.875339
RandomForest        0.990244   1.000000  0.982759  0.991304  1.000000
GradientBoosting    0.936585   0.940171  0.948276  0.944206  0.977431
SVM                 0.819512   0.869159  0.801724  0.834081  0.894905
KNeighbors          0.941463   0.964286  0.931034  0.947368  0.987602


##### Random Forest is the best-performing model overall, with the highest score on every metric:
- Accuracy: 0.990
- Precision: 1.000
- Recall: 0.983
- F1-Score: 0.991
- ROC-AUC: 1.000
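
The comparison can also be checked programmatically. The sketch below rebuilds the results table from the values printed above and finds the top model for each metric; `best_per_metric` is an illustrative name, and in the notebook itself `results_df` already exists and would not need to be retyped.

```python
import pandas as pd

# Values copied from the printed results table.
results_df = pd.DataFrame(
    {
        'Accuracy':  [0.770732, 0.990244, 0.936585, 0.819512, 0.941463],
        'Precision': [0.822430, 1.000000, 0.940171, 0.869159, 0.964286],
        'Recall':    [0.758621, 0.982759, 0.948276, 0.801724, 0.931034],
        'F1-Score':  [0.789238, 0.991304, 0.944206, 0.834081, 0.947368],
        'ROC-AUC':   [0.875339, 1.000000, 0.977431, 0.894905, 0.987602],
    },
    index=['LogisticRegression', 'RandomForest', 'GradientBoosting', 'SVM', 'KNeighbors'],
)

# idxmax returns, for each metric column, the model with the highest value.
best_per_metric = results_df.idxmax()
print(best_per_metric)

# Full ranking by F1-Score, rounded for readability.
print(results_df.sort_values('F1-Score', ascending=False).round(3))
```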