### Data Augmentation and Model Validation

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut


In [2]:
data = pd.read_csv('../data/new_data.csv')

In [3]:
X,y = data.drop('Estado al egreso', axis=1), data['Estado al egreso']

In [4]:
import pickle as pkl

with open('../models/new_rf.pkl', 'rb') as file:
    best_rf = pkl.load(file)

For the expanded dataset, SMOTE (Synthetic Minority Over-sampling Technique) was applied inside each leave-one-out fold, keeping the procedure consistent with our original model selection process. The previously tuned Random Forest (optimized with 5 key features) is loaded from disk and refit on each resampled training fold, so the held-out sample never influences the synthetic data.
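
To make the oversampling step concrete, here is a minimal pure-Python sketch of the interpolation idea behind SMOTE. It is an illustration under simplifying assumptions, not the `imblearn` implementation used in this notebook: the function name `smote_like_samples` and the toy points are hypothetical, and distances are plain Euclidean.

```python
import math
import random


def smote_like_samples(minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    # A point needs at least one neighbour other than itself, so k is capped
    # at len(minority) - 1 (the same idea as min(5, n_minority - 1)).
    k = min(k, len(minority) - 1)
    if k < 1:
        return []  # too few minority samples to interpolate
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy minority class with two features
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_samples(minority_points, n_new=3, k=5)
print(new_points)
```

Every synthetic point lies on a segment between two real minority samples, which is why the oversampling has to be done on the training fold only.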

In [5]:
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.base import clone
from sklearn.metrics import precision_score, recall_score

loo = LeaveOneOut()
y_true = []
y_pred = []

for train_idx, test_idx in loo.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # SMOTE requires k_neighbors to be smaller than the minority class size
    n_minority = min(Counter(y_train).values())
    k = min(5, n_minority - 1)

    # Oversample the training fold only; the held-out sample is never resampled
    if k > 0:
        smote = SMOTE(k_neighbors=k, random_state=42)
        X_res, y_res = smote.fit_resample(X_train, y_train)
    else:
        X_res, y_res = X_train, y_train

    # Refit a fresh copy of the tuned model (same hyperparameters) on each fold
    model = clone(best_rf)
    model.fit(X_res, y_res)
    y_pred_i = model.predict(X_test)

    y_true.append(y_test.values[0])
    y_pred.append(y_pred_i[0])

f1 = f1_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)

print("\n" + "="*50)
print("Results with SMOTE (5 features):")
print(f"F1: {f1:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}")
print("="*50)


Results with SMOTE (5 features):
F1: 0.963, Precision: 0.929, Recall: 1.000
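
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall using the harmonic-mean definition. This sketch uses only the rounded values printed above, so it confirms that the three numbers agree with each other; it says nothing about the underlying confusion matrix.

```python
# Rounded values copied from the output above
precision = 0.929
recall = 1.000

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from reported precision and recall: {f1:.3f}")  # prints 0.963
```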


#### Conclusions:

Performance with SMOTE-augmented training folds was comparable to that of the original model (F1 ≈ 0.96 in both cases), which suggests that:

- The model was already robust – The original Random Forest (5 features) generalized well without requiring synthetic data.

- Limited benefit from oversampling – Since performance did not improve, the initial training data likely captured the underlying clinical patterns sufficiently.

- Stable decision boundaries – The model’s predictive logic remained consistent even with expanded training folds, suggesting no over-reliance on specific data points.

