# <center>**Final Project**</center>

### <center>**Stroke Prediction**</center>

#### <center>Machine Learning - CEIA, Fiuba</center>

---

**Group members**

- Espínola, Carla
- Gambarte, Antonella
- Putrino, Daniela
- Silvera, Ricardo

---


## **Introduction**

This project compares several models for predicting whether a person is likely to have a stroke. We use the Stroke Prediction Dataset available on Kaggle.

Dataset link: [https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)

The work covers exploratory data analysis, preprocessing and data preparation, the use of several models for prediction, and the evaluation of the results obtained.


The goal of this notebook is to use an AutoML tool to get a broad overview of the available classifiers and their general performance, so that the Final Project can study the most relevant ones in more depth.

This matters especially because the dataset is heavily imbalanced. We therefore want high recall, but precision also matters, since we do not want to flag everyone as positive for a stroke. For that reason we analyze both recall and F1-score, the latter as a compromise between the two.
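
The trade-off described above can be illustrated with a small worked example. This is a minimal sketch that uses hypothetical confusion-matrix counts (1000 people, 50 of them positive), not values from this dataset, and the helper `classification_metrics` is introduced here only for illustration. It shows why accuracy alone is misleading on imbalanced data: a classifier that always predicts "no stroke" gets 95% accuracy with zero recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Convention used in this sketch: undefined ratios (0/0) are reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 1000 people, 50 of whom had a stroke.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)
high_recall = classification_metrics(tp=40, fp=250, fn=10, tn=700)

for name, metrics in [("always negative", always_negative),
                      ("high recall", high_recall)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The second classifier loses accuracy but finds 80% of the positive cases. Its low precision pulls the F1-score down, which is why both metrics are worth tracking together.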

### Loading the dataset


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score, roc_curve, auc, precision_score, recall_score
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler

In [10]:
df_stroke = pd.read_csv("healthcare-dataset-stroke-data.csv")
df_stroke.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [12]:
# PyCaret AutoML for a classification model
from pycaret.classification import *


In [13]:
import numpy as np
np.seterr(all='ignore')

{'divide': 'warn', 'over': 'warn', 'under': 'ignore', 'invalid': 'warn'}

In [15]:
# Convert object columns to string
for col in df_stroke.select_dtypes(include='object').columns:
    df_stroke[col] = df_stroke[col].astype(str)

In [None]:
# PyCaret setup
# Note: the `id` column is not excluded here, so it is treated as a numeric feature.
s = setup(
    data=df_stroke,
    target='stroke',
    train_size=0.8,  # Fraction of the data used for training
    session_id=123,
    preprocess=True,
    normalize=True,
    normalize_method='zscore',  # 'zscore' (StandardScaler) or 'minmax'
    remove_outliers=True,
    outliers_threshold=0.01,
    categorical_imputation='mode',
    numeric_imputation='mean',
    fix_imbalance=True,
    fix_imbalance_method='smote',  # 'random', 'smote', 'adasyn', etc.
    verbose=True,  # True to show detailed information
)

# Retrieve the preprocessed data

X_train_transformed = s.get_config('X_train_transformed')
y_train_transformed = s.get_config('y_train_transformed')

X_test_transformed = s.get_config('X_test_transformed')
y_test_transformed = s.get_config('y_test_transformed')


print(f"Preprocessed data: \n\n {X_train_transformed.head()}")
print(f"Shape of the preprocessed dataset: {X_train_transformed.shape}")
print(f"Nulls: {X_train_transformed.isnull().sum().sum()}")


| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | stroke |
| 2 | Target type | Binary |
| 3 | Original data shape | (5110, 12) |
| 4 | Transformed data shape | (8734, 21) |
| 5 | Transformed train set shape | (7712, 21) |
| 6 | Transformed test set shape | (1022, 21) |
| 7 | Numeric features | 6 |
| 8 | Categorical features | 5 |
| 9 | Rows with missing values | 3.9% |


Preprocessed data: 

             id  gender_Female  gender_Male  gender_Other       age  \
795   1.119380       0.935508    -0.935267     -0.011388 -1.090430   
4106  0.639956      -1.275681     1.276030     -0.011388 -0.622590   
1318 -1.391224      -1.275681     1.276030     -0.011388 -0.201533   
4846 -0.425722       0.935508    -0.935267     -0.011388 -0.950078   
532  -0.249406       0.935508    -0.935267     -0.011388 -1.371135   

      hypertension  heart_disease  ever_married  work_type_Govt_job  \
795      -0.493311      -0.395372      0.558492            2.915474   
4106     -0.493311      -0.395372      0.558492            2.915474   
1318     -0.493311       3.361229     -1.948108           -0.398395   
4846     -0.493311      -0.395372      0.558492           -0.398395   
532      -0.493311      -0.395372      0.558492           -0.398395   

      work_type_Private  work_type_Self-employed  work_type_children  \
795           -1.344499                -0.550602        
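
The setup summary reports 7712 rows in the transformed train set, whereas an 80% split of the 5110 original rows gives 4088. The increase is consistent with the minority class being oversampled by `fix_imbalance_method='smote'`. The sketch below shows the idea behind SMOTE-style oversampling: each synthetic point is interpolated between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not the library implementation used by PyCaret. The function `smote_like_samples` and the sample values are hypothetical.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=123):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + lam * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical, already-scaled minority samples (age, avg_glucose_level).
minority = [(1.2, 2.1), (1.5, 1.8), (0.9, 2.4), (1.1, 0.3)]
new_points = smote_like_samples(minority, n_new=6)
print(len(new_points), new_points[0])
```

Because every synthetic point lies on a segment between two real minority samples, the new rows stay inside the region already occupied by the minority class instead of duplicating existing rows.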

In [21]:
# Classification functional API example

# Model training and selection
# compare_models cross-validates several models with default hyperparameters and returns the best one.
# Models are ranked by Accuracy unless the `sort` argument says otherwise.
best = compare_models()

# Evaluate the trained model
print("Metrics of the best model (untuned) on the test set: \n")
evaluate_model(best)  # Evaluates the best model on the test set


# Predict on the hold-out/test set
pred_holdout = predict_model(best)  # Predict on the test set, or apply the transformations to unseen data and predict
# Predict on new data
# predictions = predict_model(best, data=new_data)

print(f"Predictions on the test set: \n {pred_holdout}")

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| dummy | Dummy Classifier | 0.9513 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.081 |
| rf | Random Forest Classifier | 0.9491 | 0.7732 | 0.0303 | 0.29 | 0.0532 | 0.0448 | 0.0778 | 0.252 |
| lightgbm | Light Gradient Boosting Machine | 0.9469 | 0.8002 | 0.0705 | 0.2933 | 0.1123 | 0.0963 | 0.1234 | 0.371 |
| gbc | Gradient Boosting Classifier | 0.9462 | 0.7957 | 0.0303 | 0.156 | 0.0501 | 0.0373 | 0.0503 | 0.42 |
| et | Extra Trees Classifier | 0.945 | 0.7416 | 0.0455 | 0.2333 | 0.0745 | 0.0582 | 0.0808 | 0.188 |
| ada | Ada Boost Classifier | 0.9388 | 0.8035 | 0.0705 | 0.1977 | 0.098 | 0.0734 | 0.0866 | 0.183 |
| dt | Decision Tree Classifier | 0.9024 | 0.5412 | 0.1411 | 0.1096 | 0.1214 | 0.071 | 0.0725 | 0.064 |
| knn | K Neighbors Classifier | 0.8439 | 0.6306 | 0.2955 | 0.1055 | 0.1551 | 0.09 | 0.1049 | 0.456 |
| lr | Logistic Regression | 0.7441 | 0.8292 | 0.7895 | 0.1368 | 0.2328 | 0.1633 | 0.2549 | 0.731 |
| lda | Linear Discriminant Analysis | 0.7317 | 0.831 | 0.8095 | 0.133 | 0.2283 | 0.1578 | 0.2535 | 0.08 |


Metrics of the best model (untuned) on the test set: 




| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Dummy Classifier | 0.9511 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Predictions on the test set: 
          id  gender   age  hypertension  heart_disease ever_married  \
2245  14404  Female  13.0             0              0           No   
2111  49254    Male  57.0             1              0          Yes   
4182  27119  Female  28.0             0              0           No   
3718  39632  Female  53.0             0              0          Yes   
2837   9730    Male  27.0             0              0          Yes   
...     ...     ...   ...           ...            ...          ...   
4605  37553    Male  58.0             0              0          Yes   
679   68131  Female  27.0             0              0           No   
1552  24567    Male  51.0             0              0          Yes   
2761   1225    Male  43.0             0              0          Yes   
1173  47735  Female  59.0             0              0          Yes   

          work_type Residence_type  avg_glucose_level        bmi  \
2245       children          Urban          9
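
The hold-out evaluation above was run on the Dummy Classifier because the comparison ranks models by accuracy. The sketch below re-ranks the cross-validation scores from the comparison table by recall and by F1-score to show how the choice of metric changes the "best" model. The scores are copied from that table. The dictionary layout and the helper `best_model` are illustrative names, not part of PyCaret.

```python
# Cross-validation scores copied from the comparison table above
# (model id -> (Accuracy, Recall, F1)).
cv_scores = {
    "dummy": (0.9513, 0.0000, 0.0000),
    "rf": (0.9491, 0.0303, 0.0532),
    "lightgbm": (0.9469, 0.0705, 0.1123),
    "gbc": (0.9462, 0.0303, 0.0501),
    "et": (0.9450, 0.0455, 0.0745),
    "ada": (0.9388, 0.0705, 0.0980),
    "dt": (0.9024, 0.1411, 0.1214),
    "knn": (0.8439, 0.2955, 0.1551),
    "lr": (0.7441, 0.7895, 0.2328),
    "lda": (0.7317, 0.8095, 0.2283),
}
METRIC_INDEX = {"Accuracy": 0, "Recall": 1, "F1": 2}


def best_model(scores, metric):
    """Return the model id with the highest value for the given metric."""
    idx = METRIC_INDEX[metric]
    return max(scores, key=lambda model: scores[model][idx])


for metric in METRIC_INDEX:
    print(f"Best by {metric}: {best_model(cv_scores, metric)}")
```

In PyCaret itself, `compare_models` accepts a `sort` argument, so a call such as `compare_models(sort='F1')` or `compare_models(sort='Recall')` should select a model that matches the stated goal of this notebook instead of the accuracy-based default.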

### Overall conclusion

LDA or the regularized classifier are the most promising of the options available in PyCaret. In cross-validation, LDA reaches a recall of 0.81 and logistic regression 0.79, both with an F1-score of about 0.23, whereas the six models with the highest accuracy all stay below 0.08 recall. These are interesting options to explore further, but since we did not cover them in detail in the course, the Final Project notebook will focus on SVM and, for comparison, on other models seen in class.