## Heart disease prediction using decision trees

The dataset used below comes from Kaggle: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

It contains the following columns (11 features plus the `HeartDisease` target):

1. **Age**: age of the patient [years]  
2. **Sex**: sex of the patient [M: Male, F: Female]  
3. **ChestPainType**: chest pain type  
   - TA: Typical Angina  
   - ATA: Atypical Angina  
   - NAP: Non-Anginal Pain  
   - ASY: Asymptomatic  
4. **RestingBP**: resting blood pressure [mm Hg]  
5. **Cholesterol**: serum cholesterol [mg/dl]  
6. **FastingBS**: fasting blood sugar  
   - 1: if FastingBS > 120 mg/dl  
   - 0: otherwise  
7. **RestingECG**: resting electrocardiogram results  
   - Normal: Normal  
   - ST: ST-T wave abnormality  
   - LVH: probable or definite left ventricular hypertrophy  
8. **MaxHR**: maximum heart rate achieved (between 60 and 202)  
9. **ExerciseAngina**: exercise-induced angina  
   - Y: Yes  
   - N: No  
10. **Oldpeak**: ST depression induced by exercise relative to rest  
11. **ST_Slope**: the slope of the peak exercise ST segment  
    - Up: upsloping  
    - Flat: flat  
    - Down: downsloping  
12. **HeartDisease**: output class  
    - 1: heart disease  
    - 0: normal  

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load the data
df = pd.read_csv('sample_data/heart.csv')
df.head()

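|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |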


In [None]:
# One-hot encode all categorical variables automatically;
# drop_first=True drops one level per variable to avoid redundant columns
df_encoded = pd.get_dummies(df, drop_first=True)

# Separate the features (X) from the target to predict (y)
X = df_encoded.drop("HeartDisease", axis=1)
y = df_encoded["HeartDisease"]
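
To make the effect of `pd.get_dummies(..., drop_first=True)` concrete, here is a minimal sketch on a toy frame. The rows are hypothetical and only the column names mirror the dataset. Each categorical column becomes one indicator column per level, minus the first level in sorted order.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
    "HeartDisease": [0, 1, 0],
})

# Numeric columns pass through unchanged; text columns are expanded.
toy_encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(toy_encoded.columns)
print(encoded_columns)
```

With `drop_first=True`, `Sex` yields only `Sex_M` (the level `F` is implied when `Sex_M` is false), and `ST_Slope` yields `ST_Slope_Flat` and `ST_Slope_Up`.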


In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)
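
The `stratify=y` argument keeps the class proportions the same in both splits. The sketch below illustrates this on synthetic labels; the 60/40 class ratio and the names `X_demo` and `y_demo` are assumptions chosen for illustration, not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 positives and 40 negatives.
y_demo = np.array([1] * 60 + [0] * 40)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fraction of positives in each split; stratification keeps both at 0.6.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Without `stratify`, a random split of a small dataset can leave the test set with a noticeably different class balance, which makes the reported metrics harder to compare with training performance.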


In [None]:
# Train the model
clf = DecisionTreeClassifier(
    criterion='gini',       # or 'entropy'
    max_depth=4,            # limit depth to reduce overfitting
    random_state=42
)

clf.fit(X_train, y_train)
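
A shallow tree like this one can be read as a set of if/else rules. As a minimal sketch, scikit-learn's `export_text` prints those rules. The example trains on synthetic data from `make_classification`, and the feature names `f0` to `f3` are placeholders, so the printed rules do not describe the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 200 samples, 4 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2,
    n_redundant=0, random_state=42,
)
feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names

demo_clf = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
demo_clf.fit(X_demo, y_demo)

# Text rendering of the learned splits, one line per node.
rules = export_text(demo_clf, feature_names=feature_names)
tree_depth = demo_clf.get_depth()
print(rules)
```

On the notebook's own model, the equivalent call would be `export_text(clf, feature_names=list(X.columns))`.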


In [None]:
# Predict on the test set and evaluate
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[63 19]
 [11 91]]
              precision    recall  f1-score   support

           0       0.85      0.77      0.81        82
           1       0.83      0.89      0.86       102

    accuracy                           0.84       184
   macro avg       0.84      0.83      0.83       184
weighted avg       0.84      0.84      0.84       184
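
The figures in the report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below recomputes the class 1 metrics and the overall accuracy by hand from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[63, 19],
               [11, 91]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # 91 / 110
recall_1 = tp / (tp + fn)      # 91 / 102
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()  # 154 / 184

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

The 11 false negatives are patients with heart disease that the model labeled as normal. In a screening setting these are usually the costlier errors, so recall for class 1 (0.89) deserves at least as much attention as overall accuracy.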

