In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**Features**

The dataset also includes 13 features that describe each patient's health condition. They are described below:


1. **age**: Patient's age
  - Ages in the sample range from 27 to 77 years, in XX.X format (for example, 27.0).
2. **sex**: Patient's sex
  - 1: Male
  - 0: Female
3. **cp**: Chest pain type
  - 1: Typical angina
  - 2: Atypical angina
  - 3: Non-anginal pain
  - 4: Asymptomatic
4. **trestbps**: Resting blood pressure (in mm Hg on admission to the hospital)
5. **chol**: Serum cholesterol in mg/dl
6. **fbs**: Fasting blood sugar > 120 mg/dl (1 = yes; 0 = no)
7. **restecg**: Resting electrocardiographic results
  - 0: Normal
  - 1: ST-T wave abnormality
  - 2: Probable or definite left ventricular hypertrophy
8. **thalach**: Maximum heart rate achieved
9. **exang**: Exercise-induced angina (1 = yes; 0 = no)
10. **oldpeak**: ST depression induced by exercise relative to rest
11. **slope**: Slope of the peak exercise ST segment
  - 1: Upsloping
  - 2: Flat
  - 3: Downsloping
12. **ca**: Number of major vessels (0-3) colored by fluoroscopy
13. **thal**:
  - 3: Normal
  - 6: Fixed defect
  - 7: Reversible defect
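
The integer codes in the list above can be mapped to readable labels when inspecting the data. The following is a minimal sketch, assuming two hypothetical dictionaries (`CP_LABELS`, `SLOPE_LABELS`) transcribed from the list; they are not part of the original notebook, and the values are toy data.

```python
import pandas as pd

# Hypothetical lookup tables transcribed from the feature list above
CP_LABELS = {1: "typical angina", 2: "atypical angina",
             3: "non-anginal pain", 4: "asymptomatic"}
SLOPE_LABELS = {1: "upsloping", 2: "flat", 3: "downsloping"}

# Toy codes for illustration only; -9 is not a documented slope code
example = pd.DataFrame({"cp": [1, 4, 3], "slope": [1, -9, 3]})

decoded = pd.DataFrame({
    "cp": example["cp"].map(CP_LABELS),
    "slope": example["slope"].map(SLOPE_LABELS),  # unmapped codes become NaN
})
print(decoded)
```

Codes that are missing from a dictionary come back as NaN, which makes undocumented values easy to spot.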

We load the training dataset.

In [2]:
data_train = pd.read_csv('/Users/emart/Documents/GitHub/mdata_dp3/data/train.csv')
data_train.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,label
0,51.0,1.0,1.0,125.0,213.0,0.0,2.0,125.0,1.0,1.4,1.0,1.0,3.0,0
1,54.0,1.0,3.0,120.0,237.0,0.0,0.0,150.0,1.0,1.5,-9.0,-9.0,7.0,2
2,63.0,1.0,4.0,140.0,0.0,?,2.0,149.0,0.0,2.0,1.0,?,?,2
3,52.0,0.0,2.0,140.0,-9.0,0.0,0.0,140.0,0.0,0.0,-9.0,-9.0,-9.0,0
4,55.0,1.0,4.0,140.0,217.0,0.0,0.0,111.0,1.0,5.6,3.0,0.0,7.0,3


We load the test dataset.

In [3]:
data_test = pd.read_csv('/Users/emart/Documents/GitHub/mdata_dp3/data/test.csv')
data_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,57.0,1.0,4.0,156.0,173.0,0,2.0,119.0,1.0,3.0,3.0,?,?
1,52.0,1.0,2.0,160.0,196.0,0.0,0.0,165.0,0.0,0.0,-9.0,-9.0,-9.0
2,48.0,1.0,2.0,100.0,-9.0,0.0,0.0,100.0,0.0,0.0,-9.0,-9.0,-9.0
3,62.0,1.0,4.0,115.0,0.0,?,0.0,128.0,1.0,2.5,3.0,?,?
4,51.0,1.0,3.0,110.0,175.0,0.0,0.0,123.0,0.0,0.6,1.0,0.0,3.0


We convert unknown values to NaN.

In [4]:
columns_train = data_train.columns
for column in columns_train:
    # Replace the '?' placeholder with NaN, then cast the column to float.
    data_train[column] = data_train[column].replace('?', float('nan'))
    data_train[column] = data_train[column].astype(float)

data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       732 non-null    float64
 1   sex       732 non-null    float64
 2   cp        732 non-null    float64
 3   trestbps  685 non-null    float64
 4   chol      727 non-null    float64
 5   fbs       674 non-null    float64
 6   restecg   732 non-null    float64
 7   thalach   688 non-null    float64
 8   exang     688 non-null    float64
 9   oldpeak   683 non-null    float64
 10  slope     637 non-null    float64
 11  ca        483 non-null    float64
 12  thal      563 non-null    float64
 13  label     732 non-null    float64
dtypes: float64(14)
memory usage: 80.2 KB


In [5]:
columns_test = data_test.columns
for column in columns_test:
    # Replace the '?' placeholder with NaN, then cast the column to float.
    data_test[column] = data_test[column].replace('?', float('nan'))
    data_test[column] = data_test[column].astype(float)

data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184 entries, 0 to 183
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       184 non-null    float64
 1   sex       184 non-null    float64
 2   cp        184 non-null    float64
 3   trestbps  173 non-null    float64
 4   chol      182 non-null    float64
 5   fbs       161 non-null    float64
 6   restecg   183 non-null    float64
 7   thalach   174 non-null    float64
 8   exang     174 non-null    float64
 9   oldpeak   171 non-null    float64
 10  slope     160 non-null    float64
 11  ca        115 non-null    float64
 12  thal      135 non-null    float64
dtypes: float64(13)
memory usage: 18.8 KB
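
The `'?'` placeholders could alternatively be handled at load time. The sketch below is one possible variant, not what the notebook does: it passes `na_values` to `pd.read_csv` so that `'?'` is parsed as NaN and the columns arrive as floats. The in-memory CSV is toy data.

```python
import io

import pandas as pd

# Toy CSV mimicking the '?' and -9.0 markers seen in the previews above
sample_csv = io.StringIO(
    "age,chol,fbs,ca\n"
    "63.0,0.0,?,?\n"
    "52.0,-9.0,0.0,-9.0\n"
)

# '?' is read as NaN, so every column can be parsed as float64 directly
sample = pd.read_csv(sample_csv, na_values=["?"])
print(sample.dtypes)
```

The -9.0 values are left untouched here, since they are filtered column by column later on.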


#### VARIABLE TREATMENT.

Next, based on the description in the assignment statement, we define which variables are continuous and which are categorical:
- **Continuous**: age, trestbps, chol, thalach, oldpeak.
- **Categorical**: sex, cp, fbs, restecg, exang, slope, ca, thal
- **Target**: label

Every column is currently stored as float64, so not all variables have a type that matches their nature. We therefore set the appropriate type for each one.

Age must be between 27 and 77 years and cannot have decimals.

In [6]:
# AGE
# ---------------------
# data_train['age'].describe()
# data_test['age'].describe()
# ---------------------

data_train['age'] = data_train['age'].astype(int)
data_test['age'] = data_test['age'].astype(int)

data_train['age'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 732 entries, 0 to 731
Series name: age
Non-Null Count  Dtype
--------------  -----
732 non-null    int64
dtypes: int64(1)
memory usage: 5.8 KB
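
Before casting to `int`, it can be worth confirming that no age carries a fractional part and that all values fall in the stated range. This is a small sketch on toy values; the checks are a suggestion and do not appear in the original notebook.

```python
import pandas as pd

ages = pd.Series([51.0, 54.0, 63.0])  # toy values in the XX.X format

# True when every value is a whole number, so astype(int) loses nothing
all_whole = bool((ages % 1 == 0).all())
# True when every value lies in the documented 27-77 range (inclusive)
in_range = bool(ages.between(27, 77).all())

ages_int = ages.astype(int)
print(all_whole, in_range, ages_int.tolist())
```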


Sex must be 0 (female) or 1 (male).

In [7]:
# SEX
# ---------------------
# data_train['sex'].unique()
# data_test['sex'].unique()
# ---------------------

data_train['sex'] = data_train['sex'].astype('category')
data_test['sex'] = data_test['sex'].astype('category')

data_train['sex'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 732 entries, 0 to 731
Series name: sex
Non-Null Count  Dtype   
--------------  -----   
732 non-null    category
dtypes: category(1)
memory usage: 988.0 bytes


The variable 'cp' is categorical, with values 1, 2, 3 and 4.

In [8]:
# CP 
# ---------------------
# data_train['cp'].unique()
# data_test['cp'].unique()
# ---------------------

data_train['cp'] = data_train['cp'].astype('category')
data_test['cp'] = data_test['cp'].astype('category')

data_train['cp'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 732 entries, 0 to 731
Series name: cp
Non-Null Count  Dtype   
--------------  -----   
732 non-null    category
dtypes: category(1)
memory usage: 1.0 KB


Resting blood pressure, 'trestbps', is continuous and is expected to be positive. We do not examine whether a given value is plausible or should be treated as missing.

In [9]:
# TRESTBPS
# ---------------------
# data_train['trestbps'].describe()
# data_test['trestbps'].describe() - Contains negative values.
# ---------------------

# Negative values were detected in the test dataset; non-positive values are set to missing in both datasets.
data_test['trestbps'] = data_test['trestbps'].apply(lambda x: x if x > 0 else None)
data_train['trestbps'] = data_train['trestbps'].apply(lambda x: x if x > 0 else None)

data_test['trestbps'].describe()

count    172.00000
mean     132.80814
std       18.51547
min       80.00000
25%      120.00000
50%      130.00000
75%      144.00000
max      200.00000
Name: trestbps, dtype: float64

In [10]:
data_train['trestbps'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 732 entries, 0 to 731
Series name: trestbps
Non-Null Count  Dtype  
--------------  -----  
684 non-null    float64
dtypes: float64(1)
memory usage: 5.8 KB
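
The element-wise `apply` with a lambda can also be written with `Series.where`, which keeps the values where a condition holds and sets the rest to NaN. This is a minimal sketch on toy values, not the notebook's data.

```python
import pandas as pd

# Toy column with a valid reading, a -9 marker, a zero and a missing value
s = pd.Series([125.0, -9.0, 0.0, None])

# Keep strictly positive values; everything else becomes NaN
cleaned = s.where(s > 0)
print(cleaned)
```

As with the lambda version, a value of 0 is treated as missing because the condition is strictly `> 0`.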


Cholesterol cannot be negative, and the filter below also treats 0 as missing. We do not examine whether a given value is plausible or should be treated as missing.

In [11]:
# CHOL
# ---------------------
# data_train['chol'].describe() - Contains negative values.
# data_test['chol'].describe() - Contains negative values.
# ---------------------

# Negative values were detected in both datasets; non-positive values are set to missing.
data_train['chol'] = data_train['chol'].apply(lambda x: x if x > 0 else None)
data_test['chol'] = data_test['chol'].apply(lambda x: x if x > 0 else None)

print("Train:")
print("---------------------")
print(data_train['chol'].describe())
print("\nTest:")
print("---------------------")
print(data_test['chol'].describe())

Train:
---------------------
count    576.000000
mean     246.821181
std       60.057170
min       85.000000
25%      208.750000
50%      240.000000
75%      279.250000
max      603.000000
Name: chol, dtype: float64

Test:
---------------------
count    139.000000
mean     246.582734
std       52.469853
min      149.000000
25%      212.000000
50%      238.000000
75%      274.000000
max      407.000000
Name: chol, dtype: float64


The variable 'fbs' is categorical and only admits the values 0 and 1.

In [12]:
# FBS
# ---------------------
# data_train['fbs'].unique() - Contains values other than 0 or 1.
# data_test['fbs'].unique()
# ---------------------

# Values other than 0 or 1 were detected in the training dataset; they are set to missing in both datasets.
data_train['fbs'] = data_train['fbs'].apply(lambda x: x if x == 0 or x == 1 else None)
data_test['fbs'] = data_test['fbs'].apply(lambda x: x if x == 0 or x == 1 else None)

data_train['fbs'].unique()

array([ 0., nan,  1.])

In [13]:
data_train['fbs'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 732 entries, 0 to 731
Series name: fbs
Non-Null Count  Dtype  
--------------  -----  
666 non-null    float64
dtypes: float64(1)
memory usage: 5.8 KB


The variable 'restecg' only admits the values 0, 1 or 2.

In [14]:
# RESTECG
# ---------------------
# data_train['restecg'].unique()
# data_test['restecg'].unique() - Contains values other than 0, 1 or 2.
# ---------------------

# Values other than 0, 1 or 2 were detected in the test dataset; they are set to missing in both datasets.
data_test['restecg'] = data_test['restecg'].apply(lambda x: x if x == 0 or x == 1 or x== 2 else None)
data_train['restecg'] = data_train['restecg'].apply(lambda x: x if x == 0 or x == 1 or x== 2 else None)

data_test['restecg'].unique()

array([ 2.,  0.,  1., nan])

In [15]:
data_test['restecg'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 184 entries, 0 to 183
Series name: restecg
Non-Null Count  Dtype  
--------------  -----  
182 non-null    float64
dtypes: float64(1)
memory usage: 1.6 KB


The variable 'thalach' is the maximum heart rate achieved, so it is continuous. It cannot be negative. We do not examine whether a given value is plausible or should be treated as missing.

In [16]:
# THALACH
# ---------------------
# data_train['thalach'].describe() - Contains negative values.
# data_test['thalach'].describe() - Contains negative values.
# ---------------------

# Negative values were detected; non-positive values are set to missing in both datasets.
data_train['thalach'] = data_train['thalach'].apply(lambda x: x if x > 0 else None)
data_test['thalach'] = data_test['thalach'].apply(lambda x: x if x > 0 else None)

print("Train:")
print("---------------------")
print(data_train['thalach'].describe())
print("\nTest:")
print("---------------------")
print(data_test['thalach'].describe())

Train:
---------------------
count    688.000000
mean     138.132267
std       25.963443
min       60.000000
25%      120.000000
50%      140.000000
75%      158.250000
max      202.000000
Name: thalach, dtype: float64

Test:
---------------------
count    173.000000
mean     135.150289
std       25.780163
min       67.000000
25%      118.000000
50%      137.000000
75%      154.000000
max      188.000000
Name: thalach, dtype: float64


The variable 'exang' is categorical and only admits the values 0 or 1.

In [17]:
# EXANG
# ---------------------
# data_train['exang'].unique()
# data_test['exang'].unique() - Contains values other than 0 or 1.
# ---------------------

# Values other than 0 or 1 were detected in the test dataset; they are set to missing in both datasets.
data_test['exang'] = data_test['exang'].apply(lambda x: x if x == 0 or x == 1 else None)
data_train['exang'] = data_train['exang'].apply(lambda x: x if x == 0 or x == 1 else None)

data_test['exang'].unique()

array([ 1.,  0., nan])

In [18]:
data_test['exang'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 184 entries, 0 to 183
Series name: exang
Non-Null Count  Dtype  
--------------  -----  
173 non-null    float64
dtypes: float64(1)
memory usage: 1.6 KB


The variable 'oldpeak' is continuous. It can take negative values.

In [19]:
print("Train:")
print("---------------------")
print(data_train['oldpeak'].describe())
print("\nTest:")
print("---------------------")
print(data_test['oldpeak'].describe())

Train:
---------------------
count    683.000000
mean       0.881259
std        1.112960
min       -2.600000
25%        0.000000
50%        0.500000
75%        1.550000
max        6.200000
Name: oldpeak, dtype: float64

Test:
---------------------
count    171.000000
mean       0.854386
std        0.994941
min       -0.700000
25%        0.000000
50%        0.600000
75%        1.500000
max        4.000000
Name: oldpeak, dtype: float64


The variable 'slope' is categorical and only admits the values 1, 2 or 3.

In [20]:
# SLOPE
# ---------------------
# data_train['slope'].unique() - Contains values other than 1, 2 or 3.
# data_test['slope'].unique() - Contains values other than 1, 2 or 3.
# ---------------------

# Values other than 1, 2 or 3 were detected in both datasets; they are set to missing.
data_test['slope'] = data_test['slope'].apply(lambda x: x if x == 2 or x == 1 or x == 3 else None)
data_train['slope'] = data_train['slope'].apply(lambda x: x if x == 2 or x == 1 or x == 3 else None)

data_train['slope'].unique()

array([ 1., nan,  3.,  2.])

In [21]:
data_train['slope'].describe()

count    485.000000
mean       1.777320
std        0.616493
min        1.000000
25%        1.000000
50%        2.000000
75%        2.000000
max        3.000000
Name: slope, dtype: float64

The variable 'ca' is categorical and only admits the values 0, 1, 2 or 3.

In [22]:
# CA
# ---------------------
# data_train['ca'].unique() - Contains values other than 0, 1, 2 or 3.
# data_test['ca'].unique() - Contains values other than 0, 1, 2 or 3.
# ---------------------

# Values other than 0, 1, 2 or 3 were detected in both datasets; they are set to missing.
data_test['ca'] = data_test['ca'].apply(lambda x: x if x == 2 or x == 1 or x == 3 or x == 0 else None)
data_train['ca'] = data_train['ca'].apply(lambda x: x if x == 2 or x == 1 or x == 3 or x == 0 else None)

data_train['ca'].unique()

array([ 1., nan,  0.,  2.,  3.])

In [23]:
data_train['ca'].describe()

count    253.000000
mean       0.691700
std        0.946909
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        3.000000
Name: ca, dtype: float64

The variable 'thal' is categorical and only admits the values 3, 6 or 7.

In [24]:
# THAL
# ---------------------
# data_train['thal'].unique() - Contains values other than 3, 6 or 7.
# data_test['thal'].unique() - Contains values other than 3, 6 or 7.
# ---------------------

# Values other than 3, 6 or 7 were detected in both datasets; they are set to missing.
data_test['thal'] = data_test['thal'].apply(lambda x: x if x == 6 or x == 7 or x == 3 else None)
data_train['thal'] = data_train['thal'].apply(lambda x: x if x == 6 or x == 7 or x == 3 else None)

data_train['thal'].unique()

array([ 3.,  7., nan,  6.])

In [25]:
data_test['thal'].describe()

count    80.000000
mean      4.937500
std       1.924813
min       3.000000
25%       3.000000
50%       6.000000
75%       7.000000
max       7.000000
Name: thal, dtype: float64
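
The per-column filters above all follow the same pattern, so they could be driven by a single table of allowed codes. The sketch below assumes a hypothetical `ALLOWED` dictionary and a hypothetical helper `mask_invalid_codes`; neither is part of the original notebook, and the frame is toy data.

```python
import pandas as pd

# Hypothetical table of valid codes, taken from the feature descriptions
ALLOWED = {
    "fbs": [0, 1],
    "restecg": [0, 1, 2],
    "exang": [0, 1],
    "slope": [1, 2, 3],
    "ca": [0, 1, 2, 3],
    "thal": [3, 6, 7],
}

def mask_invalid_codes(df, allowed):
    """Return a copy of df where codes outside the allowed set become NaN."""
    out = df.copy()
    for column, codes in allowed.items():
        if column in out.columns:
            out[column] = out[column].where(out[column].isin(codes))
    return out

toy = pd.DataFrame({"slope": [1.0, -9.0, 3.0], "thal": [3.0, 7.0, -9.0]})
cleaned = mask_invalid_codes(toy, ALLOWED)
print(cleaned)
```

Applying the same helper to the training and test frames would keep the two cleaning steps in sync.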

Finally, we cast all variables to their corresponding type.

In [26]:
# Numeric - continuous
data_train['age'] = data_train['age'].astype(int)
data_train['trestbps'] = data_train['trestbps'].astype(float)
data_train['chol'] = data_train['chol'].astype(float)
data_train['thalach'] = data_train['thalach'].astype(float)
data_train['oldpeak'] = data_train['oldpeak'].astype(float)

# Categorical
data_train['sex'] = data_train['sex'].astype('category')
data_train['cp'] = data_train['cp'].astype('category')
data_train['fbs'] = data_train['fbs'].astype('category')
data_train['restecg'] = data_train['restecg'].astype('category')
data_train['exang'] = data_train['exang'].astype('category')
data_train['slope'] = data_train['slope'].astype('category')
data_train['ca'] = data_train['ca'].astype('category')
data_train['thal'] = data_train['thal'].astype('category')
data_train['label'] = data_train['label'].astype('category')

data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       732 non-null    int64   
 1   sex       732 non-null    category
 2   cp        732 non-null    category
 3   trestbps  684 non-null    float64 
 4   chol      576 non-null    float64 
 5   fbs       666 non-null    category
 6   restecg   732 non-null    category
 7   thalach   688 non-null    float64 
 8   exang     688 non-null    category
 9   oldpeak   683 non-null    float64 
 10  slope     485 non-null    category
 11  ca        253 non-null    category
 12  thal      353 non-null    category
 13  label     732 non-null    category
dtypes: category(9), float64(4), int64(1)
memory usage: 36.5 KB


In [27]:
# Numeric - continuous
data_test['age'] = data_test['age'].astype(int)
data_test['trestbps'] = data_test['trestbps'].astype(float)
data_test['chol'] = data_test['chol'].astype(float)
data_test['thalach'] = data_test['thalach'].astype(float)
data_test['oldpeak'] = data_test['oldpeak'].astype(float)

# Categorical
data_test['sex'] = data_test['sex'].astype('category')
data_test['cp'] = data_test['cp'].astype('category')
data_test['fbs'] = data_test['fbs'].astype('category')
data_test['restecg'] = data_test['restecg'].astype('category')
data_test['exang'] = data_test['exang'].astype('category')
data_test['slope'] = data_test['slope'].astype('category')
data_test['ca'] = data_test['ca'].astype('category')
data_test['thal'] = data_test['thal'].astype('category')
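
The two casting cells repeat the same assignments for each frame. One way to avoid the duplication is sketched below, assuming a hypothetical helper `cast_types` that is not in the original notebook; it relies on `DataFrame.astype` accepting a column-to-dtype dictionary.

```python
import pandas as pd

CATEGORICAL = ["sex", "cp", "fbs", "restecg", "exang",
               "slope", "ca", "thal", "label"]

def cast_types(df, categorical=CATEGORICAL):
    """Cast the categorical columns that exist in df to 'category'."""
    present = [c for c in categorical if c in df.columns]
    return df.astype({c: "category" for c in present})

# Toy frames: the test frame has no 'label' column, as in the notebook
toy_train = pd.DataFrame({"age": [51, 54], "sex": [1.0, 0.0], "label": [0.0, 2.0]})
toy_test = toy_train.drop(columns="label")

typed_train = cast_types(toy_train)
typed_test = cast_types(toy_test)
print(typed_train.dtypes)
```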

#### TREATMENT OF NaN VALUES.

In [28]:
total_filas_train = len(data_train)
total_filas_test = len(data_test)

# Compute the percentage of NaN per column
porcentaje_nan_por_columna_train = (data_train.isna().sum() / total_filas_train) * 100
porcentaje_nan_por_columna_test = (data_test.isna().sum() / total_filas_test) * 100
print("Percentage of null values per column.")
print("Train:")
print("---------------------")
print(porcentaje_nan_por_columna_train)
print("\nTest:")
print("---------------------")
print(porcentaje_nan_por_columna_test)

Percentage of null values per column.
Train:
---------------------
age          0.000000
sex          0.000000
cp           0.000000
trestbps     6.557377
chol        21.311475
fbs          9.016393
restecg      0.000000
thalach      6.010929
exang        6.010929
oldpeak      6.693989
slope       33.743169
ca          65.437158
thal        51.775956
label        0.000000
dtype: float64

Test:
---------------------
age          0.000000
sex          0.000000
cp           0.000000
trestbps     6.521739
chol        24.456522
fbs         12.500000
restecg      1.086957
thalach      5.978261
exang        5.978261
oldpeak      7.065217
slope       33.152174
ca          70.108696
thal        56.521739
dtype: float64


We explore the option of dropping the columns 'ca' and 'thal', since they have the largest share of NaN values (above 40%).

In [29]:
data_train_v2 = data_train.drop(['ca', 'thal'], axis=1)
data_test_v2 = data_test.drop(['ca', 'thal'], axis=1)
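
Instead of naming the columns by hand, the drop list could be derived from the NaN share and a threshold. The following is a minimal sketch on toy data; the 40% threshold mirrors the figure quoted above, and the `THRESHOLD` name is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [51, 54, 63, 52, 55],
    "chol": [213.0, np.nan, 237.0, 217.0, 250.0],  # 20% missing
    "ca": [1.0, np.nan, np.nan, np.nan, 0.0],      # 60% missing
})

# isna().mean() gives the fraction of NaN per column
nan_share = toy.isna().mean() * 100

THRESHOLD = 40.0  # percent; assumed cut-off
to_drop = nan_share[nan_share > THRESHOLD].index.tolist()
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Computing the list on the training frame and applying it to both frames keeps their columns aligned.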

In [30]:
data_train_v2.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,label
0,51,1.0,1.0,125.0,213.0,0.0,2.0,125.0,1.0,1.4,1.0,0.0
1,54,1.0,3.0,120.0,237.0,0.0,0.0,150.0,1.0,1.5,,2.0
2,63,1.0,4.0,140.0,,,2.0,149.0,0.0,2.0,1.0,2.0
3,52,0.0,2.0,140.0,,0.0,0.0,140.0,0.0,0.0,,0.0
4,55,1.0,4.0,140.0,217.0,0.0,0.0,111.0,1.0,5.6,3.0,3.0


In [31]:
data_test_v2.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope
0,57,1.0,4.0,156.0,173.0,0.0,2.0,119.0,1.0,3.0,3.0
1,52,1.0,2.0,160.0,196.0,0.0,0.0,165.0,0.0,0.0,
2,48,1.0,2.0,100.0,,0.0,0.0,100.0,0.0,0.0,
3,62,1.0,4.0,115.0,,,0.0,128.0,1.0,2.5,3.0
4,51,1.0,3.0,110.0,175.0,0.0,0.0,123.0,0.0,0.6,1.0


In [32]:
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [33]:
print("Null values per column:")
print("--------------------------")
print(data_train_v2.isnull().sum())

Null values per column:
--------------------------
age           0
sex           0
cp            0
trestbps     48
chol        156
fbs          66
restecg       0
thalach      44
exang        44
oldpeak      49
slope       247
label         0
dtype: int64


First, we handle the NaN values with KNN imputation.

In [34]:
imputer = KNNImputer(n_neighbors=5)
scaler = StandardScaler()

data_train_v2_scaled = data_train_v2.copy()
label = data_train_v2_scaled['label']

# Standardize only the continuous columns (age is left unscaled)
num_cols = ['trestbps', 'chol', 'thalach', 'oldpeak']
X_numeric = data_train_v2[num_cols]

X_scaled = scaler.fit_transform(X_numeric)
data_train_v2_scaled = data_train_v2_scaled.reset_index()
data_train_v2_scaled.loc[:, num_cols] = X_scaled

# Impute over all columns; the helper 'index' column and 'label' also take part in the neighbor search
X_imputed = imputer.fit_transform(data_train_v2_scaled)
X_imputed_df = pd.DataFrame(X_imputed, columns=data_train_v2_scaled.columns)
X_imputed_df.drop('index', axis=1, inplace=True)

In [35]:
X_imputed_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,label
0,51.0,1.0,1.0,-0.386893,-0.563639,0.0,2.0,-0.506166,1.0,0.466433,1.0,0.0
1,54.0,1.0,3.0,-0.656763,-0.163673,0.0,0.0,0.457426,1.0,0.556349,1.8,2.0
2,63.0,1.0,4.0,0.422718,0.149634,0.0,2.0,0.418883,0.0,1.005931,1.0,2.0
3,52.0,0.0,2.0,0.422718,-0.556973,0.0,0.0,0.071989,0.0,-0.792396,2.2,0.0
4,55.0,1.0,4.0,0.422718,-0.496978,0.0,0.0,-1.045778,1.0,4.242918,3.0,3.0
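
In the preview above, 'slope' takes values such as 1.8 and 2.2, because `KNNImputer` averages the neighbors' values and the result need not be a valid category code. One possible alternative for categorical columns, not used in this notebook, is to fill with the most frequent code via `SimpleImputer`. The sketch uses toy data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "slope": [1.0, 2.0, 2.0, np.nan, 3.0],
    "fbs": [0.0, 0.0, np.nan, 1.0, 0.0],
})

# Each NaN is replaced by the most frequent value of its column
mode_imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(mode_imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The filled values stay inside the set of valid codes, at the cost of ignoring the other features.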


In [36]:
print("Null values per column:")
print("--------------------------")
print(X_imputed_df.isnull().sum())

Null values per column:
--------------------------
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
label       0
dtype: int64


In [37]:
data_test_v2_scaled = data_test_v2.copy()

num_cols_test = ['trestbps', 'chol', 'thalach', 'oldpeak']
X_numeric_test = data_test_v2_scaled[num_cols_test]

# The scaler is refit here on the test set rather than reusing the statistics learned from the training set
X_scaled_test = scaler.fit_transform(X_numeric_test)
data_test_v2_scaled = data_test_v2_scaled.reset_index()
data_test_v2_scaled.loc[:, num_cols_test] = X_scaled_test

# The imputer is likewise refit on the test set; the helper 'index' column takes part in the neighbor search
X_imputed_test = imputer.fit_transform(data_test_v2_scaled)
X_imputed_test_df = pd.DataFrame(X_imputed_test, columns=data_test_v2_scaled.columns)
X_imputed_test_df.drop('index', axis=1, inplace=True)

X_imputed_test_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope
0,57.0,1.0,4.0,1.256224,-1.407453,0.0,2.0,-0.62828,1.0,2.162858,3.0
1,52.0,1.0,2.0,1.47289,-0.967521,0.0,0.0,1.161217,0.0,-0.861252,1.8
2,48.0,1.0,2.0,-1.777105,-0.956044,0.0,0.0,-1.36742,0.0,-0.861252,1.4
3,62.0,1.0,4.0,-0.964606,-0.642353,0.2,0.0,-0.278161,1.0,1.658839,3.0
4,51.0,1.0,3.0,-1.235439,-1.369198,0.0,0.0,-0.472672,0.0,-0.25643,1.0


In [38]:
print("Null values per column:")
print("--------------------------")
print(X_imputed_test_df.isnull().sum())

Null values per column:
--------------------------
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
dtype: int64
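
A common alternative to fitting the scaler and imputer separately on each dataset is to fit them once on the training features and only transform the test features, leaving the label out. The sketch below illustrates that pattern with a scikit-learn pipeline on synthetic data. It is an assumption about how the preprocessing could be arranged, not the notebook's procedure.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = ["trestbps", "chol", "thalach", "oldpeak"]

# Synthetic stand-ins for the continuous train/test features (no label column)
toy_train = pd.DataFrame(rng.normal(size=(30, 4)), columns=cols)
toy_test = pd.DataFrame(rng.normal(size=(10, 4)), columns=cols)
toy_train.iloc[::5, 1] = np.nan
toy_test.iloc[::4, 2] = np.nan

# StandardScaler ignores NaN when fitting and keeps it when transforming
prep = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=5))
train_ready = prep.fit_transform(toy_train)  # statistics learned from train only
test_ready = prep.transform(toy_test)        # same statistics reused on test
print(train_ready.shape, test_ready.shape)
```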


#### MODEL TRAINING.

##### 01. BASIC RANDOM FOREST

In [39]:
from sklearn.ensemble import RandomForestClassifier

X_train = X_imputed_df.drop('label', axis=1)
y_train = X_imputed_df['label']

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, min_samples_split=20)

# Train the model
rf_model.fit(X_train, y_train)

In [40]:
y_pred = rf_model.predict(X_imputed_test_df)

df = pd.DataFrame({'ID': range(len(y_pred)), 'label': y_pred})
df['label'] = df['label'].astype(int)
df.to_csv('result.csv', index=False)

The score obtained is 0.5213.
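
Since the test set has no labels, a local estimate of model quality can be obtained with cross-validation on the training data. The sketch below uses synthetic data as a stand-in for `X_train` and `y_train`, and assumes accuracy as the metric; the metric behind the score quoted above is not stated in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in (11 features, 3 classes); not the notebook's data
X_toy, y_toy = make_classification(
    n_samples=200, n_features=11, n_informative=5, n_classes=3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42, min_samples_split=20)

# 5-fold cross-validation; accuracy is an assumed metric
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
print(scores.mean())
```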

##### 02. XGBoost

Before using this algorithm, we binarize the label: 0 means 'healthy' and 1 means 'sick'.

In [41]:
xgb_data_train = X_imputed_df.copy()
xgb_data_test = X_imputed_test_df.copy()

xgb_data_train['label'] = xgb_data_train['label'].apply(lambda x: 0 if x == 0 else 1)

X_train = xgb_data_train.drop('label', axis=1)
y_train = xgb_data_train['label']
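
The binarization can also be expressed as a vectorized comparison instead of an element-wise `apply`. This is a minimal sketch on toy labels.

```python
import pandas as pd

labels = pd.Series([0.0, 2.0, 2.0, 0.0, 3.0])  # toy multiclass labels

# Any non-zero class is mapped to 1 ('sick'); class 0 stays 0 ('healthy')
binary = (labels != 0).astype(int)
print(binary.tolist())
```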

In [42]:
import xgboost as xgb

# Build DMatrix objects, the data structure XGBoost is optimized for
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_imputed_test_df)

# Model parameters
# These are basic settings; they may need tuning for a specific case
params = {
    'objective': 'binary:logistic',  # Change this for a multiclass target
    'max_depth': 6,  # Maximum depth of each tree
    'min_child_weight': 1,  # Minimum sum of instance weights required in a child node
    'subsample': 0.8,  # Fraction of rows sampled per tree; can help reduce overfitting
    'colsample_bytree': 0.8,  # Fraction of features sampled per tree, comparable to max_features in RandomForest
    'learning_rate': 0.1,  # Learning rate
    'eval_metric': 'logloss',  # Evaluation metric for binary classification
    'seed': 42  # Seed for reproducibility
}

# Train the model; with xgb.train the number of trees is set by num_boost_round, not by an 'n_estimators' parameter
model_xgb = xgb.train(params, dtrain, num_boost_round=100)

# Predict on the test set
y_pred_xgb = model_xgb.predict(dtest)
y_pred_xgb = (y_pred_xgb > 0.5).astype(int)  # Convert probabilities to binary classes

# Build the output DataFrame
df_xgb = pd.DataFrame({'ID': range(len(y_pred_xgb)), 'label': y_pred_xgb})
df_xgb['label'] = df_xgb['label'].astype(int)
df_xgb.to_csv('result_xgb.csv', index=False)



In [43]:
df_xgb.head()

Unnamed: 0,ID,label
0,0,1
1,1,0
2,2,0
3,3,1
4,4,1


In [44]:
df.head()

Unnamed: 0,ID,label
0,0,3
1,1,0
2,2,0
3,3,1
4,4,0


In [45]:
import numpy as np

df_final = df.copy()
# Where XGBoost predicts healthy (0), keep 0; otherwise use the Random Forest class
df_final['label'] = np.where(df_xgb['label'] == 0, df_xgb['label'], df['label'])

df_final

Unnamed: 0,ID,label
0,0,3
1,1,0
2,2,0
3,3,1
4,4,0
...,...,...
179,179,2
180,180,0
181,181,0
182,182,0


In [46]:
df_final.to_csv('result_final.csv', index=False)
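
The combination rule used for `df_final` can be checked on a few hand-made predictions. The sketch below uses toy arrays (`rf_labels`, `xgb_binary` are illustrative names) to show how the binary XGBoost output overrides the multiclass Random Forest output.

```python
import numpy as np
import pandas as pd

rf_labels = pd.Series([3, 0, 2, 1, 0])   # toy multiclass predictions
xgb_binary = pd.Series([1, 0, 0, 1, 1])  # toy binary predictions (0 = healthy)

# Healthy according to XGBoost -> 0; otherwise keep the Random Forest class
combined = np.where(xgb_binary == 0, 0, rf_labels)
print(combined.tolist())  # [3, 0, 0, 1, 0]
```

In the last position XGBoost predicts 'sick' while the Random Forest predicts class 0, so the combined label stays 0: the rule only ever overrides towards 'healthy'.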