# TRUSTWORTHY AI: Data Collection and Metadata

#### Script to fit the data into the TAI 

The aim of this script is to prepare the necessary metadata and, more importantly, to simplify the "Heart Disease" dataset so that it can be used in the study of trustworthy AI.

We start by preparing the working environment and importing the necessary libraries.

In [1]:
import pandas as pd
import numpy as np

To simplify the study, we will make certain assumptions.

In [2]:
# Load the Heart Disease dataset
path = 'C:\\Users\\carlo\\OneDrive - UPV\\ESCRITORIO CARLOS\\UPV\\BECA COLABORACIÓN\\Datasets\\Heart Disease\\' 
file_name = 'heart_statlog_cleveland_hungary_final.csv'
data = pd.read_csv(path+file_name)

data.head()

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
0,40,1,2,140,289,0,0,172,0,0.0,1,0
1,49,0,3,160,180,0,0,156,0,1.0,2,1
2,37,1,2,130,283,0,1,98,0,0.0,1,0
3,48,0,4,138,214,0,0,108,1,1.5,2,1
4,54,1,3,150,195,0,0,122,0,0.0,1,0


To clarify the features, we map the integer codes of each nominal variable to their descriptive string labels. We keep the data as unencoded labels so that the most appropriate encoding method can be chosen later.

In [3]:
print(data['sex'].value_counts())

map_dict_sex = {
    0: 'female', 
    1: 'male',
}

data['sex'] = data['sex'].replace(map_dict_sex)

print("\n",data['sex'].value_counts())

sex
1    909
0    281
Name: count, dtype: int64

 sex
male      909
female    281
Name: count, dtype: int64


In [4]:
print(data['chest pain type'].value_counts())

map_dict_pain = {
    1: 'typical angina', 
    2: 'atypical angina',
    3: 'non-anginal pain',
    4: 'asymptomatic',
}

data['chest pain type'] = data['chest pain type'].replace(map_dict_pain)

print("\n",data['chest pain type'].value_counts())

chest pain type
4    625
3    283
2    216
1     66
Name: count, dtype: int64

 chest pain type
asymptomatic        625
non-anginal pain    283
atypical angina     216
typical angina       66
Name: count, dtype: int64


In [5]:
print(data['resting ecg'].value_counts())

map_dict_ecg = {
    0: 'normal', 
    1: 'ST-T abnormality', 
    2: 'left ventricular hypertrophy',
}

data['resting ecg'] = data['resting ecg'].replace(map_dict_ecg)

print("\n",data['resting ecg'].value_counts())

resting ecg
0    684
2    325
1    181
Name: count, dtype: int64

 resting ecg
normal                          684
left ventricular hypertrophy    325
ST-T abnormality                181
Name: count, dtype: int64


In [6]:
print(data['ST slope'].value_counts())

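# Drop the single record with the unmapped slope code 0.
# .copy() avoids a SettingWithCopyWarning on the assignment below.
data = data[data['ST slope'] != 0].copy()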
map_dict_slope = {
    1: 'upsloping', 
    2: 'flat', 
    3: 'downsloping',
}

data['ST slope'] = data['ST slope'].replace(map_dict_slope)

print("\n",data['ST slope'].value_counts())

ST slope
2    582
1    526
3     81
0      1
Name: count, dtype: int64

 ST slope
flat           582
upsloping      526
downsloping     81
Name: count, dtype: int64
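The four mapping cells above follow the same pattern, so they can be expressed as one loop. The following is a minimal sketch, not the method used in this notebook: `apply_label_maps` is a hypothetical helper, and the toy frame stands in for the real data. It uses `Series.map` and raises an error when a column contains a code missing from its dictionary (such as the `ST slope` value 0), instead of silently leaving that code as an integer.

```python
import pandas as pd


def apply_label_maps(df, label_maps):
    """Return a copy of df with integer codes replaced by string labels.

    Hypothetical helper for illustration; raises ValueError if a column
    contains a code that its mapping does not cover.
    """
    out = df.copy()
    for column, mapping in label_maps.items():
        unknown = set(out[column].unique().tolist()) - set(mapping)
        if unknown:
            raise ValueError(f"Unmapped codes in {column!r}: {sorted(unknown)}")
        out[column] = out[column].map(mapping)
    return out


# Toy data standing in for the real dataset (assumption for the example).
toy = pd.DataFrame({"sex": [1, 0, 1], "ST slope": [1, 2, 3]})
label_maps = {
    "sex": {0: "female", 1: "male"},
    "ST slope": {1: "upsloping", 2: "flat", 3: "downsloping"},
}

labelled = apply_label_maps(toy, label_maps)
print(labelled)
```

Failing loudly on an unmapped code makes the decision to drop or relabel such records explicit, which helps when the preprocessing has to be documented.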


In [7]:
file_name = 'dataset_heart_disease_full.xlsx'
data.to_excel(path + file_name, index=False)

print(f"The data frame has been saved to {file_name}.")

The data frame has been saved to dataset_heart_disease_full.xlsx.


In [8]:
# Define Metadata
dataset = data
output = "target"
positive_class = ""
feat_id = ""
feat_sensitive = ["sex"]
feat_types = {
    "age": "numerical",
    "sex": "categorical",
    "chest pain type": "categorical",
    "resting bp s": "numerical",
    "cholesterol": "numerical",
    "fasting blood sugar": "categorical",
    "resting ecg": "categorical",
    "max heart rate": "numerical",
    "exercise angina": "categorical",
    "oldpeak": "numerical",
    "ST slope": "categorical",
}
feat2balance = ["sex"]
data_provenance = "The dataset consists of 1190 records of patients from US, UK, Switzerland and Hungary."
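Because the metadata is written by hand, it is easy for it to drift from the actual columns. The sketch below shows one possible consistency check. The function `check_metadata` and the allowed type names are assumptions made for this example, not part of any TAI library. It works on a plain list of column names, so in the notebook it could be called with `list(data.columns)`.

```python
def check_metadata(columns, output, feat_types, feat_sensitive, feat2balance):
    """Return a list of human-readable problems; an empty list means consistent.

    Hypothetical helper: the allowed type names are an assumption based on
    the values used in feat_types above.
    """
    allowed_types = {"numerical", "categorical"}
    problems = []

    if output not in columns:
        problems.append(f"output column {output!r} is not in the dataset")

    # Every non-output column should have a declared type.
    for column in columns:
        if column != output and column not in feat_types:
            problems.append(f"column {column!r} has no entry in feat_types")

    # Every declared feature should exist and use a known type.
    for name, kind in feat_types.items():
        if name not in columns:
            problems.append(f"feat_types entry {name!r} is not a dataset column")
        if kind not in allowed_types:
            problems.append(f"feature {name!r} has unknown type {kind!r}")

    # Sensitive and balancing features must be declared features.
    for name in list(feat_sensitive) + list(feat2balance):
        if name not in feat_types:
            problems.append(f"feature {name!r} is not declared in feat_types")

    return problems


# Column names as they appear in feat_types above, plus the output column.
example_types = {
    "age": "numerical", "sex": "categorical", "chest pain type": "categorical",
    "resting bp s": "numerical", "cholesterol": "numerical",
    "fasting blood sugar": "categorical", "resting ecg": "categorical",
    "max heart rate": "numerical", "exercise angina": "categorical",
    "oldpeak": "numerical", "ST slope": "categorical",
}
example_columns = list(example_types) + ["target"]

problems = check_metadata(example_columns, "target", example_types, ["sex"], ["sex"])
print(problems)  # prints [] when the metadata matches the columns
```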