# TRUSTWORTHY AI: Data Collection and Metadata

#### Script to adapt the data to the TAI study

The aim of this script is to prepare the necessary metadata and, more broadly, to simplify the "Heart Disease" dataset so that it can be used in a trustworthy AI (TAI) study.

We start by preparing the working environment and importing the necessary libraries.

In [125]:
import pandas as pd
import numpy as np
import json

To simplify the study, we will make certain assumptions.

In [126]:
# Load the Heart Disease dataset
path = 'C:\\Users\\carlo\\OneDrive - UPV\\ESCRITORIO CARLOS\\UPV\\BECA COLABORACIÓN\\Datasets\\Heart Disease\\' 
file_name = 'heart_statlog_cleveland_hungary_final.csv'
data = pd.read_csv(path+file_name)
data.columns = data.columns.str.replace(' ', '_')
data.head()

Unnamed: 0,age,sex,chest_pain_type,resting_bp_s,cholesterol,fasting_blood_sugar,resting_ecg,max_heart_rate,exercise_angina,oldpeak,ST_slope,target
0,40,1,2,140,289,0,0,172,0,0.0,1,0
1,49,0,3,160,180,0,0,156,0,1.0,2,1
2,37,1,2,130,283,0,1,98,0,0.0,1,0
3,48,0,4,138,214,0,0,108,1,1.5,2,1
4,54,1,3,150,195,0,0,122,0,0.0,1,0


To clarify the features, we map the integer codes of each nominal variable to their descriptive string labels. We keep the data in this raw, unencoded form so that the most appropriate encoding method can be chosen later.

In [127]:
print(data['sex'].value_counts())

map_dict_sex = {
    0: 'female', 
    1: 'male',
}

data['sex'] = data['sex'].replace(map_dict_sex)

print("\n",data['sex'].value_counts())

sex
1    909
0    281
Name: count, dtype: int64

 sex
male      909
female    281
Name: count, dtype: int64
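The same code-to-label pattern is repeated for every nominal variable, and `replace` silently leaves any code that is missing from the dictionary untouched. As a minimal sketch (the helper `map_codes` is hypothetical and not part of the original notebook), the mapping could be wrapped in a function that fails loudly when it meets an unmapped code:

```python
import pandas as pd

def map_codes(series, mapping):
    """Map integer codes to labels, raising if a code has no label.

    Hypothetical helper, shown only to illustrate the idea.
    """
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped codes in '{series.name}': {sorted(unknown)}")
    return series.map(mapping)

# Toy series standing in for data['sex']
sex_labels = map_codes(pd.Series([1, 0, 1], name="sex"), {0: "female", 1: "male"})
```

With a check like this, an unexpected code is reported at mapping time instead of surviving as a stray integer among string labels.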


In [128]:
print(data['chest_pain_type'].value_counts())

map_dict_pain = {
    1: 'typical angina', 
    2: 'atypical angina',
    3: 'non-anginal pain',
    4: 'asymptomatic',
}

data['chest_pain_type'] = data['chest_pain_type'].replace(map_dict_pain)

print("\n",data['chest_pain_type'].value_counts())

chest_pain_type
4    625
3    283
2    216
1     66
Name: count, dtype: int64

 chest_pain_type
asymptomatic        625
non-anginal pain    283
atypical angina     216
typical angina       66
Name: count, dtype: int64


In [129]:
print(data['resting_ecg'].value_counts())

map_dict_ecg = {
    0: 'normal', 
    1: 'ST-T abnormality', 
    2: 'LV hypertrophy',
}

data['resting_ecg'] = data['resting_ecg'].replace(map_dict_ecg)

print("\n",data['resting_ecg'].value_counts())

resting_ecg
0    684
2    325
1    181
Name: count, dtype: int64

 resting_ecg
normal              684
LV hypertrophy      325
ST-T abnormality    181
Name: count, dtype: int64


In [130]:
print(data['ST_slope'].value_counts())

data = data[data['ST_slope'] != 0].copy() # Omit the single record coded 0, which has no label in the mapping below
map_dict_slope = {
    1: 'upsloping', 
    2: 'flat', 
    3: 'downsloping',
}

data['ST_slope'] = data['ST_slope'].replace(map_dict_slope)

print("\n",data['ST_slope'].value_counts())

ST_slope
2    582
1    526
3     81
0      1
Name: count, dtype: int64

 ST_slope
flat           582
upsloping      526
downsloping     81
Name: count, dtype: int64
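Keeping the nominal variables as strings leaves the encoding decision open. The following is a sketch of two common options on a toy data frame, not the encoding used in the study: the rows are invented for illustration, and the ordering assumed for `ST_slope` is only an assumption.

```python
import pandas as pd

# Toy rows (hypothetical); column names follow the notebook.
toy = pd.DataFrame({
    "chest_pain_type": ["asymptomatic", "typical angina", "atypical angina", "asymptomatic"],
    "ST_slope": ["flat", "upsloping", "downsloping", "flat"],
})

# Option A: one-hot encoding, suitable when no order is implied.
one_hot = pd.get_dummies(toy["chest_pain_type"], prefix="chest_pain_type")

# Option B: ordinal codes, assuming this explicit order is meaningful.
slope_order = ["upsloping", "flat", "downsloping"]
slope_codes = pd.Categorical(toy["ST_slope"], categories=slope_order, ordered=True).codes
```

One-hot encoding avoids imposing an artificial order on a variable such as `chest_pain_type`, at the cost of one extra column per category.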


In [131]:
logic = data['resting_bp_s']==0
print(logic.value_counts(), '\n')
data = data[data['resting_bp_s'] != 0].copy() # Omit records with a resting blood pressure of 0

logic = data['cholesterol']==0
print(logic.value_counts(), '\n')
data['cholesterol'] = data['cholesterol'].replace(0, np.nan) # Treat a cholesterol of 0 as a missing value

resting_bp_s
False    1188
True        1
Name: count, dtype: int64 

cholesterol
False    1017
True      171
Name: count, dtype: int64 
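The two zero-valued variables are handled differently: the row with a resting blood pressure of 0 is dropped, while a cholesterol of 0 is kept as a missing value. A minimal sketch of the same logic on a toy data frame (the values are invented for illustration) makes the effect easier to inspect:

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical); column names follow the notebook.
toy = pd.DataFrame({
    "resting_bp_s": [140, 0, 130, 150],
    "cholesterol": [289, 0, 0, 214],
})

# Count zeros per column before deciding how to handle them.
zero_counts = (toy[["resting_bp_s", "cholesterol"]] == 0).sum()

# Drop rows with a zero blood pressure, working on an explicit copy.
cleaned = toy[toy["resting_bp_s"] != 0].copy()

# Mark zero cholesterol as missing instead of dropping the row.
cleaned["cholesterol"] = cleaned["cholesterol"].replace(0, np.nan)

# Share of missing cholesterol values among the remaining rows.
missing_share = cleaned["cholesterol"].isna().mean()
```

Reporting the share of missing values helps decide later whether to impute the variable or exclude it.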



In [132]:
file_name = 'dataset_simplified.csv'
data.to_csv(path + file_name, index=False)

print(f"The data frame has been saved to {file_name}.")

The data frame has been saved to dataset_simplified.csv.
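Because the simplified dataset now contains string labels and missing values, it can be worth confirming that both survive a CSV round trip. The sketch below uses an in-memory buffer and a toy data frame instead of the real file, so the names and values are illustrative only:

```python
import io

import numpy as np
import pandas as pd

# Toy frame with one string column and one column containing NaN.
toy = pd.DataFrame({"sex": ["male", "female"], "cholesterol": [289.0, np.nan]})

# Write without the index, as in the cell above, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Here `reloaded` has the same columns as `toy`, and the missing cholesterol value is read back as `NaN`.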


In [133]:
# Define Metadata
metadata = {
    "output": "target",
    "positive_class": 1,
    "feat_id": "",
    "feat_sensitive": ["sex"],
    "feat_types": {
        "age": "numerical",
        "sex": "categorical",
        "chest_pain_type": "categorical",
        "resting_bp_s": "numerical",
        "cholesterol": "numerical",
        "fasting_blood_sugar": "categorical",
        "resting_ecg": "categorical",
        "max_heart_rate":"numerical",
        "exercise_angina":"categorical",
        "oldpeak":"numerical",
        "ST_slope":"categorical",
    },
    "feat2balance": ["sex"],
    # Text information and variable with the provenance
    "data_provenance": ["The dataset consists of 1190 records of patients from US, UK, Switzerland and Hungary.",""],
    # Text information and variable with the acquisition date
    "acquisition_date": ["Empty",""]
}

file_name = 'metadata.json'

# Save metadata to a JSON file at the specified path
with open(path + file_name, 'w') as json_file:
    json.dump(metadata, json_file, indent=4)

print(f"The Metadata has been saved to {file_name}.")

The Metadata has been saved to metadata.json.
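Since the metadata is written by hand, it can drift from the actual dataset columns. The following sketch checks a metadata dictionary against a list of column names; the function `check_metadata` and the toy inputs are hypothetical, and only the key names (`output`, `feat_types`, `feat_sensitive`, `feat2balance`) are taken from the cell above.

```python
import json

def check_metadata(metadata, columns):
    """Return a list of inconsistencies between metadata and dataset columns."""
    columns = set(columns)
    typed = set(metadata["feat_types"])
    problems = []
    if metadata["output"] not in columns:
        problems.append(f"output column '{metadata['output']}' not in dataset")
    for feat in sorted(typed - columns):
        problems.append(f"feature '{feat}' has a type but is not in dataset")
    for feat in sorted(columns - typed - {metadata["output"]}):
        problems.append(f"column '{feat}' has no declared type")
    for key in ("feat_sensitive", "feat2balance"):
        for feat in metadata[key]:
            if feat not in typed:
                problems.append(f"{key} entry '{feat}' has no declared type")
    return problems

# Toy metadata round-tripped through JSON, as it would look after loading a file.
toy_metadata = json.loads(json.dumps({
    "output": "target",
    "feat_sensitive": ["sex"],
    "feat2balance": ["sex"],
    "feat_types": {"age": "numerical", "sex": "categorical"},
}))

ok = check_metadata(toy_metadata, ["age", "sex", "target"])
bad = check_metadata(toy_metadata, ["age", "target", "oldpeak"])
```

An empty list means the metadata and the columns agree; in the second call, the missing `sex` column and the untyped `oldpeak` column are both reported.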


&copy; 2024 Carlos de Manuel & Carlos Sáez - Universitat Politècnica de València 2024
