# TRUSTWORTHY AI: Data Collection and Metadata

#### Script to fit the data into the TAI 

The aim of this script is to prepare the necessary metadata and, more broadly, to simplify the dataset "Diabetes 130 US hospitals for years 1999-2008" so that it can be used in the study of trustworthy AI (TAI).

We start by preparing the working environment and importing the necessary libraries.

In [13]:
import pandas as pd
import numpy as np
import json

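To simplify the study, we make several assumptions, which are introduced step by step below. First, we load the dataset and encode missing values, which are marked with '?', as NaN.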

In [14]:
# Load the diabetes dataset
path = 'C:\\Users\\carlo\\OneDrive - UPV\\ESCRITORIO CARLOS\\UPV\\BECA COLABORACIÓN\\Datasets\\Diabetes\\' 
file_name = 'dataset_diabetes.xlsx'
data = pd.read_excel(path + file_name)

data.replace('?', np.nan, inplace=True)  # inplace=True applies the change directly to the DataFrame
data.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
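
Once '?' has been replaced by NaN, the share of missing values per column can be inspected directly. The following is a minimal sketch on a toy frame; the values are made up for illustration and the names `toy`, `toy_clean` and `missing_share` are not part of the original script.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2, 2],
})

# Same replacement as in the cell above, but returning a new frame
toy_clean = toy.replace("?", np.nan)

# Fraction of missing values per column, highest first
missing_share = toy_clean.isna().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly missing, as 'weight' is in this toy example, may deserve separate treatment before modelling.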


First, we delete the features we consider unnecessary, namely those that describe dosage changes for the individual medicines. Any variation in dosage is already reflected in the variable 'change'.

In [15]:
data.drop(data.columns[22:47], axis=1, inplace=True)
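
A positional slice such as `data.columns[22:47]` depends on the exact column order, so any extra or missing column shifts what gets dropped. A possible alternative, sketched here on a toy frame, is to drop by name. The list `dosage_cols` below is hypothetical and deliberately partial; it only uses names visible in the preview above and would have to be completed and checked against `data.columns`.

```python
import pandas as pd

# Toy frame with a few of the columns visible in the preview above
toy = pd.DataFrame({
    "encounter_id": [1, 2],
    "insulin": ["No", "Up"],
    "glyburide-metformin": ["No", "No"],
    "change": ["No", "Ch"],
    "readmitted": ["NO", ">30"],
})

# Hypothetical, partial list of dosage columns (an assumption for illustration)
dosage_cols = ["insulin", "glyburide-metformin", "metformin-pioglitazone"]

# Keep only the names that actually exist, so an absent column does not raise
present = [c for c in dosage_cols if c in toy.columns]
toy_reduced = toy.drop(columns=present)
print(list(toy_reduced.columns))
```

Dropping by name also documents in the code itself which features were removed.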

The dataset includes multiple rows with the same 'patient_nbr', which may introduce correlation between instances that belong to the same patient. We assume that 'time_in_hospital' is a relevant feature for readmission and keep, for each patient, only the row with the maximum time in hospital.

In [16]:
idx = data.groupby('patient_nbr')['time_in_hospital'].idxmax()
data = data.loc[idx]
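
To illustrate what this selection does, here is a small sketch on made-up data (the frame `toy` and its values are assumptions for illustration). Note that `idxmax` returns the first row when several encounters of a patient share the maximum.

```python
import pandas as pd

# Toy encounters: patient 1 has two visits, patient 3 has a tie
toy = pd.DataFrame({
    "encounter_id": [10, 11, 12, 13, 14],
    "patient_nbr": [1, 1, 2, 3, 3],
    "time_in_hospital": [2, 5, 4, 3, 3],
})

# Index label of the longest stay per patient (first one on ties)
idx = toy.groupby("patient_nbr")["time_in_hospital"].idxmax()
one_per_patient = toy.loc[idx]
print(one_per_patient)

# After the selection every patient appears exactly once
assert one_per_patient["patient_nbr"].is_unique
```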

Looking at the sensitive variable 'race', we notice that some categories have very few instances. We therefore simplify the feature by merging 'Asian' and 'Hispanic' into the existing category 'Other'.

In [17]:
print(data['race'].value_counts())
data['race'] = data['race'].replace(['Hispanic', 'Asian'], 'Other')
print('\n',data['race'].value_counts())

race
Caucasian          53513
AfricanAmerican    12903
Hispanic            1512
Other               1170
Asian                500
Name: count, dtype: int64

 race
Caucasian          53513
AfricanAmerican    12903
Other               3182
Name: count, dtype: int64
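
The same grouping could also be driven by a frequency threshold instead of a hand-picked list. The sketch below assumes a hypothetical helper `group_rare` and an arbitrary threshold; neither is part of the original script, and the toy counts are made up.

```python
import pandas as pd

def group_rare(series, min_share, other_label="Other"):
    """Replace categories whose relative frequency is below min_share."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; missing values stay missing
    return series.where(~series.isin(rare), other_label)

# Toy column: 60% Caucasian, 20% AfricanAmerican, 10% Hispanic, 10% Asian
race = pd.Series(["Caucasian"] * 6 + ["AfricanAmerican"] * 2 + ["Hispanic", "Asian"])

grouped = group_rare(race, min_share=0.15)
print(grouped.value_counts())
```

Whether a threshold or an explicit list is preferable depends on how stable the category counts are expected to be; for a sensitive variable an explicit list is easier to audit.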


Similarly, the features 'discharge_disposition_id', 'admission_type_id', and 'admission_source_id' have many levels, which can be grouped into a few broader categories.

In [18]:
print(data['discharge_disposition_id'].value_counts())

# Drop 'discharge_disposition_id' codes associated with 'Expired'
data = data[~data['discharge_disposition_id'].isin([11, 19, 20, 21])]

# Define 'discharge_disposition_id' map function
def map_discharge_category(val):
    if val in [1, 6, 8]:
        return 'Home'
    elif val in [2, 3, 4, 5, 10, 22, 23, 24, 30, 27, 29]:
        return 'Transferred'
    elif val in [7, 12]:
        return 'Others'
    elif val in [13, 14, 15, 16, 17, 28]:
        return 'Hospice'
    elif val in [18, 25, 26]:
        return 'Unknown'
    # Codes not listed above (e.g. 9) fall through and are returned as None
    return None

data['discharge_disposition_id'] = data['discharge_disposition_id'].apply(map_discharge_category)
print('\n',data['discharge_disposition_id'].value_counts())


discharge_disposition_id
1     43092
3      9447
6      8785
18     2476
2      1465
22     1370
11     1266
5       882
25      674
4       546
7       380
23      302
13      295
14      266
8        81
28       78
15       40
24       27
9        12
17       10
10        6
19        6
16        5
27        4
12        2
20        1
Name: count, dtype: int64

 discharge_disposition_id
Home           51958
Transferred    14049
Unknown         3150
Hospice          694
Others           382
Name: count, dtype: int64
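
In the raw counts above, code 9 appears 12 times but belongs to none of the lists in `map_discharge_category`, so those rows end up without a category. A lookup table combined with `Series.map` makes such gaps easy to report. This is a sketch on toy codes, not the original implementation; `groups`, `code_to_group` and `toy_codes` are names introduced here.

```python
import pandas as pd

# Same grouping as map_discharge_category, written as a lookup table
groups = {
    "Home": [1, 6, 8],
    "Transferred": [2, 3, 4, 5, 10, 22, 23, 24, 30, 27, 29],
    "Others": [7, 12],
    "Hospice": [13, 14, 15, 16, 17, 28],
    "Unknown": [18, 25, 26],
}
code_to_group = {code: name for name, codes in groups.items() for code in codes}

# Toy codes (illustrative), including the unlisted code 9
toy_codes = pd.Series([1, 3, 9, 18, 6, 9])
mapped = toy_codes.map(code_to_group)

# Codes absent from the lookup table become NaN; list them explicitly
unmapped = sorted(toy_codes[mapped.isna()].unique())
print(mapped.value_counts(dropna=False))
print("Unmapped codes:", unmapped)
```

Reporting the unmapped codes makes it an explicit decision whether to assign them to a category or to drop the rows.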


In [19]:
print(data['admission_source_id'].value_counts())

# Define 'admission_source_id' map function
def map_admission_source_category(val):
    if val in [1, 2, 3]:
        return 'Referral'
    elif val in [4, 5, 6, 10, 18, 19, 22, 25, 26]:
        return 'Transfer'
    elif val in [7, 8]:
        return 'Others'
    elif val in [9, 15, 17, 20, 21]:
        return 'Unknown'
    elif val in [11, 12, 13, 14, 23, 24]:
        return 'Birth'

data['admission_source_id'] = data['admission_source_id'].apply(map_admission_source_category)

print("\n",data['admission_source_id'].value_counts())

# Drop the few rows whose admission source was grouped as 'Birth'
data = data[data['admission_source_id'] != 'Birth']

admission_source_id
7     37618
1     21476
17     4819
4      2666
6      1797
2       894
5       572
20      146
3       127
9        94
8        12
22        9
10        8
14        2
11        2
25        2
13        1
Name: count, dtype: int64

 admission_source_id
Others      37630
Referral    22497
Unknown      5059
Transfer     5054
Birth           5
Name: count, dtype: int64


In [20]:
print(data['admission_type_id'].value_counts())

# Drop 'admission_type_id' rows associated with 'Newborn' and 'Trauma Center'
data = data[~data['admission_type_id'].isin([4, 7])]

map_dict = {
    1: 'Emergency', 
    2: 'Urgent',  
    3: 'Elective',  
    5: 'Unknown',  # Not Available
    6: 'Unknown',  # NULL
    8: 'Unknown'  # Not Mapped
}

data['admission_type_id'] = data['admission_type_id'].replace(map_dict)

print("\n",data['admission_type_id'].value_counts())

admission_type_id
1    35871
3    13795
2    12932
6     4385
5     2936
8      295
7       18
4        8
Name: count, dtype: int64

 admission_type_id
Emergency    35871
Elective     13795
Urgent       12932
Unknown       7616
Name: count, dtype: int64
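
The cell above relies on `replace`, which leaves any code that is missing from the dictionary unchanged; this is why the rows with codes 4 and 7 are dropped first. The sketch below, on toy codes chosen for illustration, contrasts this with `Series.map`, which turns unlisted codes into NaN.

```python
import pandas as pd

map_dict = {1: "Emergency", 2: "Urgent", 3: "Elective",
            5: "Unknown", 6: "Unknown", 8: "Unknown"}

# Toy codes; 7 is intentionally absent from the dictionary
toy_codes = pd.Series([1, 3, 6, 7])

# replace: codes missing from the dictionary are left as they are
replaced = toy_codes.replace(map_dict)
# map: codes missing from the dictionary become NaN
mapped = toy_codes.map(map_dict)

print(replaced.tolist())
print(mapped.tolist())
```

With `replace`, a forgotten code silently survives as a number among the labels, whereas with `map` it shows up as a missing value that can be counted.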


In [21]:
data.rename(columns={'admission_type_id': 'admission_type',
                     'discharge_disposition_id': 'discharge_disposition',
                     'admission_source_id': 'admission_source'},
            inplace=True)


We take the midpoint of each age interval in order to convert the categorical feature into a numerical one.

In [22]:
print(data['age'].value_counts())

map_dict_age = {
    '[0-10)': 5,
    '[10-20)': 15,
    '[20-30)': 25,
    '[30-40)': 35,
    '[40-50)': 45,
    '[50-60)': 55,
    '[60-70)': 65,
    '[70-80)': 75,
    '[80-90)': 85,
    '[90-100)': 95
}

data['age'] = data['age'].replace(map_dict_age)

print("\n",data['age'].value_counts())

age
[70-80)     17801
[60-70)     15685
[50-60)     12310
[80-90)     11330
[40-50)      6772
[30-40)      2671
[90-100)     1852
[20-30)      1115
[10-20)       526
[0-10)        152
Name: count, dtype: int64

 age
75    17801
65    15685
55    12310
85    11330
45     6772
35     2671
95     1852
25     1115
15      526
5       152
Name: count, dtype: int64
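
The midpoints can also be derived from the labels themselves rather than typed by hand. The sketch below assumes a hypothetical helper `interval_midpoint` and that every label follows the pattern `[low-high)` seen in the output above.

```python
import re

def interval_midpoint(label):
    """Return the midpoint of a label such as '[70-80)'."""
    match = re.fullmatch(r"\[(\d+)-(\d+)\)", label)
    if match is None:
        # Fail loudly on labels that do not follow the expected pattern
        raise ValueError(f"Unexpected age label: {label!r}")
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) // 2

labels = ['[0-10)', '[10-20)', '[70-80)', '[90-100)']
derived_map = {label: interval_midpoint(label) for label in labels}
print(derived_map)
```

A dictionary built this way can be compared against `map_dict_age` as a consistency check.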


After these pre-processing steps, we save the dataset so that the pipeline steps can start from it.

In [23]:
file_name = 'dataset_simplified.csv'
data.to_csv(path + file_name, index=False)

print(f"The data frame has been saved to {file_name}.")

The data frame has been saved to dataset_simplified.csv.


In [24]:
# Define Metadata
metadata = {
    "output": "readmitted",
    "positive_class": "<30",
    "feat_id": ["encounter_id", "patient_nbr"],
    "feat_sensitive": ["race", "gender"],
    "feat_types": {
        "race": "categorical",
        "gender": "categorical",
        "age": "numerical",
        "weight": "categorical",
        "admission_type": "categorical",
        "discharge_disposition": "categorical",
        "admission_source": "categorical",
        "time_in_hospital": "numerical",
        "payer_code": "categorical",
        "medical_specialty": "categorical",
        "num_lab_procedures": "numerical",
        "num_procedures": "numerical",
        "num_medications": "numerical",
        "number_outpatient": "numerical",
        "number_emergency": "numerical",
        "number_inpatient": "numerical",
        "diag_1": "categorical",
        "diag_2": "categorical",
        "diag_3": "categorical",
        "number_diagnoses": "numerical",
        "change": "categorical",
        "diabetesMed": "categorical"
    },
    "feat2balance": ["race"],
    # Text information and variable with the provenance
    "data_provenance": ["A Health Facts database that represents 10 years (1999-2008) of clinical care at 130 hospitals in United States.","admission_type"],
    # Text information and variable with the acquisition date
    "acquisition_date": ["Empty",""]
}

file_name = 'metadata.json'

# Save metadata to a JSON file at the specified path
with open(path + file_name, 'w') as json_file:
    json.dump(metadata, json_file, indent=4)

print(f"The Metadata has been saved to {file_name}.")

The Metadata has been saved to metadata.json.
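
Before the metadata is handed to later steps, it can be cross-checked against the saved data frame. The function `check_metadata` below is a hypothetical helper, not part of the original script; it assumes the metadata keys used above and is demonstrated on a toy frame with made-up values.

```python
import pandas as pd

def check_metadata(df, metadata):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    output = metadata["output"]
    expected = set(metadata["feat_types"]) | set(metadata["feat_id"]) | {output}

    missing = sorted(expected - set(df.columns))
    if missing:
        problems.append(f"Columns missing from the data: {missing}")

    extra = sorted(set(df.columns) - expected)
    if extra:
        problems.append(f"Columns not described in the metadata: {extra}")

    if output in df.columns and metadata["positive_class"] not in set(df[output]):
        problems.append("positive_class does not occur in the output column")

    for col in metadata["feat_sensitive"] + metadata["feat2balance"]:
        if col not in metadata["feat_types"]:
            problems.append(f"'{col}' has no entry in feat_types")
    return problems

# Toy data in which the 'gender' column is deliberately absent
toy = pd.DataFrame({
    "encounter_id": [1, 2],
    "patient_nbr": [10, 11],
    "race": ["Caucasian", "Other"],
    "age": [55, 75],
    "readmitted": ["NO", "<30"],
})
toy_metadata = {
    "output": "readmitted",
    "positive_class": "<30",
    "feat_id": ["encounter_id", "patient_nbr"],
    "feat_sensitive": ["race", "gender"],
    "feat_types": {"race": "categorical", "gender": "categorical", "age": "numerical"},
    "feat2balance": ["race"],
}

problems = check_metadata(toy, toy_metadata)
print(problems)
```

Run against the real `data` and `metadata`, an empty list would indicate that every described feature is present and that no column is left undescribed.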


© 2024 Carlos de Manuel & Carlos Sáez - Universitat Politècnica de València
