# Create `True_withMeta.csv` Notebook
This notebook combines test data from the MIMIC, CXP, and NIH datasets with demographic metadata to create a unified file, `True_withMeta.csv`. Metadata files are loaded for CXP and MIMIC only.
The `Age` and `Sex` columns come directly from `test_df`, while `race` and `insurance` are merged in from the metadata. MIMIC has no `race` column, so its `ethnicity` column is used as `race`.

In [4]:
# Import Libraries
import pandas as pd

## Load Data

In [5]:
# Paths to datasets
test_dataset_path = "./data_mevis/ALLData/preprocessed_test_df_1.csv"
cxp_metadata_path = "./data_mevis/CXP/demographics_CXP.csv"
mimic_metadata_path = "./data_mevis/MIMIC/demographics_MIMIC.csv"

# Load datasets
test_df = pd.read_csv(test_dataset_path)
print("Test dataset loaded:")
display(test_df.head())

cxp_metadata = pd.read_csv(cxp_metadata_path)
print("CXP metadata loaded:")
display(cxp_metadata.head())

mimic_metadata = pd.read_csv(mimic_metadata_path)
print("MIMIC metadata loaded:")
display(mimic_metadata.head())

Test dataset loaded:


,subject_id,Jointpath,Sex,Age,No Finding,Atelectasis,Cardiomegaly,Effusion,Pneumonia,Pneumothorax,Consolidation,Edema
0,3,/bigdata/andrei_thesis/CXP_data/images/patient...,M,40-60,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
1,7,/bigdata/andrei_thesis/CXP_data/images/patient...,M,60-80,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
2,7,/bigdata/andrei_thesis/CXP_data/images/patient...,M,60-80,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,13,/bigdata/andrei_thesis/CXP_data/images/patient...,M,20-40,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13,/bigdata/andrei_thesis/CXP_data/images/patient...,M,20-40,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


CXP metadata loaded:


,subject_id,race,ethnicity,insurance_type
0,42142,White,Non-Hispanic/Non-Latino,Private Insurance
1,4528,White,Non-Hispanic/Non-Latino,Private Insurance
2,55652,White,Non-Hispanic/Non-Latino,Medicare
3,53157,White,Non-Hispanic/Non-Latino,Medicare
4,11162,Asian,Non-Hispanic/Non-Latino,Medicare


MIMIC metadata loaded:


,subject_id,insurance,language,marital_status,ethnicity,gender,anchor_age
0,12427812,Other,ENGLISH,,UNKNOWN,F,35
1,14029832,Other,ENGLISH,,OTHER,F,55
2,14495017,Other,?,,WHITE,M,0
3,13676048,Other,?,MARRIED,WHITE,F,33
4,13831972,Medicaid,ENGLISH,SINGLE,WHITE,F,46
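
Because the metadata is joined on `subject_id` alone, it can be worth checking whether the two sources share any IDs before combining them. The following is a minimal sketch on toy frames; the helper `shared_ids` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def shared_ids(left, right, key="subject_id"):
    """Return the sorted key values that appear in both frames."""
    # .tolist() converts NumPy integers to plain Python ints
    return sorted(set(left[key].tolist()) & set(right[key].tolist()))

# Toy stand-ins for the two metadata tables (made-up overlap on 4528)
toy_cxp = pd.DataFrame({"subject_id": [42142, 4528, 11162]})
toy_mimic = pd.DataFrame({"subject_id": [12427812, 14029832, 4528]})

overlap = shared_ids(toy_cxp, toy_mimic)
print(f"{len(overlap)} shared subject_id value(s): {overlap}")
```

On the real data the same call would be `shared_ids(cxp_metadata, mimic_metadata)`. An empty result would mean the priority rule used when combining metadata never comes into play.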


## Normalize Metadata
Rename columns so that both metadata tables share the same schema (`subject_id`, `race`, `insurance`). `Age` and `Sex` are already present in `test_df`, so only `race` and `insurance` are taken from the metadata.

In [6]:
# Normalize CXP metadata: only insurance_type needs renaming (race is already named race)
cxp_metadata_normalized = cxp_metadata.rename(columns={"insurance_type": "insurance"})

# Normalize MIMIC metadata: treat MIMIC's ethnicity as race (insurance is already named insurance)
mimic_metadata_normalized = mimic_metadata.rename(columns={"ethnicity": "race"})

# Keep only required columns
cxp_metadata_normalized = cxp_metadata_normalized[["subject_id", "race", "insurance"]]
mimic_metadata_normalized = mimic_metadata_normalized[["subject_id", "race", "insurance"]]

print("Normalized CXP metadata:")
display(cxp_metadata_normalized.head())

print("Normalized MIMIC metadata:")
display(mimic_metadata_normalized.head())

Normalized CXP metadata:


,subject_id,race,insurance
0,42142,White,Private Insurance
1,4528,White,Private Insurance
2,55652,White,Medicare
3,53157,White,Medicare
4,11162,Asian,Medicare


Normalized MIMIC metadata:


,subject_id,race,insurance
0,12427812,UNKNOWN,Other
1,14029832,OTHER,Other
2,14495017,WHITE,Other
3,13676048,WHITE,Other
4,13831972,WHITE,Medicaid
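
The previews show that the two sources spell their categories differently (CXP uses `White`, MIMIC uses `WHITE` and `UNKNOWN`). Below is a minimal sketch of one way to align the casing, assuming title-casing is acceptable for downstream use. The helper `harmonize_labels` and the toy frame are hypothetical, and casing alone does not map MIMIC's categories onto CXP's.

```python
import pandas as pd

def harmonize_labels(df, columns=("race", "insurance")):
    """Return a copy with whitespace stripped and labels title-cased."""
    out = df.copy()
    for col in columns:
        # The .str accessor leaves missing values as missing
        out[col] = out[col].str.strip().str.title()
    return out

# Toy rows modelled on the MIMIC preview above
toy_mimic = pd.DataFrame({
    "subject_id": [12427812, 14495017],
    "race": ["UNKNOWN", "WHITE"],
    "insurance": ["Other", "Medicaid"],
})

toy_harmonized = harmonize_labels(toy_mimic)
print(toy_harmonized)
```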


## Combine Metadata
Concatenate normalized metadata and resolve conflicts by prioritizing CXP over MIMIC for overlapping `subject_id`s.

In [8]:
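# Concatenate metadata, CXP first so that its values take priority
all_metadata = pd.concat([cxp_metadata_normalized, mimic_metadata_normalized], ignore_index=True)

# Resolve conflicts by prioritizing CXP over MIMIC:
# groupby().first() keeps the first non-null value per column for each subject_id,
# so CXP values win and MIMIC only fills columns that CXP leaves empty
all_metadata = all_metadata.groupby("subject_id").first().reset_index()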

print("Combined metadata with priorities resolved:")
display(all_metadata.head())

Combined metadata with priorities resolved:


,subject_id,race,insurance
0,1,Other,Medicare
1,2,White,Unknown
2,3,White,Private Insurance
3,4,Black,Medicaid
4,5,White,Private Insurance
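
How `groupby(...).first()` resolves overlaps depends on the concatenation order and on missing values. It returns the first non-null value in each column per group, so the frame listed first in `pd.concat` wins and its gaps are filled from later frames. The sketch below uses made-up rows (the names `high_priority` and `low_priority` and all values are hypothetical) and contrasts this with `drop_duplicates`, which keeps whole rows.

```python
import pandas as pd

# Subject 1 appears in both toy sources; the high-priority row lacks a race value
high_priority = pd.DataFrame(
    {"subject_id": [1], "race": [None], "insurance": ["Medicare"]}
)
low_priority = pd.DataFrame(
    {"subject_id": [1, 2], "race": ["WHITE", "ASIAN"], "insurance": ["Other", "Medicaid"]}
)

# The frame listed first takes precedence
stacked = pd.concat([high_priority, low_priority], ignore_index=True)

# first(): first non-null value per column, so values from both rows can be blended
blended = stacked.groupby("subject_id").first().reset_index()

# drop_duplicates(): keeps the first row per subject_id as a whole, gaps included
whole_row = stacked.drop_duplicates(subset="subject_id", keep="first").reset_index(drop=True)

print(blended)
print(whole_row)
```

With `first()`, subject 1 ends up with the low-priority race and the high-priority insurance. With `drop_duplicates`, the race stays missing. Which behavior is preferable depends on whether mixing sources within one subject is acceptable.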


## Merge with Test Data
Merge the resolved metadata with `test_df` to produce the final output.

In [9]:
# Merge metadata with test_df
final_df = pd.merge(test_df, all_metadata, on="subject_id", how="left")

# Save the result
output_path = "True_withMeta.csv"
final_df.to_csv(output_path, index=False)
print(f"Merged dataset saved to {output_path}.")

Merged dataset saved to True_withMeta.csv.
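
A left merge keeps every test row, but it does not report how many rows actually received metadata. One possible check, sketched on toy frames (all values are hypothetical), uses the `indicator` and `validate` parameters of `pd.merge`:

```python
import pandas as pd

# Toy test rows: subject 7 has two images, subject 99 has no metadata
toy_test = pd.DataFrame({"subject_id": [3, 7, 7, 99], "Sex": ["M", "M", "M", "F"]})
toy_meta = pd.DataFrame({
    "subject_id": [3, 7],
    "race": ["White", "Other"],
    "insurance": ["Private Insurance", "Private Insurance"],
})

checked = pd.merge(
    toy_test,
    toy_meta,
    on="subject_id",
    how="left",
    validate="many_to_one",  # raises MergeError if metadata has duplicate subject_ids
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{unmatched} of {len(checked)} test rows have no metadata")
```

Applied to the notebook's frames, the same arguments could be added to the `pd.merge(test_df, all_metadata, ...)` call, dropping the `_merge` column before saving.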


## Preview Final Output

In [10]:
# Preview the first few rows of the final merged dataset
display(final_df.head())

,subject_id,Jointpath,Sex,Age,No Finding,Atelectasis,Cardiomegaly,Effusion,Pneumonia,Pneumothorax,Consolidation,Edema,race,insurance
0,3,/bigdata/andrei_thesis/CXP_data/images/patient...,M,40-60,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,White,Private Insurance
1,7,/bigdata/andrei_thesis/CXP_data/images/patient...,M,60-80,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,Other,Private Insurance
2,7,/bigdata/andrei_thesis/CXP_data/images/patient...,M,60-80,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,Other,Private Insurance
3,13,/bigdata/andrei_thesis/CXP_data/images/patient...,M,20-40,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Other,Private Insurance
4,13,/bigdata/andrei_thesis/CXP_data/images/patient...,M,20-40,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Other,Private Insurance
