Correcting Racial Bias in Pulse Oximeter Measurements of Blood Oxygen Saturation

Erdős Institute Data Science Boot Camp, Spring 2023.

🎉 Awarded 🏆1st place🏆 among 33 projects by a panel of data science professionals

View our 5-minute recorded presentation

Team Members:

Project Description

Inequity in the quality of healthcare based on social identity, such as race or gender, is a long-standing issue in medicine. While this issue has increasingly received well-deserved attention, there is still a long way to go. One manifestation of this problem is the inaccuracy of blood oxygen measurements in people of color when using a pulse oximeter. As stated by the PhysioNet team[1]:

Pulse oximeters are medical devices used to assess peripheral arterial oxygen saturation ($SpO_2$) noninvasively. In contrast, the "gold standard" requires arterial blood to be drawn to measure the arterial oxygen saturation ($SaO_2$). Pulse oximetry inaccuracies can fail to detect episodes of hidden hypoxemia, i.e., low $SaO_2$ with high $SpO_2$. Hidden hypoxemias can result in less treatment and increased mortality. Yet flawed, pulse oximeters remain ubiquitously used because of their ease of use; debiasing the underlying algorithms could alleviate the downstream repercussions of hidden hypoxemia.

We tackle this problem by developing two models, one to predict $SaO_2$ and one to predict Hidden Hypoxemia, using features that do not require a blood draw. Our models are trained on a publicly available dataset of de-identified medical records from ≈80,000 patient visits at Beth Israel Deaconess Medical Center in Boston, MA, between 2008 and 2019.

Our classification model correctly identifies individuals with hidden hypoxemia in 7 out of 10 cases in our test dataset. Our regression model to predict arterial blood oxygen saturation ($SaO_2$) also provides a better estimate of blood oxygenation, outperforming the current medical standard (oximeter reading alone) by 30% for patients with hypoxemia.

Evidence of Racial Bias

From our data, we can see that rates of Hidden Hypoxemia are higher for people of color.
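As a minimal sketch of how such rates could be tabulated, the snippet below labels hidden hypoxemia and computes its rate per group. It rests on several assumptions that are not stated in this README: the threshold ($SaO_2$ below 88% while $SpO_2$ is at or above 88%), the column names `spo2`, `sao2`, and `group`, and the toy rows, which are synthetic and not taken from the dataset.

```python
import pandas as pd

# Assumed threshold in percent; this README does not state the cutoff used.
HYPOXEMIA_THRESHOLD = 88.0


def label_hidden_hypoxemia(df, threshold=HYPOXEMIA_THRESHOLD):
    """Return a boolean Series: low arterial saturation that the oximeter missed."""
    return (df["sao2"] < threshold) & (df["spo2"] >= threshold)


# Synthetic rows for illustration only (hypothetical column names and values).
toy = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "spo2": [97, 95, 92, 86, 96, 93, 91, 99],
    "sao2": [96, 94, 87, 85, 87, 86, 92, 98],
})
toy["hidden_hypoxemia"] = label_hidden_hypoxemia(toy)

# The mean of a boolean column is the fraction of True values in each group.
rates = toy.groupby("group")["hidden_hypoxemia"].mean()
print(rates)
```

Note that the fourth row (both readings below the threshold) is ordinary hypoxemia that the oximeter did detect, so it is not counted as hidden.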

Model and Data Details

Our models are built using XGBoost. The raw data contains some erroneous $SpO_2$ and $SaO_2$ values, and some of the features have NaN entries. We clean the data by removing the erroneous records and using Scikit-Learn's IterativeImputer to impute the NaN entries.
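The cleaning step might look roughly like the sketch below. It is not the notebook code: the rule that "erroneous" means a missing saturation value or one outside (0, 100] percent is an assumption, as are the column names and the toy values. Only `IterativeImputer` itself is named in this README.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental in Scikit-Learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def clean_and_impute(df, feature_cols, seed=0):
    """Drop rows with implausible saturation values, then impute missing features."""
    # Assumption: a valid saturation lies in (0, 100] percent; NaN fails both tests.
    valid = (
        (df["spo2"] > 0) & (df["spo2"] <= 100)
        & (df["sao2"] > 0) & (df["sao2"] <= 100)
    )
    cleaned = df.loc[valid].copy()

    # Each feature with gaps is modelled from the other features in turn.
    imputer = IterativeImputer(random_state=seed, max_iter=10)
    cleaned[feature_cols] = imputer.fit_transform(cleaned[feature_cols])
    return cleaned


# Synthetic rows for illustration only (hypothetical columns and values).
toy = pd.DataFrame({
    "spo2": [97.0, 95.0, 120.0, 92.0, 99.0, 94.0, 96.0, 98.0],
    "sao2": [96.0, 93.0, 95.0, 0.0, 98.0, 91.0, 95.0, 97.0],
    "heart_rate": [80.0, np.nan, 75.0, 90.0, 110.0, 95.0, 72.0, 88.0],
    "temperature": [36.8, 37.2, 37.0, np.nan, 38.1, np.nan, 36.6, 37.4],
})
features = ["heart_rate", "temperature"]
result = clean_and_impute(toy, features)
print(result)
```

In a real pipeline the imputer would be fitted on the training split only and then applied to the test split, so that no information leaks from test to train.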

For the $SaO_2$ Regression Model, we logit-transformed $SpO_2$ and $SaO_2$ values to normalize values for a better regression fit. The dataset has a highly skewed distribution with most $SaO_2$ and $SpO_2$ values lying close to the upper measurement boundary of 100, making naïve regression ineffective. Our model provides estimates that are 30% closer to the actual blood oxygenation for patients with hypoxemia than the current medical standard of only using pulse oximetry measures.
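A minimal sketch of the transform, assuming saturation is first rescaled to a proportion $p = SpO_2/100$ and then mapped to $z = \log\left(p/(1-p)\right)$. Because a reading of exactly 100 would map to infinity, the sketch clips $p$ away from 0 and 1; the clipping margin `EPS` and the function names are illustrative assumptions, not values from the notebooks.

```python
import numpy as np

# Assumed clipping margin so that a reading of exactly 100% stays finite.
EPS = 1e-3


def saturation_to_logit(percent, eps=EPS):
    """Map saturation in percent to the logit scale, clipping to avoid infinities."""
    p = np.clip(np.asarray(percent, dtype=float) / 100.0, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


def logit_to_saturation(z):
    """Inverse transform: logit scale back to saturation in percent."""
    return 100.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


readings = np.array([85.0, 92.0, 97.0, 99.0, 100.0])
z = saturation_to_logit(readings)
recovered = logit_to_saturation(z)
print(z)
print(recovered)
```

The transform stretches the crowded region near 100, so equal steps in percent correspond to larger steps in $z$ the closer the reading is to the upper boundary. Predictions made on the logit scale are mapped back to percent with the inverse function.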

Hidden hypoxemia is a rare event, accounting for only 1.6% of the patients in our dataset. Building a classifier that accurately finds these patients without overestimating their prevalence is difficult. For the HH Classification Model, we used an ensemble of gradient boosted forest classifiers created using XGBoost. The ensemble is trained on random undersamplings of the dataset in order to counter the imbalance in classes and artificially boost the prevalence of hidden hypoxemia in training.

Pipeline for the Classification Model

Dependencies

Running the code requires the following Python packages:

XGBoost, Scikit-Learn, Pandas, Numpy, Matplotlib

Interfacing with Models and Data Analyses

Exploratory data analysis: EDA_Oximetry_Data_MIMIC-IV.ipynb
$SaO_2$ Regression Model: SaO2_Regression_Model.ipynb file under the Notebooks folder.
Hidden Hypoxemia Classifier: Classifier_Undersampling.ipynb

Data Access

The data is publicly available from PhysioNet. We have not provided the data here because accessing it requires completion of online CITI training modules.

You can find information about completing training and then accessing the data at the bottom of the data access page: PhysioNet pulse oximetry correction dataset.

Applying the Classifier

We've deployed a web app showcasing a minimal feature version of our model. Here is an example run:
