Erdős Institute Data Science Boot Camp, Spring 2023.
🎉 Awarded 🏆1st place🏆 among 33 projects by a panel of data science professionals
- View our 5-minute recorded presentation
Inequity in the quality of healthcare based on social identity, such as race or gender, is a long-standing issue in medicine. While this issue has increasingly received well-deserved attention, there is still a long way to go. One manifestation of this problem is the inaccuracy of blood oxygen measurements in people of color when using a pulse oximeter. As stated by the PhysioNet team[1]:
Pulse oximeters are medical devices used to assess peripheral arterial oxygen saturation ($SpO_2$) noninvasively. In contrast, the "gold standard" requires arterial blood to be drawn to measure the arterial oxygen saturation ($SaO_2$). Pulse oximetry inaccuracies can fail to detect episodes of hidden hypoxemia, i.e., low $SaO_2$ with high $SpO_2$. Hidden hypoxemias can result in less treatment and increased mortality. Yet flawed, pulse oximeters remain ubiquitously used because of their ease of use; debiasing the underlying algorithms could alleviate the downstream repercussions of hidden hypoxemia.
We tackle this problem by developing two models, one to predict
Our classification model correctly predicts which individuals have hidden hypoxemia in 7 out of 10 cases in our test dataset. Our regression model to predict arterial blood oxygen saturation (
Our data show that rates of hidden hypoxemia are higher for people of color.
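As a rough illustration of how such per-group rates could be computed, here is a minimal sketch in plain Python. The field names (`race_group`, `SaO2`, `SpO2`) and the 88% cutoffs are assumptions made for this example only; the thresholds and column names actually used in the notebooks may differ.

```python
from collections import defaultdict


def is_hidden_hypoxemia(sao2, spo2, sao2_low=88.0, spo2_high=88.0):
    """Flag a paired reading as hidden hypoxemia: low SaO2 despite high SpO2.

    The default cutoffs are illustrative assumptions, not values taken
    from the project.
    """
    return sao2 < sao2_low and spo2 >= spo2_high


def hidden_hypoxemia_rates(records, group_key="race_group"):
    """Return the fraction of readings flagged as hidden hypoxemia per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for rec in records:
        tally = counts[rec[group_key]]
        tally[0] += is_hidden_hypoxemia(rec["SaO2"], rec["SpO2"])
        tally[1] += 1
    return {group: flagged / total for group, (flagged, total) in counts.items()}


# Made-up readings purely for illustration (not from the dataset).
records = [
    {"race_group": "A", "SaO2": 85.0, "SpO2": 93.0},  # hidden hypoxemia
    {"race_group": "A", "SaO2": 96.0, "SpO2": 97.0},
    {"race_group": "B", "SaO2": 94.0, "SpO2": 95.0},
    {"race_group": "B", "SaO2": 86.0, "SpO2": 86.0},  # low on both, so not hidden
]
rates = hidden_hypoxemia_rates(records)
```

Keeping the cutoffs as parameters makes it easy to check how sensitive the group comparison is to the chosen definition.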
Our models are built using XGBoost. The data starts with some erroneous
For the
Hidden hypoxemia is a rare event, accounting for only 1.6% of the patients in our dataset. Building a classifier that accurately finds these patients without overestimating their prevalence is difficult. For the HH Classification Model, we used an ensemble of gradient boosted forest classifiers created using XGBoost. The ensemble is trained on random undersamplings of the dataset in order to counter the imbalance in classes and artificially boost the prevalence of hidden hypoxemia in training.
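The undersampling idea can be sketched as follows. This is not the code from `Classifier_Undersampling.ipynb`: the function names, the 1:1 default class ratio (`neg_per_pos`), the ensemble size, and the probability averaging are all assumptions chosen for illustration. The sketch is model-agnostic and takes a hypothetical `make_model` factory that returns any classifier with `fit` and `predict_proba` methods.

```python
import numpy as np


def undersample_indices(y, rng, neg_per_pos=1.0):
    """Keep every positive and a random subset of negatives.

    neg_per_pos is the number of negatives kept per positive
    (1.0 gives balanced classes; this ratio is an assumption).
    """
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = min(len(neg), int(round(neg_per_pos * len(pos))))
    keep_neg = rng.choice(neg, size=n_neg, replace=False)
    idx = np.concatenate([pos, keep_neg])
    rng.shuffle(idx)
    return idx


def fit_undersampled_ensemble(X, y, make_model, n_models=10, seed=0):
    """Fit n_models classifiers, each on a different random undersampling."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    models = []
    for i in range(n_models):
        idx = undersample_indices(y, rng)
        model = make_model(i)  # i can be used as the member's random seed
        model.fit(X[idx], y[idx])
        models.append(model)
    return models


def ensemble_predict_proba(models, X):
    """Average the positive-class probability across ensemble members."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

With XGBoost installed, one way to use this would be `fit_undersampled_ensemble(X, y, make_model=lambda seed: xgboost.XGBClassifier(random_state=seed))`. Because each member sees an artificially high prevalence during training, the averaged probabilities tend to run high, so a decision threshold would normally be chosen on held-out data with the original class balance.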
Running the code requires the following Python packages:
XGBoost, scikit-learn, pandas, NumPy, Matplotlib
- Exploratory data analysis: EDA_Oximetry_Data_MIMIC-IV.ipynb
- $SaO_2$ Regression Model: SaO2_Regression_Model.ipynb file under the Notebooks folder.
- Hidden Hypoxemia Classifier: Classifier_Undersampling.ipynb
The data is publicly available from PhysioNet. We have not provided the data here because accessing it requires completion of online CITI training modules.
You can find information about completing training and then accessing the data at the bottom of the data access page: PhysioNet pulse oximetry correction dataset.
We've deployed a web app showcasing a minimal-feature version of our model. Here is an example run: