---
title: Exploring Societal Inequity's Effect on (Model-Perceived) Health Outcomes 
author: Sophie Seiple, Julia Joy, Lindsey Schweitzer
date: '2024-04-18'
description: "Final Project Blog Post"
bibliography: refs.bib
format: html
---


## Abstract

## Introduction
For our project, we aim to explore the relationship between diseases and social factors such as sex, race, and town, and how these may reflect societal and environmental inequities. Our approach is to identify the most accurate predictive model for our dataset, then use this model to generate risk likelihood scores and evaluate the relationship between different diseases and characteristics indicative of societal inequalities. We will then analyze what these scores imply about inequitable, identity-based risk factors in health outcomes and complications. Our project consists of three documents: one in which we clean our original data, one in which we explore this data visually, and this final one, in which we build and explore our models.

Building on the general trends we observed in our data visualization document, we carried out the second half of our project: building a model that predicts risk scores. By comparing these risk scores, we wanted to see whether trends emerged across socioeconomic status (for which we use town of residence as a proxy), race, gender, and ethnicity.

## Values Statement
The motivation behind our project was to uncover potential inequities in the manifestations of certain conditions. For example, does a person's race or socioeconomic status predispose them to certain conditions more than others? Our goal was to identify societal and environmental factors that unjustly or disproportionately contribute to disparities in health outcomes. This focus stems from a desire to understand and address those inequities, and from our personal commitments to promoting equity and social justice in healthcare.

The primary potential users of our project would include researchers, policymakers, and public health organizations interested in understanding and addressing health inequities. However, the project's findings and potential implications could also affect the communities we study, especially those that we find experience disparities in health outcomes due to social determinants. 

If researchers and health professionals were to take our research out of context and read it as a study of biological predisposition rather than of the manifestation of social factors, our results could reinforce assumptions about health outcomes by race and ethnicity in the medical field, reinforcing harmful stereotypes or leading to further marginalization of certain groups. Additionally, if the data or models have inherent biases, they could perpetuate or amplify existing disparities.

With proper usage and implementation, though, we hope our results would positively impact public health programs and initiatives that focus on preventative measures in the most at-risk communities. By shedding light on health inequities, we hope our results would make it easier to identify the communities in which to center efforts and awareness campaigns.

## Material & Methods

### Our Data 

Our project utilizes a synthetic data set created for an Introduction to Biomedical Data Science Textbook. The data was created using Synthea, a synthetic patient generator that models the medical history of synthetic patients. Synthea’s mission is “to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare.” This allowed for much easier access than real patient data, as well as alleviating any privacy concerns that would arise from using real patient data. 
The link to the data can be found here: https://data.world/siyeh/synthetic-medical-data. 

Our dataset was originally quite large, with over 200 million entries. After thorough data cleaning and preprocessing, the data was transformed into multiple CSV files, generally formatted so that each row represents a different patient, with one-hot-encoded values for multiple disease conditions.
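As a rough illustration of the one-row-per-patient layout described above, the sketch below reshapes long-format condition records into one-hot columns. The column names (`patient_id`, `condition`) and the toy rows are hypothetical, and this is not necessarily how our cleaning document performs the step.

```python
import pandas as pd

# Hypothetical long-format records: one row per (patient, condition) pair.
# Column names and values are illustrative, not taken from the real dataset.
records = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "condition": ["diabetes", "asthma", "asthma",
                  "diabetes", "hypertension", "asthma"],
})

# Cross-tabulate patients against conditions, then clip counts to 0/1 so that
# repeated diagnoses of the same condition still encode as a single 1.
one_hot = pd.crosstab(records["patient_id"], records["condition"]).clip(upper=1)

print(one_hot)
```

Each row of `one_hot` is a patient and each column is a condition flag, which is the shape a classifier expects for a binary target.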

While using synthetic data has its benefits, it is essential to acknowledge certain inherent limitations. Despite efforts to create diverse and representative synthetic patients, there may still be discrepancies in how accurately certain demographic groups or medical conditions are represented. Rare or uncommon medical conditions may be underrepresented in the dataset due to the limitations of the modeling and analysis processes. This data is generated to represent patients from Massachusetts, so any generalization of results must proceed with caution. Thus, while this synthetic dataset serves as a valuable resource for educational purposes, researchers and practitioners should approach its use with an understanding of its limitations.

After cleaning our data, we performed exploratory data analysis in order to visualize our dataset, the results of which are found in [this extension of our materials and methods section](google.com).


### Our Approach 

Since our original dataset was quite large, a thorough process of data cleaning and preprocessing was needed, as well as an evaluation of which features of our data should be used as predictors for our models. We subset our data into different CSV files, each corresponding to a different type of condition. This allowed our models to be trained more efficiently and made the results easier to interpret. [This extension of our materials and methods section shows our data cleaning process in more depth.](google.com)

For each condition group, we trained multiple models: a logistic regression model, a decision tree classifier, a random forest classifier, and a support vector machine. Each model was then scored using cross-validation on that condition group. The best model, i.e. the one returning the highest cross-validated accuracy, was chosen as the predictive model for our general risk scores. We then trained this model, which ended up being ??????(WHAT DID IT END UP BEING), on our training dataset and created predictions for our testing data that represent the probability of each entry being 1 (having a certain condition) or 0 (not having a certain condition). A risk score can therefore be anything between 0.00 and 1.00, where 0.50 represents a 50% probability that the given patient has the condition. The models ran on our own personal devices, on the ML-0451 class kernel.
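The selection procedure described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions: the data is a stand-in generated by `make_classification` rather than our patient features, and the hyperparameters, fold count, and split size are placeholders, not the settings used in our model document.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): in the project, X would hold demographic
# predictors and y a 0/1 flag for one condition group.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The four model families named in the text; settings are illustrative.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(probability=True, random_state=0),  # probability=True enables predict_proba
}

# Mean 5-fold cross-validated accuracy on the training split for each candidate.
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}

# Keep the candidate with the highest cross-validated accuracy and refit it.
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name].fit(X_train, y_train)

# Column 1 of predict_proba is the predicted probability of class 1,
# which is what the text calls a risk score.
risk_scores = best_model.predict_proba(X_test)[:, 1]
print(best_name, risk_scores[:5])
```

Selecting on cross-validated accuracy over the training split keeps the test split untouched until the final risk scores are generated.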

### Critical Discussion
The goal of our presentation is to analyze the bias present in our healthcare system and the risk that certain groups face for different illnesses and health conditions. There are many organizations that might find this type of model useful or interesting. One interested party could be a hospital that wants to allocate resources based on the communities it serves, which would let it adapt to real community needs. A similar use case could be a town that is building or allocating healthcare resources and wants to understand the risks within its locality. Hopefully, this model could help direct resources to the places where the need is greatest. However, it is important to note that this dataset measures rates recorded in a hospital setting. These could vary widely from real illness rates, as certain communities are under-treated or under-diagnosed in the US healthcare system.

A more harmful use case could be an insurance company that incorporates this model into its decisions about whether to cover individuals. In the wrong hands, our model therefore risks having a negative impact on already marginalized communities. Seeing as insurance companies are incredibly wealthy, it is plausible that one could be the body financing a project like this. This raises the question of whether this model should be allowed to be employed in decision-making scenarios.

We completed this work out of curiosity as part of an educational pursuit. If used for knowledge or understanding of the impact of different illnesses and conditions on identity groups, it can be helpful and informative. However, there is also the risk of further harming groups that have already been historically marginalized in medicine.

## Concluding Discussion

Our project accomplished our goal of analyzing risk rates for various illnesses and conditions across different identities. Due to the large quantity of data we possessed, we were unable to use all of the data we had access to when making predictions. Ideally, we would have been able to predict medication use or various observations in addition to specific conditions. Also, if we had access to more data, we could have made more specific predictions, such as for asthma instead of general lung ailments. If we had more time, computational resources, and data, we would like to extend our study to include healthcare information for different conditions as well as geographical regions outside of Massachusetts. By broadening the range of data we include, we would be able to come to more concrete conclusions on different risk rates. Nevertheless, we achieved what we set out to do in this project by generating risk rates by race, gender, ethnicity, birthplace, and current address for five different ailments.

Our results are consistent with those of others who have studied similar problems. For example, there is a large body of scientific evidence showing that people of lower socioeconomic status are more likely to develop diabetes [(source)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4603875/#:~:text=For%20example%2C%209.0%20%25%20of%20those,%2480%2C000%20per%20year%20had%20diabetes.). Furthermore, race has been strongly connected to maternal mortality and health. Specifically, Black and Hispanic women are at much higher risk of issues with pregnancy than their white counterparts [(source)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7290488/). We saw both of these trends, and more, replicated in our model's predictions. This suggests that our model's predictions are correlated with real-life trends.

## Results

The first extension of our results section is [this document](google.com), in which we run our model and generate the risk scores and comparisons we discuss in further depth here.
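To make concrete what a per-group risk score could mean, the sketch below averages per-patient predicted probabilities within each group. This is an assumption about the aggregation, not a description of the exact computation in our model document; the column names and group labels (`race`, `risk_score`, `"A"`, `"B"`) are hypothetical and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-patient predictions: a group label plus a predicted
# probability of having the condition. Values are illustrative only.
predictions = pd.DataFrame({
    "race": ["A", "A", "B", "B", "B"],
    "risk_score": [0.2, 0.4, 0.6, 0.9, 0.3],
})

# Average the predicted probabilities within each group, then sort so the
# highest-risk group appears first.
group_risk = (
    predictions.groupby("race")["risk_score"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)

print(group_risk)
```

The same pattern would apply to any grouping column, such as ethnicity or gender.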

```python
# Re-entering the results we found in our model document for visualization
import pandas as pd

# Diabetes results by race
diarace_data = {
    'Race': ['Asian', 'Hispanic', 'White', 'Black'],
    'Risk Score': [0.479391, 0.326581, 0.316872, 0.238107]
}

diatable_race = pd.DataFrame(diarace_data)

# Diabetes results by ethnicity
diaeth_data = {
    'Ethnicity': ['Asian Indian', 'Polish', 'German', 'Mexican', 'American',
                  'Portuguese', 'English', 'Scottish', 'Italian', 'Dominican',
                  'Puerto Rican', 'African', 'Central American', 'French',
                  'French Canadian', 'Swedish', 'Chinese', 'Russian', 'Irish',
                  'West Indian'],
    'Risk Score': [0.714991, 0.582585, 0.501140, 0.428841, 0.416128,
                   0.395835, 0.371874, 0.341331, 0.319902, 0.312293,
                   0.311334, 0.292001, 0.290537, 0.276922, 0.258344,
                   0.257813, 0.243792, 0.201778, 0.187961, 0.002464]
}

diatable_eth = pd.DataFrame(diaeth_data)

# Diabetes results by gender
diagen_data = {
    'Gender': ['Male', 'Female'],
    'Risk Score': [0.268108, 0.369897]
}

diatable_gen = pd.DataFrame(diagen_data)
```

```python
diatable_race
```

|   | Race | Risk Score |
|---|------|------------|
| 0 | Asian | 0.479858 |
| 1 | Hispanic | 0.323339 |
| 2 | White | 0.315055 |
| 3 | Black | 0.242059 |


```python
diatable_eth
```

|   | Ethnicity | Risk Score |
|---|-----------|------------|
| 0 | Asian Indian | 0.714991 |
| 1 | Polish | 0.582585 |
| 2 | German | 0.501140 |
| 3 | Mexican | 0.428841 |
| 4 | American | 0.416128 |
| 5 | Portuguese | 0.395835 |
| 6 | English | 0.371874 |
| 7 | Scottish | 0.341331 |
| 8 | Italian | 0.319902 |
| 9 | Dominican | 0.312293 |


TODO: discuss how ethnicities are self-reported.

## Group Contributions
