# COGS 108 - Project Proposal

# Names

- Aileen Phuong 
- Arohan Mittal
- Genevieve Selsor 
- Duong Ngo
- Ryan Christian Nalangan

# Research Question

How accessible are mental health resources, measured by the number of therapists and the cost of services, in low-income zip codes/counties compared to more affluent ones in California?



## Background and Prior Work

Mental health care is an essential component of health care, but it remains one of the most neglected areas of the U.S. healthcare system. Every year, millions of people in the U.S. experience mental health conditions ranging from anxiety disorders and depression to PTSD. That experience differs from person to person, and it is often harder for marginalized communities. Communities of lower socioeconomic status are more likely to face barriers to care, including cost, the absence of a provider in their community, being uninsured or underinsured, and long wait times for care or treatment. As a result, people are left without care and support when they need it most, which perpetuates existing health disparities.

Research has repeatedly established that low-income and minority communities face disproportionately more difficulty in accessing mental health care. As reported by the Centers for Disease Control and Prevention (CDC), adults living at or below the national poverty level are more likely to report frequent mental distress, but they are also the least likely to receive adequate treatment.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) The National Alliance on Mental Illness (NAMI) reports that many providers will not accept public insurance programs (e.g. Medicaid) that are often used by low-income populations.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Additionally, a 2018 study by Alegría et al. found disparities in treatment access for Latinx, Asian, and Black communities and highlighted notable barriers to treatment tied to distance and economic factors.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Although these studies indicate widespread inequality, research has mostly focused on broad inequalities or large-scale statistics. Some sources point to neighborhood-level inequalities, especially in urban environments that can differ from community to community, but there does not appear to be much research that compares access at the neighborhood level. Access to care can vary dramatically depending on income level, zoning laws, and clinic location. Our project aims to compare mental health resources in low-income neighborhoods with those in high-income neighborhoods using publicly available data. By investigating variables such as the number of providers and provider acceptance of insurance, this analysis hopes to contribute to a more practical understanding of access to mental health resources at a local level.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

1. <a name="cite_note-1"></a> [^](#cite_ref-1) CDC. (2021). Mental Health Data and Publications. https://www.cdc.gov/mentalhealth/data_publications/index.htm
2. <a name="cite_note-2"></a> [^](#cite_ref-2) NAMI. (2020). Access to Mental Health Care. https://www.nami.org/Advocacy/Policy-Priorities/Access-to-Mental-Health-Care
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Alegría, M., et al. (2018). Disparities in access to mental health treatment: Results from the National Latino and Asian American Study. Psychiatric Services, 59(11)
4. <a name="cite_note-4"></a> [^](#cite_ref-4) APA. (2022). Disparities in Mental Health. https://www.apa.org/monitor/2022/07/disparities-mental-health

# Hypothesis


Our hypothesis is that mental health supports are far less likely to be present in lower-income neighborhoods than in higher-income neighborhoods. In areas with lower median household income, we expect to find fewer licensed mental health providers per capita and lower rates of acceptance of public insurance such as Medicaid. We expect this because studies show that providers tend to cluster in wealthier areas, where patients are more likely to pay out-of-pocket or have private insurance, leaving poorer neighborhoods with fewer services and greater access barriers.
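
As a rough sketch of how this hypothesis could be checked once the data is collected, the snippet below computes providers per 10,000 residents and compares lower- and higher-income areas. All rows are made up for illustration and are not results, and the column names (`median_household_income`, `population`, `provider_count`) are our own placeholders, not names from any of the proposed datasets.

```python
import pandas as pd

# Toy, made-up rows; real values would come from the datasets listed under Data.
areas = pd.DataFrame({
    "zip_code": ["90001", "90210", "92101", "93701", "94027", "95814"],
    "median_household_income": [48000, 150000, 82000, 35000, 250000, 61000],
    "population": [58000, 20000, 40000, 12000, 7000, 11000],
    "provider_count": [12, 60, 45, 3, 20, 9],
})

# Providers per 10,000 residents, so areas of different size are comparable.
areas["providers_per_10k"] = areas["provider_count"] / areas["population"] * 10_000

# Split areas at the median income into a lower and a higher group.
cutoff = areas["median_household_income"].median()
areas["income_group"] = areas["median_household_income"].gt(cutoff).map(
    {False: "lower", True: "higher"}
)
group_means = areas.groupby("income_group")["providers_per_10k"].mean()

# Rank correlation (Pearson correlation of the ranks) between income and density.
rank_corr = areas["median_household_income"].rank().corr(
    areas["providers_per_10k"].rank()
)

print(group_means)
print(rank_corr)
```

A median split is only one possible way to define "lower" and "higher" income; the real analysis may use finer income bins or a regression instead.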

# Data

**Our ideal dataset:**

- Possible variables to determine high and low income areas would be median household income or home values in different zip codes/counties. For mental health resources, we would use data on how many mental health therapists and mental health facilities are available in each zip code/county, as well as the health insurance available in each zip code/county.

- If we use Zillow to determine the average cost of homes in different zip codes, we would get roughly 1,700 data points that would then be matched to the roughly 1,700 data points on therapists and health insurance plans in those zip codes. However, because smaller counties or zip codes may not appear in every source, we will most likely aim for a dataset of at least 1,000 data points. Ideally, we would like to have data for each zip code or county to see widespread trends across the state.

- The data for determining the income level of each zip code would be from public data on Zillow and government census of household income and poverty status. The data on Zillow is provided through a csv file, while the government census is obtainable through their API. The number of therapists per zip code would have to be web scraped and checked for duplicates. The health insurance by zip code is provided through a csv file that can be more easily sorted. The dataset of mental health facilities in CA is stored in a csv file that includes zip codes and counties.

- We would store our data in separate data frames depending on where the information came from. Using the Python library pandas, we can join datasets on zip code to combine them. Ideally we would have two tables: one with household income and average house price by zip code, and a second with the number of therapists and the health insurance available per zip code.
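
The storage and joining plan above could look roughly like the following sketch. The rows are made up, and the column names (`avg_home_value`, `therapist_name`, `therapist_count`) and the helper `clean_zip` are hypothetical placeholders; the real sources will have their own schemas.

```python
import pandas as pd

def clean_zip(series: pd.Series) -> pd.Series:
    """Store zip codes as 5-character strings so they match across sources."""
    return series.astype(str).str.strip().str.zfill(5)

# Table 1 (made-up rows): income and home value by zip code.
income = pd.DataFrame({
    "zip_code": [90001, 90210, 92101],
    "median_household_income": [48000, 150000, 82000],
    "avg_home_value": [520000, 3500000, 790000],
})

# Table 2 (made-up rows): scraped therapist listings, one row per listing.
listings = pd.DataFrame({
    "zip_code": ["90001", "90210", "90210", "90210", "94027"],
    "therapist_name": ["A. Lee", "B. Cruz", "B. Cruz", "C. Shah", "D. Kim"],
})

income["zip_code"] = clean_zip(income["zip_code"])
listings["zip_code"] = clean_zip(listings["zip_code"])

# Drop repeated listings of the same therapist in the same zip code, then count.
therapist_counts = (
    listings.drop_duplicates(subset=["zip_code", "therapist_name"])
    .groupby("zip_code")
    .size()
    .rename("therapist_count")
    .reset_index()
)

# Outer join keeps zip codes found in only one table; `_merge` shows which one.
combined = income.merge(therapist_counts, on="zip_code", how="outer", indicator=True)
print(combined)
```

Keeping the `_merge` column during wrangling makes it easy to see which zip codes have income data but no therapist data (or the reverse) before deciding whether to drop them.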

**Potential datasets:**

Data for distinguishing between high and low income communities:

- https://www.zillow.com/research/data/ (Zillow data; under “Home Values” at the top, choose “All Homes” and set Geography to “City”). We would need to filter the data to CA cities only, then cross-reference city names with zip codes in another Zillow data table in order to get all California zip codes.

- https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes/2023/1-year.html (census data on demographics throughout the U.S. in 2023; includes household income, poverty status, etc.; can be aggregated by zip code/state/county)

- https://data.ca.gov/dataset/income-limits-by-county California State Income Limits reflect updated median income and household income levels for acutely low-, extremely low-, very low-, low- and moderate-income households for California’s 58 counties (if we want to aggregate by county as well)

- http://www.usa.com/rank/california-state--median-household-income--zip-code-rank.htm median household income by zipcode in CA, also includes population of each zip code


Data for Mental Health:

- https://www.theravive.com/zip/ca/ - Lists California by zip code - Potentially web scrape the number of therapists employed in each zip code

- https://locator.apa.org/ - Search feature finds therapists by zip code - Potentially web scrape the number of therapists employed in each zip code

- https://www.cdc.gov/nchs - NHIS Early Release: National Health Insurance Coverage - Has good overall health insurance data 

- https://hbex.coveredca.com/data-research/ - Has data on what health insurance is available in each zip code - This is provided in a CSV - Data would need to be cleaned and most likely re-organized.

- https://behavioralhealth-data.dhcs.ca.gov/datasets/CADHCS::pos2023-adult-demo-data/explore - has data on number of adults 21+ who have received specialty mental health services by county

- https://data.chhs.ca.gov/gl/dataset/licensed-mental-health-rehabilitation-centers-mhrc-and-psychiatric-health-facilities-phf - This dataset contains all Mental Health Rehabilitation Centers (MHRC) and Psychiatric Health Facilities (PHF) licensed by the Department of Health Care Services (DHCS); it includes county and zip code

# Ethics & Privacy

The project proposes a study on mental health, which is already a sensitive issue for many people. Furthermore, the project relies heavily on geolocation (most likely via zip codes), and it may reinforce stereotypes. We should take great care to avoid sweeping generalizations and should not turn the answer to a question like this into social commentary.

- Firstly, there’s a possibility that the hypothesis itself could be misconstrued (i.e. just because someone has access to less money doesn’t necessarily mean that they have access to worse healthcare, or just because someone is rich doesn’t mean that they cannot have mental health issues). 

- Secondly, scraping data involves coming into contact with very sensitive and possibly identifiable information. Many people are sensitive about how much they earn, and even scraping therapist listings might get us information that we do not necessarily need. 

- Thirdly, there may be problems with taking typical values such as mean or median as representative of income. There may be massive variance in income within even a single zip code, and this could skew our findings to the wrong conclusion, which could be misused. 

There may also be significant bias in using websites like Zillow to make conclusions about general income levels of neighborhoods. For example, Zillow may not list both ends of outliers such as abandoned or run-down buildings and large houses that have been passed down through generations without being sold. This is not a huge problem when such houses are outliers, but there are many neighborhoods where such homes are dominant, which may make Zillow a poor source of data in determining average income for these neighborhoods.


In the same order as the problems were listed:

- We will have to be very careful with the words we use, and make sure not to bring in any factors beyond exactly what we are using to form our hypothesis and conclusions (e.g. mean/median house prices or average Zillow listing by zip code, count of therapists in a zip code, etc.)

- For problems regarding privacy, it is our duty to drop any data that we are not using as soon as possible, and to avoid using fringe data in our analysis. We should do exactly what our hypothesis and proposal set out to do, and we should avoid testing the effect of many different factors so that we do not report a spurious relationship that arises purely from the number of relationships we check.

- Since the problem is with typical values, we could consider stratifying our data to account for neighborhoods with high and low variance respectively (e.g. separate bins for rich neighborhoods, poor neighborhoods, and hybrid neighborhoods) 

- Since the problem of bias arises when Zillow does not have listings for areas dominated by run-down or ancestral buildings, we could set a threshold on the number of listings, scaled to the size of an area. If an area has fewer listings than the threshold, it may indicate that Zillow is not representing that area accurately. This will let us decide what to do with outlier neighborhoods on a case-by-case basis.
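
The stratification and listing-threshold ideas above might be prototyped as in the sketch below. Note that it substitutes simple income terciles for the variance-based bins described in the list, since within-zip variance would need finer-grained income data. The rows are made up, the column names (`housing_units`, `zillow_listings`) are placeholders, and the cutoff of 5 listings per 1,000 housing units is an arbitrary assumption that we would have to justify or tune.

```python
import pandas as pd

# Made-up rows; the column names stand in for whatever the real data provides.
zips = pd.DataFrame({
    "zip_code": ["90001", "90210", "92101", "93701", "94027", "95814"],
    "median_household_income": [48000, 150000, 82000, 35000, 250000, 61000],
    "housing_units": [15000, 9000, 22000, 4000, 2500, 6000],
    "zillow_listings": [300, 450, 900, 8, 90, 150],
})

# Three equal-sized income strata (terciles) instead of one statewide average.
zips["income_stratum"] = pd.qcut(
    zips["median_household_income"], q=3, labels=["low", "middle", "high"]
)

# Listings scaled by area size; the cutoff below is an arbitrary placeholder.
MIN_LISTINGS_PER_1K_UNITS = 5
zips["listings_per_1k_units"] = (
    zips["zillow_listings"] / zips["housing_units"] * 1000
)
zips["low_zillow_coverage"] = (
    zips["listings_per_1k_units"] < MIN_LISTINGS_PER_1K_UNITS
)

print(zips[["zip_code", "income_stratum", "low_zillow_coverage"]])
```

Flagged zip codes are not dropped automatically; the flag only marks areas we would review individually.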

**Are there any biases/privacy/terms of use issues with the data you proposed?**

- As mentioned before, data may include personally identifiable information about therapists or house owners, which is a sensitive issue. Furthermore, Zillow may have its own terms of use for data that we scrape from it, which should be kept in mind.

**Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)**

- The data may be problematic because scraping from a house listing website like Zillow may mean excluding homeless populations, which would not be represented on such a dataset. It may also not accurately represent the renting population of the area.

**How will you set out to detect these specific biases before, during, and after/when communicating your analysis?**

- Before analysis, we could check for missing data or skew within specific zip codes. This could help us locate neighborhoods with large quantities of underrepresented homes and/or people. During analysis, we could stratify our data to account for different income groups. This could prevent outliers from shifting our results drastically. After analysis, we could review with our peers to make sure that we have accounted for all issues that we have mentioned and not mentioned, and make sure that our wording is exact.
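
The pre-analysis check for missing data and skew could start from something like this sketch. The rows are made up and the column names are placeholders for the real merged table.

```python
import pandas as pd

# Made-up merged table with a few gaps, standing in for the real data.
df = pd.DataFrame({
    "zip_code": ["90001", "90210", "92101", "93701", "94027", "95814"],
    "median_household_income": [48000, 150000, None, 35000, 250000, 61000],
    "therapist_count": [12, 60, 45, None, 20, 9],
})

# Share of missing values in each column.
missing_share = df.isna().mean()

# Rows (zip codes) with at least one missing value, for manual review.
incomplete = df[df.isna().any(axis=1)]

# Sample skewness of income; a large positive value means a long right tail,
# in which case the median is a safer summary than the mean.
income_skew = df["median_household_income"].skew()

print(missing_share)
print(incomplete["zip_code"].tolist())
print(income_skew)
```

Running the same checks within each income stratum, rather than only statewide, would show whether missing data is concentrated in lower-income areas.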

**Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?**

- As mentioned earlier, the data might unfairly misrepresent incomes and availability of mental health care within neighborhoods. It may not account for more informal mental health practices either, and it may create negative connotations about certain areas being more unfriendly to such practice.

**How will you handle issues you identified?**

- A lot of this is covered in the points mentioned earlier. In addition, we could use additional data sources such as Census data to validate our data categories and make sure that any analysis we do is reasonable.


# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1:* Communication by Discord group chat
* *Team Expectation 2:* Weekly meeting every Monday at 7PM
* *Team Expectation 3:* Be nice
* *Team Expectation 4:* Communicate about deadlines
* *Team Expectation 5:* Ensure each team member has a task to complete

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date | Meeting Time | Completed Before Meeting                                      | Discuss at Meeting                                                                 |
|--------------|--------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------|
| 4/21         | 7 PM         | Read & think about COGS 108 expectations; brainstorm topics/questions | Determine best form of communication; discuss and decide on final project topic; discuss hypothesis; begin background research |
| 4/28         | 7 PM         | Do background research on topic                               | Discuss ideal dataset(s) and ethics; draft project proposal                        |
| 4/30         | 3 PM         | Find Priors (Ryan, Joyce); Find Possible Datasets (Aileen, Genevieve); Discuss Ethics (Arohan) | Edit, finalize, and submit proposal; assign group members to lead each specific part |
| 5/5          | 7 PM         | Import & wrangle data (Unassigned); EDA (Unassigned)          | Discuss wrangling and possible analytical approaches; pick specific datasets       |
| 5/12         | 7 PM         | Finalize wrangling/EDA; begin analysis (Unassigned)           | Review/edit wrangling/EDA; discuss analysis plan                                   |
| 5/19         | 7 PM         | Complete analysis; draft results/conclusion/discussion (Unassigned) | Discuss/edit analysis; complete project check-in                                   |
| 5/26         | 7 PM         | Complete analysis; draft results/conclusion/discussion (Unassigned) | Discuss/edit full project                                                          |
| 6/13         | Before 11:59 PM | NA                                                           | Turn in Final Project & Group Project Surveys                                      |