# Analysis of Healthcare Payment Data for the Top 25 Costliest Drugs in the United States (2021-2022)
### by Claire O'Brien, Riley Yu, and Mike

## Introduction 
In this project, we analyzed patterns in the cost of the top 25 costliest drugs in the United States between 2021 and 2022. Our motivation is to help people understand which factors (such as drug category and payer type) influence prescription drug costs the most. Healthcare costs are a massive burden on the U.S. population, and documentation that supports a layperson's understanding of healthcare billing is scarce. Modeling and visualizing what influences the costs of the most expensive drugs in the U.S. is therefore valuable: it provides insight into which routes of obtaining necessary medication are most affordable.

## Data Description and Cleanup Practices
The data we use comes from the [California Open Data Portal](https://data.ca.gov/dataset/healthcare-payments-data-hpd-fee-for-service-drug-costs). It is a composite dataset that breaks down the top 25 most expensive drugs by payer type, drug category, and the type of information collected on each drug (median, frequency, cost, etc.). To make the data more digestible and manageable for analysis, we split the composite dataset into separate datasets, each determined by one analysis type (freq, cost, median) and one table type (overall costs, regular payers, or payers excluding Medi-Cal). Each of these smaller datasets is easier to reference.
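
The split described above can be sketched with a pandas `groupby`. This is a minimal illustration, not the code in data-cleanup.ipynb: the column names (`analysis_type`, `table_type`, `drug_name`, `value`), the table-type labels, and the toy rows are all hypothetical stand-ins for the real schema.

```python
import pandas as pd

# Toy stand-in for the composite dataset. Column names and values are
# hypothetical, not the actual schema of the portal's file.
composite = pd.DataFrame(
    {
        "analysis_type": ["cost", "cost", "cost", "freq", "median", "freq"],
        "table_type": [
            "overall",
            "payer",
            "payer_excl_medi_cal",
            "overall",
            "payer",
            "payer",
        ],
        "drug_name": ["drug_a", "drug_b", "drug_c", "drug_a", "drug_b", "drug_c"],
        "value": [1200.0, 950.0, 870.0, 40.0, 310.0, 75.0],
    }
)


def split_composite(df, keys=("analysis_type", "table_type")):
    """Return one DataFrame per unique combination of the key columns."""
    subsets = {}
    for combo, group in df.groupby(list(keys)):
        # Older pandas versions yield a scalar when grouping by one key.
        if not isinstance(combo, tuple):
            combo = (combo,)
        name = "_".join(str(part) for part in combo)
        subsets[name] = group.reset_index(drop=True)
    return subsets


subsets = split_composite(composite)
for name, frame in subsets.items():
    print(name, len(frame))
```

Keying the result by a combined name such as `cost_overall` keeps each subset addressable by name, which is the property the smaller datasets are meant to provide.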

## Results

Our analyses preliminarily suggest that both payer type and drug type (i.e., whether the drug is brand name, generic, biologic, or biosimilar) strongly influence the cost of a drug, while the number of prescriptions for the drug is much less influential. We were also able to use a Random Forest model to predict drug cost from payer type, drug type, and utilization.


## Drug Category's Impact on Ranked Cost
In this analysis, we focused on how drug category (i.e., whether a drug is biologic, biosimilar, brand, or generic) affects the cost of the drug. We compared the top 25 costliest drugs in the U.S. across drug types. For every drug type, cost decreases gradually as rank decreases, but there is a very marked difference in cost between drug types. For instance, Generic and Biosimilar drugs are much lower in cost than Biologic and Brand drugs. This is likely due to pharmaceutical markup, which is consistent with the fact that, among all the categories, brand-name drugs tend to be the most expensive. Replacements such as Biosimilar and Generic drugs, which may be less heavily marketed but are still effective, differ dramatically in cost, by roughly one to two orders of magnitude.
![rank_vs_totalcost_by_drug_category.png](attachment:735ea4b5-9468-432a-b4a9-437f5ad52251.png)
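
The size of the gap between categories can be expressed as a base-10 log ratio, where a value of 1 means a tenfold difference and 2 means a hundredfold difference. The sketch below assumes two hypothetical total costs. The numbers are illustrative only and are not taken from the dataset.

```python
import math


def log10_gap(cost_a, cost_b):
    """Number of orders of magnitude by which cost_a exceeds cost_b."""
    if cost_a <= 0 or cost_b <= 0:
        raise ValueError("costs must be positive")
    return math.log10(cost_a / cost_b)


# Hypothetical total costs for two drugs at the same rank.
brand_total = 5.0e8
generic_total = 2.0e7

gap = log10_gap(brand_total, generic_total)
print(f"{gap:.2f} orders of magnitude")
```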

The visualization below also shows that cost is concentrated very differently across the drug type groups. A small number of drugs account for the most extreme costs among Biologic and Brand drugs, while cost is less concentrated among Generic and Biosimilar drugs. We hypothesized that this may be because some brand-name drugs are prescribed to fewer people, so their price is raised severalfold to make them "worth" selling.
![rank_vs_totalcost_facet_drug_category.png](attachment:c18ae45c-1c73-44c0-a3f3-645f188b098c.png)
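
One way to put a number on this kind of concentration is the share of a category's total cost that comes from its k most expensive drugs. This is a sketch of that metric under our own assumptions. It is not a statistic computed in the notebooks, and the cost lists are hypothetical.

```python
def top_k_share(costs, k):
    """Fraction of the summed cost contributed by the k largest values."""
    if k <= 0:
        raise ValueError("k must be positive")
    ordered = sorted(costs, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total cost must be positive")
    return sum(ordered[:k]) / total


# Hypothetical per-drug total costs within two categories.
concentrated = [50.0, 30.0, 10.0, 10.0]
even = [25.0, 25.0, 25.0, 25.0]

print(top_k_share(concentrated, 1))
print(top_k_share(even, 1))
```

A share near 1 for a small k means a handful of drugs dominate the category, while a share near k divided by the number of drugs means cost is spread evenly.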

Our next analysis explored this question in greater depth. The scatter plot below shows only a very weak association between total cost and the number of prescriptions. This preliminarily indicates that the number of prescriptions for a medication per year has much less to do with its overall cost than we initially expected, suggesting that we overestimated the prevalence of price hikes related to medication scarcity.
![totalcost_vs_prescriptions.png](attachment:692e8fe1-9fda-4438-ba70-16491070949e.png)
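
The strength of an association like this one can be summarized with a Pearson correlation coefficient. The sketch below computes it on log-transformed values, which is our assumption for handling costs and counts that span several orders of magnitude. It is not necessarily how the plot was produced, and the sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def log_pearson_r(costs, counts):
    """Pearson r computed on log10-transformed positive values."""
    return pearson_r(
        [math.log10(c) for c in costs],
        [math.log10(n) for n in counts],
    )


# Hypothetical total costs and prescription counts for six drugs.
costs = [9.1e8, 4.2e8, 3.9e8, 1.5e8, 8.0e7, 6.5e7]
counts = [12000, 450000, 3000, 90000, 1500, 700000]

r = log_pearson_r(costs, counts)
print(f"r = {r:.3f}")
```

Values of r close to 0 correspond to the weak association described above, while values near 1 or -1 would indicate a strong linear relationship on the log scale.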

We also explored how patterns in drug usage change from year to year. The bar graph below shows that prescription frequency increased from 2021 to 2022 in all categories, suggesting a shift in either medication need or prescribing practices among medical professionals. Additionally, Generic drugs remain the most frequently prescribed drugs in both years, suggesting that insurers, doctors, and likely patients themselves are turning towards more cost-friendly yet still effective treatments.
![top_freq_prescriptions_by_category_year.png](attachment:1a6256fe-e5ae-4f3e-8d25-d2ccee5d5d71.png)

In our final pre-modeling graphical analysis, we interrogated how payer type influences the initial billing cost of a drug. While Commercial and Medicare median costs for top 25 drugs are somewhat high, Medi-Cal median costs are dramatically lower, potentially reflecting the unique healthcare landscape in California, which tends towards providing subsidies. 
![payer_med_totalcost_by_payer.png](attachment:473d4399-7a5f-439b-83a6-2fddab97de43.png)
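
A per-payer median like the one plotted above can be computed by grouping cost records by payer type. The sketch below uses only the standard library. The records are hypothetical toy values, not figures from the dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (payer, cost) records. Real values come from the payer tables.
records = [
    ("Commercial", 1800.0),
    ("Commercial", 2200.0),
    ("Medicare", 1500.0),
    ("Medicare", 1900.0),
    ("Medi-Cal", 300.0),
    ("Medi-Cal", 500.0),
    ("Medi-Cal", 400.0),
]


def median_cost_by_payer(rows):
    """Group costs by payer type and return the median cost of each group."""
    grouped = defaultdict(list)
    for payer, cost in rows:
        grouped[payer].append(cost)
    return {payer: median(costs) for payer, costs in grouped.items()}


medians = median_cost_by_payer(records)
for payer, value in medians.items():
    print(payer, value)
```

With the data in a pandas DataFrame, the same aggregation would be `df.groupby("payer")["cost"].median()`, where `payer` and `cost` are assumed column names.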

## Modeling Results
From here, we fit Random Forest, Ridge, and Baseline Mean models to predict cost from a variety of factors, including payer type, drug type, and utilization. Our intent was to determine whether the insurance-related factors that determine drug price follow a distinct, tractable pattern that can predict the price of more costly drugs, and in turn help users understand more about what influences healthcare prices. Based on test R^2, the Random Forest model performed best.
![Screenshot 2025-12-17 at 6.10.05 PM.png](attachment:0df9af77-79f1-43a9-be6f-e52904a1cde2.png)
After testing and improving the model, we arrived at a final R^2 of 0.95, and our model appears to fit the data fairly well, with a few marked outliers. 
![residuals.png](attachment:2741e207-c642-465b-856a-cbe817bf4fd7.png)
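
The three-way comparison by test R^2 can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code in model_building.ipynb: the data is synthetic, the feature encoding (integer codes for payer and drug category, plus a log prescription count) is hypothetical, and the hyperparameters are arbitrary defaults.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic, hypothetical features: payer code (0-2), drug category
# code (0-3), and log10 of prescription count.
payer = rng.integers(0, 3, size=n)
category = rng.integers(0, 4, size=n)
log_rx = rng.normal(4.0, 1.0, size=n)

# Synthetic log-cost target with an interaction term and noise.
y = (
    2.0 * category
    + 1.5 * payer * (category > 1)
    + 0.1 * log_rx
    + rng.normal(0.0, 0.3, size=n)
)
X = np.column_stack([payer, category, log_rx])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "baseline_mean": DummyRegressor(strategy="mean"),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: test R^2 = {score:.3f}")
```

The mean baseline gives a reference point: a constant prediction can never score above 0 on test R^2, so any model worth keeping should clear that bar. In a real pipeline, the categorical features would more likely be one-hot encoded before fitting Ridge, since integer codes impose an artificial ordering.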

## Author Contributions 
### Claire O'Brien

Initial Project Ideation: Found the initial data and outlined project ideas/potential modeling tasks.

Data Cleanup and Analysis: Wrote the main.ipynb notebook, did data cleanup in the data-cleanup.ipynb notebook.

MyST website: Initialized myst.yml and designed the website layout, deployed to GitHub Pages, and used MyST to populate the pdf_builds folder.

Project Structure: Created data, images, and results folders. Created pdf-builds folder. Populated README.md and made binder badge. Made contribution-statement.md and ai-documentation.txt files. Added a license and BibTeX file.

### Riley Yu

Data Analysis: Wrote the data-visualization.ipynb notebook.

### Lin Da Miao

Data Analysis: Wrote the model_building.ipynb notebook.

Testing and packaging: Made src package and pytest tests.

Environment and Make: Wrote environment.yml and makefile.

## References 
California Open Data Portal. (2024). Healthcare Payments Data (HPD): Fee-For-Service Drug Costs. Department of Health Care Access and Information, Healthcare Payments Data (HPD) Program. https://data.ca.gov/dataset/healthcare-payments-data-hpd-fee-for-service-drug-costs