<a href="https://www.kaggle.com/code/absndus/data-science-portfolio-healthcare-and-boxplots?scriptVersionId=136051134" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Data Science Portfolio - Understanding Healthcare Using Boxplots Notebook ##

### Created by: Albert Schultz ###

### Date Created: 07/07/2023 ###

### Version: 1.00 ###

### Executive Summary ###
In this notebook, I use boxplots to investigate how hospitals in various states across the United States charge their patients for medical procedures.

## Table of Contents ##

1. [Introduction](#1.-Introduction)
2. [Vision and Goals](#2.-Vision-and-Goals)
3. [Load the Healthcare Dataset from US Health and Human Services (HHS)](#3.-Load-the-Healthcare-Dataset-from-US-Health-and-Human-Services-(HHS))
4. [Perform Transformation of the Dataframe](#4.-Perform-Transformation-of-the-Dataframe)
5. [Perform EDA on the Cleaned Dataset](#5.-Perform-EDA-on-the-Cleaned-Dataset)
6. [Summary](#Summary)

## 1. Introduction ##

This section imports the required library modules needed for this lab notebook.

**Initialize the Notebook for data access, import library modules, and set the working directory for this project.**

In [None]:
import pandas as pd #For statistical data analysis. 
import numpy as np #For statistical analysis using various statistical functions. 
from matplotlib import pyplot as plt #For plotting advanced graphs and boxplots. 
import json
import matplotlib.ticker as mtick
import requests
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 2. Vision and Goals ##

The aim is to make sense of the CMS Healthcare Inpatient dataset using boxplots and distribution charts (histograms).

**Vision:** To understand the CMS Inpatient Healthcare dataset and draw meaningful insights from it.

**Goals:** 
1. Pull and review the 2020 to 2021 Inpatient Healthcare hospital dataset from the US HHS datasets dashboard.
2. Review for missing data and mismatched columns.
3. Perform ETL where needed to make the dataframe cleaner and easier to explore.
4. Perform EDA.
5. Create boxplots and distribution histograms.

## 3. Load the Healthcare Dataset from US Health and Human Services (HHS) ##

This section loads the large US HHS CSV file into the notebook. I also review the dataset to get acquainted with it, since it tells the story of what various hospitals charge their patients.

1. Import the healthcare dataset from the United States Department of Health and Human Services (HHS) for 2020 to 2021.

In [None]:
healthcare = pd.read_csv('/kaggle/input/cms-hc-impatient-dataset-2020-to-2021/MUP_IHP_RY23_P03_V10_DY21_PRVSVC2.csv')

2. Print the first five rows. 

In [None]:
healthcare.head()

3. Print the dtypes to review the data type of each column.

In [None]:
healthcare.dtypes

**Dataset Summary:** All of the table columns are stored as strings. The payment and billing columns need to be converted to float instead of being left as strings.

4. Print out the information about the healthcare dataframe.

In [None]:
healthcare.info()

**Dataset Summary:** The columns do not have missing data. However, that does not guarantee that every value is meaningful: some may hold a placeholder such as 'n/a', which is not counted as missing.
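
Since `info()` only counts true nulls, one way to look for placeholder strings is sketched below. This is a minimal sketch on a small made-up frame, not the CMS file; the `sample` data, the `placeholders` list, and the `count_placeholders` helper are all assumptions for illustration and should be adjusted to whatever actually appears in the data.

```python
import pandas as pd

# Hypothetical sample standing in for the healthcare dataframe.
sample = pd.DataFrame({
    "Rndrng_Prvdr_State_Abrvtn": ["AL", "ND", "n/a", "CA"],
    "Avg_Tot_Pymt_Amt": ["1200.50", "N/A", "980.00", ""],
})

# Assumed placeholder spellings (compared after trimming and lower-casing).
placeholders = ["n/a", "na", "none", "null", ""]

def count_placeholders(df):
    """Count placeholder-like strings in every column of df."""
    # Normalise each value to a trimmed, lower-case string before comparing.
    cleaned = df.astype(str).apply(lambda col: col.str.strip().str.lower())
    return cleaned.isin(placeholders).sum()

placeholder_counts = count_placeholders(sample)
print(placeholder_counts)
```

On the real data, the same helper could be called as `count_placeholders(healthcare)` before any numeric conversion.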

## 4. Perform Transformation of the Dataframe ##

In this section, I convert some columns to the proper data types so that EDA is easier to perform.

1. Convert the column names to lowercase to follow common dataframe naming conventions.

In [None]:
healthcare.columns = healthcare.columns.str.lower()
healthcare.head()

2. Change the columns **tot_dschrgs, avg_submtd_cvrd_chrg, avg_tot_pymt_amt, avg_mdcr_pymt_amt** from object to float, and convert **drg_cd** to numeric as well.

In [None]:
#Convert last several columns of the healthcare dataframe from object to numeric.
healthcare['tot_dschrgs'] = pd.to_numeric(healthcare['tot_dschrgs'], errors = 'coerce')
healthcare['avg_submtd_cvrd_chrg'] = pd.to_numeric(healthcare['avg_submtd_cvrd_chrg'], errors = 'coerce')
healthcare['avg_tot_pymt_amt'] = pd.to_numeric(healthcare['avg_tot_pymt_amt'], errors = 'coerce')
healthcare['avg_mdcr_pymt_amt'] = pd.to_numeric(healthcare['avg_mdcr_pymt_amt'], errors = 'coerce')
healthcare['drg_cd'] = pd.to_numeric(healthcare['drg_cd'], errors = 'coerce')
healthcare.dtypes
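
Because `errors = 'coerce'` silently replaces anything it cannot parse with NaN, it can be worth counting how many values were lost in the conversion. The following is a minimal sketch on made-up values; the `sample` frame and the `lost` name are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical values; the real columns come from the CMS file.
sample = pd.DataFrame({
    "tot_dschrgs": ["12", "30", "n/a"],
    "avg_tot_pymt_amt": ["1500.25", "$2,000.00", "975.5"],
})

numeric_cols = ["tot_dschrgs", "avg_tot_pymt_amt"]

# Count the non-null values before conversion for comparison.
before = sample[numeric_cols].notna().sum()

for col in numeric_cols:
    # errors="coerce" turns anything that cannot be parsed into NaN.
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# Number of values per column that became NaN during conversion.
lost = before - sample[numeric_cols].notna().sum()
print(lost)
```

A nonzero count points to placeholder or formatted strings (such as currency symbols) that need cleaning before the conversion.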

3. Print the first five rows of the updated healthcare dataframe.

In [None]:
healthcare.head()

4. Round the payment amount columns to two decimal places.

In [None]:
healthcare[['avg_submtd_cvrd_chrg', 'avg_tot_pymt_amt', 'avg_mdcr_pymt_amt']] = healthcare[['avg_submtd_cvrd_chrg', 'avg_tot_pymt_amt', 'avg_mdcr_pymt_amt']].round(2)

5. Create a new variable called **healthcare_diagnosis** that contains only the unique diagnoses from the healthcare dataset.

In [None]:
healthcare_diagnosis = healthcare['drg_desc'].unique()

6. Create a new variable called **chest_pain_diag_healthcare** that contains only the rows whose diagnosis description starts with 'CHEST' (the 'CHEST PAIN' records).

In [None]:
chest_pain_diag_healthcare = healthcare.loc[healthcare['drg_desc'].str.startswith('CHEST')]
chest_pain_diag_healthcare

7. Filter the chest pain dataset down to providers in **AL for Alabama**.

In [None]:
alabam_chest_pain = chest_pain_diag_healthcare[chest_pain_diag_healthcare['rndrng_prvdr_state_abrvtn'] == 'AL']
alabam_chest_pain

8. Create a variable **al_costs** to store the average covered charges from the Alabama chest pain dataset.

In [None]:
al_costs = alabam_chest_pain['avg_submtd_cvrd_chrg'].values

9. Create dataframes for the diabetes diagnoses and for the state of North Dakota (ND).

In [None]:
diabetes_diag_healthcare = healthcare.loc[healthcare['drg_desc'].str.startswith('DIABETES')] #Filter by starting word DIABETES. 
nodak_diabetes = diabetes_diag_healthcare[diabetes_diag_healthcare['rndrng_prvdr_state_abrvtn'] == 'ND'] #Filter out by state of North Dakota. 
nodak_chestpain = chest_pain_diag_healthcare[chest_pain_diag_healthcare['rndrng_prvdr_state_abrvtn'] == 'ND'] #Filter out by state of North Dakota

## 5. Perform EDA on the Cleaned Dataset ##

In this section, I go over the EDA of the cleaned healthcare dataset along with the new dataframes created. 

1. Create a boxplot of the average covered charges for chest pain in North Dakota.

In [None]:
fig, ax = plt.subplots()
ax.boxplot(nodak_chestpain['avg_submtd_cvrd_chrg'], labels = ['Hospitals'])
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('${x:,.2f}'))
plt.title('North Dakota Chest Pains')
plt.show()
plt.clf()

**Visual Summary:** Covered charges for chest pain in North Dakota appear lower than in other states.
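
For reference when reading these plots, the numbers behind a box can be computed by hand. The sketch below uses made-up charges, not CMS values, and follows the 1.5 × IQR rule that matches Matplotlib's default `whis=1.5`; all variable names are illustrative.

```python
import numpy as np

# Hypothetical charges in dollars, not taken from the CMS data.
charges = np.array([12000.0, 15000.0, 16000.0, 18000.0, 21000.0, 60000.0])

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = np.percentile(charges, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences.
inside = charges[(charges >= lower_fence) & (charges <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point.
outliers = charges[(charges < lower_fence) | (charges > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

In this example only the 60,000 charge falls outside the fences, so it would be drawn as a single outlier point above the upper whisker.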

2. Create a boxplot of the **chest pain** average covered costs for all states to see how providers in each state range from the lowest to the highest covered costs from 2020 to 2021.

In [None]:
#Get the unique state abbreviations in the dataset. 
states = healthcare["rndrng_prvdr_state_abrvtn"].unique()

#Use the for loop to separate the dataset into a dataset for each state. 
datasets = []
for state in states:
    datasets.append(chest_pain_diag_healthcare[chest_pain_diag_healthcare['rndrng_prvdr_state_abrvtn'] == state]['avg_submtd_cvrd_chrg'].values)
    
#Plot one box plot per state of the average covered charge for chest pain. 
fig, ax = plt.subplots(figsize = (20, 6))
ax.boxplot(datasets, labels = states)
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('${x:,.2f}'))
plt.title('Chest Pains Covered Costs Across the US States')
plt.xlabel('States in the US')
plt.ylabel('Cost in $USD')
plt.show()
plt.clf()

**Visual Summary:** California has more outliers than any other state when it comes to covered costs for chest pain.

3. Create a boxplot of the diabetes-related covered costs across North Dakota.
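
A claim about outliers can be checked numerically instead of by eye. The following is a minimal sketch that counts points beyond 1.5 × IQR per state; the `sample` values are made up and `count_outliers` is a hypothetical helper, so the output says nothing about the real CMS data.

```python
import pandas as pd

def count_outliers(values):
    """Count points lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(is_outlier.sum())

# Hypothetical charges for two states.
sample = pd.DataFrame({
    "rndrng_prvdr_state_abrvtn": ["CA"] * 6 + ["ND"] * 4,
    "avg_submtd_cvrd_chrg": [
        30000.0, 32000.0, 35000.0, 36000.0, 38000.0, 150000.0,  # CA
        15000.0, 16000.0, 17000.0, 18000.0,                      # ND
    ],
})

# One outlier count per state, largest first.
outliers_by_state = (
    sample.groupby("rndrng_prvdr_state_abrvtn")["avg_submtd_cvrd_chrg"]
    .apply(count_outliers)
    .sort_values(ascending=False)
)
print(outliers_by_state)
```

Applied to `chest_pain_diag_healthcare`, the same grouping would rank the states by their number of outlying charges.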

In [None]:
fig, ax = plt.subplots()
ax.boxplot(nodak_diabetes['avg_submtd_cvrd_chrg'], labels = ['Hospitals'])
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('${x:,.2f}'))
plt.title('North Dakota Diabetes')
plt.show()
plt.clf()

**Visual Summary:** In North Dakota, the covered charges for diabetes appear higher than those for chest pain.

4. Create a boxplot of the **diabetes** related covered costs from 2020 to 2021 for all states in the United States.

In [None]:
#Get the unique state abbreviations in the dataset. 
states = healthcare["rndrng_prvdr_state_abrvtn"].unique()

#Use the for loop to separate the dataset into a dataset for each state. 
datasets = []
for state in states:
    datasets.append(diabetes_diag_healthcare[diabetes_diag_healthcare['rndrng_prvdr_state_abrvtn'] == state]['avg_submtd_cvrd_chrg'].values)
    
#Plot one box plot per state of the average covered charge for diabetes. 
fig, ax = plt.subplots(figsize = (20, 6))
ax.boxplot(datasets, labels = states)
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter('${x:,.2f}'))
plt.title('Diabetes Covered Costs Across the US States')
plt.xlabel('States in the US')
plt.ylabel('Cost in $USD')
plt.show()
plt.clf()

**Visual Summary:** As the plot shows, diabetes is one of the major factors behind health-related surgeries and diagnoses in the US.
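
The per-state loops used for both all-state plots can also be expressed with `groupby`. The sketch below is one possible alternative, not the notebook's method: `charges_by_state` and the `sample` frame are hypothetical, and dropping NaN values first is an added assumption, since values coerced to NaN can interfere with the quartile calculation.

```python
import pandas as pd

def charges_by_state(df, column="avg_submtd_cvrd_chrg"):
    """Return (labels, arrays): one array of non-missing charges per state."""
    # Drop rows where the charge is missing, then group by state.
    grouped = df.dropna(subset=[column]).groupby("rndrng_prvdr_state_abrvtn")[column]
    labels, arrays = [], []
    for state, values in grouped:
        labels.append(state)
        arrays.append(values.to_numpy())
    return labels, arrays

# Hypothetical rows using the same column names as the healthcare dataframe.
sample = pd.DataFrame({
    "rndrng_prvdr_state_abrvtn": ["ND", "AL", "ND", "AL", "CA"],
    "avg_submtd_cvrd_chrg": [18000.0, 21000.0, None, 25000.0, 40000.0],
})

labels, arrays = charges_by_state(sample)
print(labels)
print([len(a) for a in arrays])
```

The returned lists can be passed to the plotting call as `ax.boxplot(arrays, labels = labels)`. Only states that actually have rows for the diagnosis are included, and they come back in alphabetical order.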

## Summary ##

This notebook walked through creating and reading boxplots to understand the CMS healthcare inpatient hospital dataset. The 2020 to 2021 dataset can be explored further if you wish. Feel free to copy and edit the notebook in your own Kaggle environment.