# Data Exploration

This document outlines the plan for cleaning, splitting, and enhancing the initial dataset of the California Hospital Chargemasters before it is loaded into the final data warehouse.

### 1. Is the data homogenous in each column?

In a sample of the data from the acquired Excel sheet, all final columns are homogenous once non-numeric entries are coerced and dropped: the description values are all strings, the CPT code values are integers, and the price values are floats.

In [1]:
import pandas as pd

# Load the first sheet of the workbook, using its first column as the index
filepath = "Datasets/ChargeMasters.xlsx"
df = pd.read_excel(filepath, sheet_name=0, index_col=0)
df.columns = ['Description', 'CPT Code', 'Amount']

# Coerce non-numeric codes and prices to NaN so they can be dropped below
df['CPT Code'] = pd.to_numeric(df['CPT Code'], errors='coerce')
df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')

# Drop rows with any missing value, then store CPT codes as integers
df = df.dropna(axis=0)
df['CPT Code'] = df['CPT Code'].astype(int)

print(df.dtypes)
print("All values under the Description column are strings:", all(isinstance(value, str) for value in df['Description']))

Description     object
CPT Code         int32
Amount         float64
dtype: object
All values under the Description column are strings: True


### 2. How do you anticipate this data will be used by data analysts and scientists downstream?

With hospitals and clinics contributing to half of total health spending in the US, it would be beneficial to track the expense breakdowns of these facilities. I anticipate that this data, which contains over 10 years of hospital charge history, will be used to determine the cost of medical attention at the procedure level and to compare these price changes against other health expenditures. For instance, analysts could compare hospital expenses with prescription medicine, nursing care, or dental care, and find the distribution of total healthcare expenses per year across these categories.

Furthermore, as out-of-pocket expenses and health insurance premiums continue to increase, it would help patients to understand the cost of their healthcare needs prior to hospital admission. Data analysts could reveal the actual costs of these procedures and identify the gaps between health insurance coverage and hospital charges. The data could also be used to compare prices between hospitals for the same procedures, so patients would have a more holistic understanding before seeking medical service.

More generally, the hospital charge data could also be used in the public health sector to find out how medical costs vary across regions. It could reveal how average income, tax rate, or overall population health affect the cost of the same procedure in different cities.
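As a rough illustration of the hospital-to-hospital comparison described above, the sketch below pivots a consolidated charge table so that each procedure has one row and each hospital one column. The hospital names, codes, and amounts are made up for the example, and the consolidated layout (one row per hospital and procedure) is an assumption about how the cleaned data will look.

```python
import pandas as pd

# Hypothetical consolidated charges; all values are invented for illustration.
charges = pd.DataFrame({
    "Hospital Name": ["Hospital A", "Hospital B", "Hospital A", "Hospital B"],
    "CPT Code": [99213, 99213, 71046, 71046],
    "Amount": [150.0, 175.0, 80.0, 60.0],
})

# One row per procedure, one column per hospital.
price_table = charges.pivot_table(
    index="CPT Code", columns="Hospital Name", values="Amount", aggfunc="mean"
)

# Gap between the most and least expensive hospital for each procedure.
price_table["Spread"] = price_table.max(axis=1) - price_table.min(axis=1)

print(price_table)
```

Sorting `price_table` by `Spread` would surface the procedures whose price differs most between hospitals.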

### 3. Does your answer to the last question give you an indication of how you can store the data for optimal querying speed and storage file compression?

I will store the hospital charge data in a separate table for each year, with each table containing all the master charges available for that year. I will also generate a hospital code for each hospital name, so that the data can be indexed on that code and information on each hospital retrieved much faster.
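A minimal sketch of this layout, assuming the charges have already been consolidated into one DataFrame with a `Year` column (a hypothetical name; the source sheets may record the year differently). It uses `pd.factorize` to assign an integer code to each hospital name and then splits the data into one table per year. The sample rows are invented.

```python
import pandas as pd

# Hypothetical consolidated charges; column names and values are illustrative only.
charges = pd.DataFrame({
    "Hospital Name": ["Hospital A", "Hospital B", "Hospital A", "Hospital B"],
    "Year": [2019, 2019, 2020, 2020],
    "CPT Code": [99213, 99213, 99213, 99213],
    "Amount": [150.0, 175.0, 160.0, 180.0],
})

# Assign a compact integer code to each distinct hospital name.
codes, names = pd.factorize(charges["Hospital Name"], sort=True)
charges["Hospital Code"] = codes

# Lookup table mapping each code back to its hospital name.
hospitals = pd.DataFrame({
    "Hospital Code": range(len(names)),
    "Hospital Name": names,
})

# One table per year, keyed by the hospital code instead of the full name.
tables_by_year = {
    year: group.drop(columns=["Hospital Name", "Year"]).reset_index(drop=True)
    for year, group in charges.groupby("Year")
}

print(hospitals)
print(tables_by_year[2019])
```

Storing the integer code in the yearly tables and the name only once in `hospitals` avoids repeating long strings on every row, which should help both compression and lookups.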

### 4. What cleaning steps do you need to perform to make your dataset ready for consumption?

I need to ensure that all hospital charges are formatted the same way as the dataframe from step 1, and that all NA values are removed to save disk space. I would also like to restructure the data so that all sheets are consolidated into one table, with 'Hospital Name' stored as an additional column rather than as the sheet name.
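The consolidation step could look roughly like the sketch below. It assumes every sheet has the same three columns as the sample in step 1 and that the sheet name is the hospital name. The helper `consolidate_sheets` is a hypothetical name, and the in-memory `sheets` dictionary stands in for the result of `pd.read_excel(filepath, sheet_name=None, index_col=0)`, which returns one DataFrame per sheet.

```python
import pandas as pd

def consolidate_sheets(sheets):
    """Combine {sheet name: DataFrame} into one table with a 'Hospital Name' column."""
    frames = []
    for hospital_name, sheet in sheets.items():
        frame = sheet.copy()
        frame.columns = ["Description", "CPT Code", "Amount"]
        # Coerce non-numeric codes and prices to NaN so they can be dropped.
        frame["CPT Code"] = pd.to_numeric(frame["CPT Code"], errors="coerce")
        frame["Amount"] = pd.to_numeric(frame["Amount"], errors="coerce")
        # Drop incomplete rows, then store CPT codes as integers as in step 1.
        frame = frame.dropna(axis=0).astype({"CPT Code": int})
        # Record the sheet name as a regular column.
        frame.insert(0, "Hospital Name", hospital_name)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Invented stand-in for the dictionary returned by read_excel with sheet_name=None.
sheets = {
    "Hospital A": pd.DataFrame({
        "d": ["Office visit", "X-ray"], "c": ["99213", "n/a"], "p": [150.0, 80.0],
    }),
    "Hospital B": pd.DataFrame({
        "d": ["Office visit"], "c": [99213], "p": ["175.50"],
    }),
}

combined = consolidate_sheets(sheets)
print(combined)
```

In this toy input the "X-ray" row is dropped because its code cannot be parsed as a number, mirroring the `errors='coerce'` and `dropna` behaviour from step 1.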

### 5. What wrangling steps do you need to perform to enrich your dataset with additional information?

In addition to the Hospital Chargemaster dataset, I also want to include datasets of the CPT codes, hospital locations, the insurance cost for each procedure, and the insurance providers offered at each hospital. An entity-relationship diagram is shown below to illustrate the f

<img src="https://github.com/beatricetierra/US-Hospital-Charges/blob/main/ERD.png">
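One way the enrichment could be carried out is with left joins from the charge table onto each lookup dataset. The sketch below is only an assumption about the join keys (`CPT Code` and `Hospital Name`). The lookup tables `cpt_codes` and `locations`, their columns, and all values are hypothetical and do not come from the actual datasets or the diagram.

```python
import pandas as pd

# Hypothetical consolidated charges; all values are invented for illustration.
charges = pd.DataFrame({
    "Hospital Name": ["Hospital A", "Hospital B", "Hospital A"],
    "CPT Code": [99213, 99213, 11111],
    "Amount": [150.0, 175.0, 42.0],
})

# Hypothetical lookup tables standing in for the additional datasets.
cpt_codes = pd.DataFrame({
    "CPT Code": [99213],
    "CPT Description": ["Example office visit"],
})
locations = pd.DataFrame({
    "Hospital Name": ["Hospital A", "Hospital B"],
    "City": ["Los Angeles", "Sacramento"],
})

# Left joins keep every charge row; validate guards against duplicate lookup keys.
enriched = (
    charges
    .merge(cpt_codes, on="CPT Code", how="left", validate="many_to_one")
    .merge(locations, on="Hospital Name", how="left", validate="many_to_one")
)

# Charges whose CPT code has no match in the lookup table.
unmatched = enriched[enriched["CPT Description"].isna()]

print(enriched)
print(unmatched)
```

Using `how="left"` means charges with an unknown code are kept with missing descriptions rather than silently dropped, so `unmatched` can be reviewed before loading.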