## Load Full Hospital Cost Reports from SQLite

We connect to the `hospital_costs.db` SQLite database and read the entire `costs_all` table, which contains the raw cost report entries for all hospitals, into a pandas DataFrame. Printing the DataFrame's shape and previewing the first rows confirms that the import succeeded and shows the table's structure.

In [1]:
import sqlite3
import pandas as pd

# Path assumes this notebook lives in Notebooks/, one level below the project root; adjust if it does not
db_path = "../hospital_costs.db"
conn = sqlite3.connect(db_path)

# Load the entire costs_all table into a DataFrame
cost_reports = pd.read_sql_query("SELECT * FROM costs_all;", conn)

# Show shape and a quick preview
print("Rows × Cols:", cost_reports.shape)
cost_reports.head()

Rows × Cols: (29175, 10)


| | CCN | Hospital_Name | Total_Charges | Total_Payment | Address | City | State | County | Ownership | Report_Fiscal_Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 341318 | CHOWAN HOSPITAL INC. | 36330775 | 56681179 | 211 VIRGINIA AVENUE | EDENTON | NC | CHOWAN | 2 | 2018 |
| 1 | 102012 | SPECIALTY HOSPITAL OF JACKSONVILLE | 21692290 | 4729380 | 4901 RICHARD STREET | JACKSONVILLE | FL | DUVAL | 4 | 2018 |
| 2 | 221300 | MARTHAS VINEYARD HOSPITAL | 24038629 | 84069106 | ONE HOSPITAL ROAD | OAK BLUFFS | MA | DUKES | 2 | 2018 |
| 3 | 201302 | LINCOLNHEALTH | 37577199 | 85209941 | 6 ST. ANDREWS LANE | BOOTHBAY HARBOR | ME | LINCOLN | 2 | 2018 |
| 4 | 511306 | ROANE GENERAL HOSPITAL | 10970176 | 35195713 | 200 HOSPITAL DRIVE | SPENCER | WV | ROANE | 2 | 2018 |
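
The loading cell leaves its connection open. One way to scope the connection and fail early on a misspelled table name is sketched below. It runs against a throwaway database in a temporary directory rather than `hospital_costs.db`, and the `load_table` helper and the demo row are hypothetical.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd


def load_table(db_path: str, table: str) -> pd.DataFrame:
    """Hypothetical helper: read one whole table and always close the connection."""
    with closing(sqlite3.connect(db_path)) as conn:
        # Table names cannot be bound as SQL parameters, so check the name
        # against the database's own catalog before interpolating it.
        names = pd.read_sql_query(
            "SELECT name FROM sqlite_master WHERE type = 'table';", conn
        )["name"].tolist()
        if table not in names:
            raise ValueError(f"table {table!r} not found; available: {names}")
        return pd.read_sql_query(f'SELECT * FROM "{table}";', conn)


# Build a tiny stand-in database so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.db")
    with closing(sqlite3.connect(demo_path)) as conn:
        conn.execute("CREATE TABLE costs_all (CCN TEXT, Total_Charges TEXT)")
        conn.execute("INSERT INTO costs_all VALUES ('341318', '36330775')")
        conn.commit()
    demo = load_table(demo_path, "costs_all")

print("Rows × Cols:", demo.shape)
```

`contextlib.closing` is used because a `sqlite3` connection used directly as a context manager only commits or rolls back the transaction; it does not close the connection.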


In [2]:
# Check the current data types of each column
cost_reports.dtypes

CCN                   object
Hospital_Name         object
Total_Charges         object
Total_Payment         object
Address               object
City                  object
State                 object
County                object
Ownership             object
Report_Fiscal_Year     int64
dtype: object
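
Because `Total_Charges` and `Total_Payment` arrive as `object`, it can help to see which entries would fail conversion before coercing them. The following is a minimal sketch on made-up values; the sample data and the `non_numeric_mask` helper are illustrative and not taken from `costs_all`.

```python
import pandas as pd


def non_numeric_mask(series: pd.Series) -> pd.Series:
    """Hypothetical helper: True where a non-missing value cannot be parsed as a number."""
    converted = pd.to_numeric(series, errors="coerce")
    # A conversion failure is a value that was present before but is NaN after.
    return series.notna() & converted.isna()


# Made-up rows: "N/A" and "unknown" stand in for unparseable entries.
sample = pd.DataFrame({
    "Total_Charges": ["36330775", "N/A", "24038629", None],
    "Total_Payment": ["56681179", "4729380", "unknown", "85209941"],
})

for col in ["Total_Charges", "Total_Payment"]:
    bad = non_numeric_mask(sample[col])
    print(col, "unparseable entries:", int(bad.sum()), sample.loc[bad, col].tolist())
```

Values that were already missing are not counted as failures, which separates "bad text" from "no value" when deciding whether dropping rows is acceptable.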

## Data Type Standardization: Numeric Conversion of Key Fields

The dtype check above shows that `CCN`, `Total_Charges`, and `Total_Payment` are stored as `object` rather than numbers. We convert these three fields to numeric types, coercing any non-numeric entries to NaN, and then drop rows with NaN in any of them. Finally, we recheck the data types so that downstream calculations of charge-to-payment ratios operate on numbers rather than strings.

In [3]:
# Convert CCN, Total_Charges, and Total_Payment to numeric types
cost_reports["CCN"] = pd.to_numeric(cost_reports["CCN"], errors="coerce")
cost_reports["Total_Charges"] = pd.to_numeric(cost_reports["Total_Charges"], errors="coerce")
cost_reports["Total_Payment"] = pd.to_numeric(cost_reports["Total_Payment"], errors="coerce")

# Drop any rows where conversion failed (NaN)
cost_reports = cost_reports.dropna(subset=["CCN", "Total_Charges", "Total_Payment"])

# Check dtypes again to confirm
cost_reports.dtypes

CCN                    int64
Hospital_Name         object
Total_Charges          int64
Total_Payment          int64
Address               object
City                  object
State                 object
County                object
Ownership             object
Report_Fiscal_Year     int64
dtype: object
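
One subtlety: `pd.to_numeric(..., errors="coerce")` returns `float64` whenever it has to insert a NaN, and `dropna` does not change that dtype afterwards. The `int64` result above therefore suggests that none of the three columns held unparseable values. The sketch below, on partly made-up values, shows the float case, an explicit cast back to `int64`, and a charge-to-payment ratio, taken here to mean `Total_Charges / Total_Payment` (an assumption, since this section does not define it).

```python
import pandas as pd

# Illustrative rows; "N/A" is a made-up unparseable entry.
sample = pd.DataFrame({
    "CCN": ["341318", "102012", "221300"],
    "Total_Charges": ["36330775", "N/A", "24038629"],
    "Total_Payment": ["56681179", "4729380", "84069106"],
})
numeric_cols = ["CCN", "Total_Charges", "Total_Payment"]

for col in numeric_cols:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# This column is float64 because a NaN had to be inserted.
print(sample["Total_Charges"].dtype)

rows_before = len(sample)
sample = sample.dropna(subset=numeric_cols)
print("rows dropped:", rows_before - len(sample))

# dropna leaves the float dtype in place; cast once no NaN remains.
sample = sample.astype({col: "int64" for col in numeric_cols})

# Charge-to-payment ratio under the assumption stated in the lead-in.
sample["charge_to_payment"] = sample["Total_Charges"] / sample["Total_Payment"]
print(sample[["CCN", "charge_to_payment"]])
```

A numeric `CCN` also loses any leading zeros. If a six-character string form is needed later, for example as a join key, `sample["CCN"].astype(str).str.zfill(6)` restores the padding, assuming every CCN in the data is purely numeric.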

## Persist Cleaned Cost Reports to CSV

We create a directory for cleaned outputs if it does not already exist, then write the cleaned `cost_reports` DataFrame, with numeric CCNs, charges, and payments, to a CSV file (`hospital_costs_sqlite_cleaned.csv`). This gives all subsequent notebooks a standardized, analysis-ready file.

In [4]:
import os
os.makedirs("../data/cleaned", exist_ok=True)

cleaned_csv = "../data/cleaned/hospital_costs_sqlite_cleaned.csv"
cost_reports.to_csv(cleaned_csv, index=False)
print(f"Cleaned CSV written → {cleaned_csv}")

Cleaned CSV written → ../data/cleaned/hospital_costs_sqlite_cleaned.csv
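
A quick way to check the export is to read the file back and compare it with the in-memory frame. The sketch below does this on a small illustrative frame in a temporary directory, so it does not touch `../data/cleaned`; the file name `sample_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the cleaned frame.
sample = pd.DataFrame({
    "CCN": [341318, 102012],
    "Hospital_Name": ["CHOWAN HOSPITAL INC.", "SPECIALTY HOSPITAL OF JACKSONVILLE"],
    "Total_Charges": [36330775, 21692290],
    "Total_Payment": [56681179, 4729380],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data", "cleaned")
    os.makedirs(out_dir, exist_ok=True)
    csv_path = os.path.join(out_dir, "sample_cleaned.csv")

    sample.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Raises AssertionError if values, column order, or dtypes differ.
pd.testing.assert_frame_equal(sample, reloaded)
print("round trip OK:", reloaded.shape)
```

On the real frame, a column that is still `object` but holds only digit strings (`Ownership` may be one) would likely be re-inferred as an integer column by `read_csv`. In that case, pass `check_dtype=False` to the comparison or a `dtype=` mapping to `read_csv`.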
