<a id='Exploratory Data Analysis'></a>

## <span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">Exploratory Data Analysis</span>
    

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">1.1 Load the libraries:</span>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">1.2 Import the dataset (HH_Provider_Apr2025.csv):</span>

In [2]:
data = pd.read_csv('../data/HH_Provider_Apr2025.csv')  # Import the dataset named 'HH_Provider_Apr2025.csv'
data.head()  # View the first 5 rows of the data

Unnamed: 0,State,CMS Certification Number (CCN),Provider Name,Address,City/Town,ZIP Code,Telephone Number,Type of Ownership,Offers Nursing Care Services,Offers Physical Therapy Services,...,PPH Denominator,PPH Observed Rate,PPH Risk-Standardized Rate,PPH Risk-Standardized Rate (Lower Limit),PPH Risk-Standardized Rate (Upper Limit),PPH Performance Categorization,Footnote for PPH Risk-Standardized Rate,"How much Medicare spends on an episode of care at this agency, compared to Medicare spending across all agencies nationally","Footnote for how much Medicare spends on an episode of care at this agency, compared to Medicare spending across all agencies nationally","No. of episodes to calc how much Medicare spends per episode of care at agency, compared to spending at all agencies (national)"
0,AK,27001,PROVIDENCE HOME HEALTH ALASKA,"4001 DALE STREET, SUITE 101",ANCHORAGE,99508,9075630130,NON-PROFIT,Yes,Yes,...,455,10.11,8.37,6.56,10.49,Same As National Rate,-,0.96,-,1345
1,AK,27002,HOSPICE & HOME CARE OF JUNEAU,1803 GLACIER HIGHWAY,JUNEAU,99801,9074633113,NON-PROFIT,Yes,Yes,...,-,-,-,-,-,-,This measure currently does not have data or p...,0.71,-,151
2,AK,27006,FAIRBANKS MEMORIAL HOSPITAL HHA,1701 GILLAM WAY,FAIRBANKS,99701,9074585410,NON-PROFIT,Yes,Yes,...,134,5.97,7.43,4.41,11.42,Same As National Rate,-,0.77,-,439
3,AK,27008,ANCORA HOME HEALTH & HOSPICE,3831 E BLUE LUPINE DRIVE,WASILLA,99654,9075610700,PROPRIETARY,Yes,Yes,...,949,6.32,6.85,5.42,8.59,Better Than National Rate,-,0.85,-,2406
4,AK,27009,PETERSBURG MEDICAL CENTER HOME,P.O. BOX 589,PETERSBURG,99833,9077724291,GOVERNMENT OPERATED,Yes,Yes,...,62,8.06,11.06,6.71,18.52,Same As National Rate,-,1.09,-,232


<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">1.3 Check the dimensions of the data.</span>

In [3]:
data.shape # see the shape of the data

(12069, 96)

**There are 12,069 observations (rows) and 96 attributes (columns).**

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">1.4 Check the information about the data and the data type of each attribute.</span>

In [4]:
data.info() # To see the data type of each of the variable, number of values entered in each of the variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12069 entries, 0 to 12068
Data columns (total 96 columns):
 #   Column                                                                                                                                    Non-Null Count  Dtype 
---  ------                                                                                                                                    --------------  ----- 
 0   State                                                                                                                                     12069 non-null  object
 1   CMS Certification Number (CCN)                                                                                                            12069 non-null  int64 
 2   Provider Name                                                                                                                             12069 non-null  object
 3   Address                                                 

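**There are two numerical variables and ninety-four non-numerical (`object`) variables. Several of the `object` columns hold numeric measurements stored as text, because missing values appear as `-` in the raw file (see row 1 of the preview above).**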
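The dtype count can be checked directly, and text columns that are really numeric can be flagged. The sketch below uses a small toy frame as a stand-in for `data`; the frame, its shortened column names, and the helper `looks_numeric` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for `data` (assumed values, shortened column names)
toy = pd.DataFrame({
    "State": ["AK", "AK", "AL"],
    "CCN": [27001, 27002, 17000],
    "PPH Observed Rate": ["10.11", "-", "5.97"],
})

# Number of columns per dtype, the same information data.info() lists row by row
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

def looks_numeric(series, placeholder="-"):
    """Hypothetical helper: True if every non-placeholder value parses as a number."""
    values = series[series != placeholder]
    return len(values) > 0 and bool(pd.to_numeric(values, errors="coerce").notna().all())

# Non-numeric columns whose contents are numbers stored as text
text_cols = toy.select_dtypes(exclude="number").columns
numeric_like = [col for col in text_cols if looks_numeric(toy[col])]
print(numeric_like)
```

On the real dataset the same check would be run on `data` instead of `toy`; which columns it flags depends on the file contents.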

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">1.5 Check for missing data.</span>

In [5]:
# Check for missing values
print("Missing values per column:")
print(data.isnull().sum().sort_values(ascending=False))


Missing values per column:
State                                                                                                                                       0
CMS Certification Number (CCN)                                                                                                              0
Provider Name                                                                                                                               0
Address                                                                                                                                     0
City/Town                                                                                                                                   0
                                                                                                                                           ..
PPH Performance Categorization                                                                                           
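Note that `isnull()` only counts true nulls. The preview in section 1.2 shows `-` used where a measure has no data, and those cells are ordinary strings. A minimal sketch of two ways to surface them, using a made-up three-row CSV (the values and shortened column names are assumptions for illustration):

```python
import io
import pandas as pd

# Made-up CSV that mimics the '-' placeholder seen in the preview
csv_text = """State,PPH Observed Rate,Medicare Cost Ratio
AK,10.11,0.96
AK,-,0.71
AL,5.97,-
"""

raw = pd.read_csv(io.StringIO(csv_text))

# isnull() finds nothing, because '-' is a regular string value
null_counts = raw.isnull().sum()

# Option 1: count literal '-' cells per column
placeholder_counts = (raw == "-").sum()

# Option 2: treat '-' as missing at load time via na_values
parsed = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
parsed_null_counts = parsed.isnull().sum()

print(placeholder_counts)
print(parsed_null_counts)
```

With `na_values=["-"]` the affected columns are also parsed as floats, which removes the need for a later string-to-number conversion on those columns.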

<span style="font-family: Arial; font-weight:bold;font-size:1.5em;color:#00b3e5;">1.6 Summary of Initial EDA: show shape, data types, and basic info.</span>

In [6]:
# Initial EDA: show shape, data types, and basic info
initial_eda = {
    "Shape": data.shape,
    "Column Names": data.columns.tolist(),
    "Data Types": data.dtypes,
    "Missing Values": data.isnull().sum().sort_values(ascending=False),
    "Sample Data": data.head(3)
}

initial_eda

{'Shape': (12069, 96),
 'Column Names': ['State',
  'CMS Certification Number (CCN)',
  'Provider Name',
  'Address',
  'City/Town',
  'ZIP Code',
  'Telephone Number',
  'Type of Ownership',
  'Offers Nursing Care Services',
  'Offers Physical Therapy Services',
  'Offers Occupational Therapy Services',
  'Offers Speech Pathology Services',
  'Offers Medical Social Services',
  'Offers Home Health Aide Services',
  'Certification Date',
  'Quality of patient care star rating',
  'Footnote for quality of patient care star rating',
  "Numerator for how often the home health team began their patients' care in a timely manner",
  "Denominator for how often the home health team began their patients' care in a timely manner",
  "How often the home health team began their patients' care in a timely manner",
  "Footnote for how often the home health team began their patients' care in a timely manner",
  'Numerator for how often the home health team determined whether patients received a flu s
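With 96 columns, the dictionary output above is long and hard to scan. One alternative, sketched here on a toy frame (the frame and its values are assumptions), is to collect the per-column facts into a single summary table:

```python
import pandas as pd

# Toy stand-in for `data` (assumed values)
toy = pd.DataFrame({
    "State": ["AK", "AK", "AL"],
    "Quality_Star": [4.5, None, 2.5],
})

# One row per column: dtype, number of nulls, number of distinct non-null values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isnull().sum(),
    "unique": toy.nunique(),
})
print(summary)
```

The three Series share the column names as their index, so they align automatically when combined.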

<span style="font-family: Arial; font-weight:bold;font-size:1.7em;color:#00b3e5;">Key Features to Use</span>
### Use a subset of the dataset representing cost, quality, and care outcomes:

| Feature                               | Type                                        | Purpose                               |
| ------------------------------------- | ------------------------------------------- | ------------------------------------- |
| `How much Medicare spends...`         | Numeric                                     | Proxy for **cost**                    |
| `Quality of patient care star rating` | Numeric                                     | Proxy for **quality**                 |
| `PPH Risk-Standardized Rate`          | Numeric                                     | Proxy for **avoidable hospital use**  |
| `DTC Risk-Standardized Rate`          | Numeric                                     | Care transition outcomes              |
| `PPR Risk-Standardized Rate`          | Numeric                                     | Readmission rates                     |
| `Type of Ownership`                   | Categorical → One-hot                       | For subgroup analysis post-clustering |
| `State`                               | Categorical                                 | For region-based visual analysis      |
| `No. of episodes to calc...`          | Numeric                                     | Episode Count                         |
| `Provider Name`                       | Categorical                                 | Name of Provider                      |
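The "Categorical → One-hot" entry in the table can be sketched with `pd.get_dummies`. The toy frame and the `Ownership` prefix below are assumptions chosen for illustration; the three ownership labels are taken from the data preview.

```python
import pandas as pd

# Toy stand-in with the ownership labels seen in the preview
toy = pd.DataFrame({
    "Provider Name": ["A", "B", "C"],
    "Type of Ownership": ["NON-PROFIT", "PROPRIETARY", "GOVERNMENT OPERATED"],
})

# One indicator column per ownership category
ownership_dummies = pd.get_dummies(toy["Type of Ownership"], prefix="Ownership")

# Keep the original label alongside the indicators for later subgroup analysis
encoded = pd.concat([toy, ownership_dummies], axis=1)
print(encoded.columns.tolist())
```

Since ownership is meant for subgroup analysis after clustering, the indicator columns would typically be kept out of the feature matrix passed to the clustering step.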


<span style="font-family: Arial; font-weight:bold;font-size:1.7em;color:#00b3e5;">Feature Selection and Data Cleaning</span>

In [7]:
# ---- Step 1: Initial Cleanup ----
# Remove footnote columns, raw numerators/denominators, addresses, and contact details
columns_to_drop = [col for col in data.columns if any(keyword in col for keyword in [
    'Footnote', 'Numerator', 'Denominator', 'Address', 'Telephone', 'ZIP', 'City/Town',
    'Transfer of Health Information', 'Certification Date', 'Offers', 'CMS Certification'
])]

data.drop(columns=columns_to_drop, inplace=True, errors='ignore')

# ---- Step 2: Rename columns for clarity ----
data.rename(columns={
    'Provider Name': 'Provider',
    'Type of Ownership': 'Ownership',
    'Quality of patient care star rating': 'Quality_Star',
    'DTC Risk-Standardized Rate': 'DTC_Rate',
    'PPR Risk-Standardized Rate': 'PPR_Rate',
    'PPH Risk-Standardized Rate': 'PPH_Rate',
    'How much Medicare spends on an episode of care at this agency, compared to Medicare spending across all agencies nationally': 'Medicare_Cost_Ratio',
    'No. of episodes to calc how much Medicare spends per episode of care at agency, compared to spending at all agencies (national)': 'Episode_Count'
}, inplace=True)

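# ---- Step 3: Clean numeric columns ----
# These columns are loaded as text because missing values appear as '-' in the raw file
numeric_cols = ['Quality_Star', 'DTC_Rate', 'PPR_Rate', 'PPH_Rate', 'Medicare_Cost_Ratio', 'Episode_Count']
for col in numeric_cols:
    # Remove thousands separators, then mark the '-' placeholder as missing
    data[col] = data[col].astype(str).str.replace(',', '', regex=False).replace('-', pd.NA)
    # Convert to numbers; anything that still fails to parse becomes NaN
    data[col] = pd.to_numeric(data[col], errors='coerce')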

# ---- Step 4: Drop rows with critical missing values ----
data.dropna(subset=numeric_cols, inplace=True)

# ---- Step 5: Summary of cleaned dataset ----
print("Cleaned DataFrame Shape:", data.shape)
print("Remaining Columns:", data.columns.tolist())
print("\nMissing Values Summary:\n", data.isnull().sum())
data.head()

Cleaned DataFrame Shape: (5797, 32)
Remaining Columns: ['State', 'Provider', 'Ownership', 'Quality_Star', "How often the home health team began their patients' care in a timely manner", 'How often the home health team determined whether patients received a flu shot for the current flu season', 'How often patients got better at walking or moving around', 'How often patients got better at getting in and out of bed', 'How often patients got better at bathing', "How often patients' breathing improved", 'How often patients got better at taking their drugs correctly by mouth', 'Changes in skin integrity post-acute care: pressure ulcer/injury', 'How often physician-recommended actions to address medication issues were completely timely', 'Percent of Residents Experiencing One or More Falls with Major Injury', 'Discharge Function Score', 'DTC Observed Rate', 'DTC_Rate', 'DTC Risk-Standardized Rate (Lower Limit)', 'DTC Risk-Standardized Rate (Upper Limit)', 'DTC Performance Categorization', 'PP

Unnamed: 0,State,Provider,Ownership,Quality_Star,How often the home health team began their patients' care in a timely manner,How often the home health team determined whether patients received a flu shot for the current flu season,How often patients got better at walking or moving around,How often patients got better at getting in and out of bed,How often patients got better at bathing,How often patients' breathing improved,...,PPR Risk-Standardized Rate (Lower Limit),PPR Risk-Standardized Rate (Upper Limit),PPR Performance Categorization,PPH Observed Rate,PPH_Rate,PPH Risk-Standardized Rate (Lower Limit),PPH Risk-Standardized Rate (Upper Limit),PPH Performance Categorization,Medicare_Cost_Ratio,Episode_Count
0,AK,PROVIDENCE HOME HEALTH ALASKA,NON-PROFIT,4.5,91.3,42.3,91.7,94.5,96.2,100.0,...,3.11,5.09,Same As National Rate,10.11,8.37,6.56,10.49,Same As National Rate,0.96,1345.0
2,AK,FAIRBANKS MEMORIAL HOSPITAL HHA,NON-PROFIT,2.5,84.9,53.9,74.4,84.7,82.6,89.5,...,2.61,5.05,Same As National Rate,5.97,7.43,4.41,11.42,Same As National Rate,0.77,439.0
3,AK,ANCORA HOME HEALTH & HOSPICE,PROPRIETARY,4.0,69.7,48.1,87.0,92.7,93.2,93.4,...,3.56,5.25,Same As National Rate,6.32,6.85,5.42,8.59,Better Than National Rate,0.85,2406.0
4,AK,PETERSBURG MEDICAL CENTER HOME,GOVERNMENT OPERATED,1.5,84.7,76.9,70.1,75.2,64.7,72.8,...,2.62,5.86,Same As National Rate,8.06,11.06,6.71,18.52,Same As National Rate,1.09,232.0
5,AK,SOUTH PENINSULA HOSPITAL HHA,GOVERNMENT OPERATED,2.5,97.8,45.7,72.4,84.9,78.8,72.5,...,2.5,5.31,Same As National Rate,7.2,8.82,5.71,13.64,Same As National Rate,1.23,335.0


**There are now 5,797 observations (rows) and 32 attributes (columns), and `isnull()` reports no missing values.**
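The row count drops from 12,069 to 5,797, so it is worth knowing which features account for the loss. The sketch below shows one way to audit this, assuming it is run after Step 3 and before the `dropna` in Step 4; the toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `data` after Step 3 (assumed values)
toy = pd.DataFrame({
    "Quality_Star": [4.5, None, 2.5, 3.0],
    "PPH_Rate": [8.37, None, 7.43, None],
    "Medicare_Cost_Ratio": [0.96, 0.71, 0.77, 0.85],
})
core_cols = ["Quality_Star", "PPH_Rate", "Medicare_Cost_Ratio"]

# Missing count per core feature, largest first
missing_per_col = toy[core_cols].isnull().sum().sort_values(ascending=False)

# Share of rows that dropna(subset=...) would remove
rows_before = len(toy)
rows_after = len(toy.dropna(subset=core_cols))
share_dropped = 1 - rows_after / rows_before

print(missing_per_col)
print(f"Rows dropped: {rows_before - rows_after} of {rows_before} ({share_dropped:.0%})")
```

If a single feature is responsible for most of the dropped rows, that is a signal to reconsider whether it belongs in the core feature set.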

**Note:** Every row missing a value in any of the six core features was dropped, which reduced the dataset from 12,069 to 5,797 rows (about 52% removed). Columns outside that list were left as loaded.
The columns we’re using for modeling (Quality_Star, PPH_Rate, DTC_Rate, PPR_Rate, Medicare_Cost_Ratio, Episode_Count) are the core numerical features.
A missing value in one of these columns means that either the agency didn’t report a key metric or CMS didn’t have enough data to calculate it.
Incomplete rows cannot be used directly in unsupervised models like KMeans, because clustering relies on distance calculations (e.g., Euclidean) that are undefined when a value is missing. Imputation is the alternative, but imputing the missing values poorly may introduce bias or false similarity between agencies.
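The distance argument can be seen in a few lines of NumPy. This is a minimal sketch with made-up feature vectors and hypothetical column means, not values computed from the dataset.

```python
import numpy as np

# Two agencies described by three features; agency b is missing the second one
a = np.array([4.5, 8.37, 0.96])
b = np.array([2.5, np.nan, 0.77])

# The Euclidean distance is undefined: the NaN propagates through the norm
dist_with_nan = np.linalg.norm(a - b)
print(dist_with_nan)  # nan

# Filling the gap with a column mean (hypothetical means) yields a finite distance,
# but the filled coordinate is a guess, so the two agencies may look closer
# or farther apart than they really are
col_means = np.array([3.0, 8.0, 0.9])
b_imputed = np.where(np.isnan(b), col_means, b)
dist_imputed = np.linalg.norm(a - b_imputed)
print(dist_imputed)
```

The imputed distance depends entirely on the value chosen for the gap, which is the "false similarity" risk mentioned above.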

We will stop the exploratory analysis here, since we already know that the goal of the project is to build an unsupervised model.