## Task 6.5: Unsupervised Machine Learning: Clustering

Part 1. Importing and data checking. 

Part 2. Elbow curve

Part 3. Clustering

Part 4. Updated conclusion

### Part 1
#### Importing libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import os
import sklearn
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
print("All libraries are ok")

In [None]:
# This option ensures that the graphs you create are displayed within the notebook without the need to "call" them specifically.

%matplotlib inline

In [None]:
# Set the path to the input file

file = '/Users/emilsafarov/Library/CloudStorage/OneDrive-Personal/CF/A6_Healthcare_Access_US/1_sourceing_open_data/US project/Data/Output/final_dataset_v3.csv'

# reading the file
df = pd.read_csv(file)

# checking the file

df.head(5)

To explore the relationship visually, a unified measure of healthcare access was created. Five indicators (Primary Care Physicians, Mental Health Providers, Dentists, Flu Vaccination, and Mammography Screening) were standardized using a z-score transformation. This rescales each variable to a mean of zero and a standard deviation of one, which makes indicators measured in different units and scales comparable. Standardization removes scale bias, so the policy-aligned weights applied in the next step, rather than the original units, determine how much each indicator contributes to the overall score.
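
As a quick check of the transformation described above, the sketch below applies the z-score formula by hand and compares it with `scipy.stats.zscore`. The column names and values are hypothetical toy data, not taken from the project dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy values, not taken from the project dataset
toy = pd.DataFrame({
    "physicians_per_100k": [40.0, 55.0, 70.0, 85.0, 100.0],
    "flu_vaccination_pct": [30.0, 42.0, 45.0, 48.0, 60.0],
})

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# scipy's zscore also defaults to ddof=0, so both results should agree
via_scipy = toy.apply(zscore)

print(manual.round(3))
print("Column means:", manual.mean().round(6).tolist())
print("Column stds (ddof=0):", manual.std(ddof=0).round(6).tolist())
```

Note that pandas' own `.std()` defaults to `ddof=1`, so it has to be called with `ddof=0` to reproduce the same numbers.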

In [None]:
healthcare_columns = [
    # 🏥 Healthcare Access & Resources
    "primary_care_physicians__primary_care_physicians_mean",
    "mental_health_providers_raw_val__mental_health_providers_mean",
    "dentists_raw_value__dentists_mean",
    "flu_vaccinations_raw_value__flu_vaccinations_mean",
    "mammography_screening__mammography_screening_mean"]

# Standardize healthcare-related variables using z-score transformation
# Z-score formula: z = (x - mean) / standard deviation
# This rescales each variable to have a mean of 0 and a standard deviation of 1,
# ensuring comparability across different units (e.g., per capita counts vs. percentages)

from scipy.stats import zscore

# Apply z-score standardization to each column
df_scaled_healthcare = df[healthcare_columns].apply(zscore)

# (Optional) Rename columns to indicate they are standardized
df_scaled_healthcare.columns = [col + " (z-score)" for col in df_scaled_healthcare.columns]

In [None]:
# Define policy-aligned weights for each z-scored healthcare indicator
# Values chosen based on public health emphasis (e.g., WHO, CDC, AHRQ priorities)

weights = {
    "primary_care_physicians__primary_care_physicians_mean (z-score)": 0.30,   # Core access point in all health systems
    "mental_health_providers_raw_val__mental_health_providers_mean (z-score)": 0.25,  # Underserved and critical post-pandemic
    "dentists_raw_value__dentists_mean (z-score)": 0.20,  # Preventive care with medium priority
    "flu_vaccinations_raw_value__flu_vaccinations_mean (z-score)": 0.15,  # Population-level disease prevention
    "mammography_screening__mammography_screening_mean (z-score)": 0.10  # Targeted screening (age and gender-specific)
}

# Apply the weights to each standardized column
df_weighted_healthcare = df_scaled_healthcare.copy()
for col in df_weighted_healthcare.columns:
    df_weighted_healthcare[col] = df_weighted_healthcare[col] * weights[col]

# Calculate the final composite score (weighted average of z-scores)
df["Healthcare Access (Weighted Z-score Composite)"] = df_weighted_healthcare.sum(axis=1)
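
The weighted sum above can also be written without an explicit loop. The following is a minimal sketch on hypothetical z-scores and hypothetical weights (three indicators instead of five); it also checks that every column has a weight and that the weights sum to 1, which is what makes the result a weighted average.

```python
import numpy as np
import pandas as pd

# Hypothetical z-scored indicators for three counties (illustrative values only)
z = pd.DataFrame({
    "physicians_z": [1.2, -0.4, -0.8],
    "mental_health_z": [0.5, 0.1, -0.6],
    "dentists_z": [0.9, -0.2, -0.7],
})

# Hypothetical weights; they must cover every column and sum to 1
w = pd.Series({"physicians_z": 0.5, "mental_health_z": 0.3, "dentists_z": 0.2})

assert set(w.index) == set(z.columns), "every column needs a weight"
assert np.isclose(w.sum(), 1.0), "weights should sum to 1"

# DataFrame.mul aligns the Series index with the column names,
# so each column is multiplied by its own weight before summing per row
composite = z.mul(w, axis=1).sum(axis=1)
print(composite.round(3))
```

Because `mul` aligns on column names, a misspelled key in the weights would produce NaN values instead of raising a `KeyError`, which is why the explicit coverage check is included.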


In [None]:
# Creating a sub-dataset under the main df 
# Organized into thematic categories: Healthcare Access, Socioeconomic, Environmental

selected_columns = [
    # Healthcare Access & Resources
    "Healthcare Access (Weighted Z-score Composite)",                # Dependent (y) variable
    "uninsured_raw_value__uninsured_mean",                          # Included cautiously; typically an outcome variable

    # Socioeconomic Predictors
    "median_income__median_income_mean",
    "children_poverty__children_poverty_mean",
    "unemployment__unemployment_mean",
    "income_inequality__income_inequality_mean",
    "Diversity Index",

    # Environmental & Lifestyle Predictors
    "exercise_access__exercise_access_mean",
    "food_insecurity__food_insecurity_mean"  # Included for now; will monitor collinearity with Child Poverty
]

# Rename columns for better readability
df_healthcare_predictors = df[selected_columns].rename(columns={
    "Healthcare Access (Weighted Z-score Composite)": "Healthcare Access Score",
    "uninsured_raw_value__uninsured_mean": "Uninsured",
    "median_income__median_income_mean": "Median Income",
    "children_poverty__children_poverty_mean": "Child Poverty",
    "unemployment__unemployment_mean": "Unemployment",
    "income_inequality__income_inequality_mean": "Income Inequality",
    "Diversity Index": "Diversity Index",
    "exercise_access__exercise_access_mean": "Exercise Access",
    "food_insecurity__food_insecurity_mean": "Food Insecurity"
})

df_healthcare_predictors.reset_index(drop=True, inplace=True)

df_healthcare_predictors.head() 

In [None]:
# Standardizing the selected healthcare-related predictors to ensure all features 
# contribute equally to the k-means algorithm. This is necessary because k-means 
# relies on distance calculations, and unscaled variables with larger ranges 
# (like income) can bias the results.

from sklearn.preprocessing import StandardScaler

# Step 1: Define the features to scale
features = ['Uninsured', 'Median Income', 'Child Poverty', 'Unemployment',
            'Income Inequality', 'Diversity Index', 'Exercise Access', 'Food Insecurity']

# Step 2: Initialize the scaler
scaler = StandardScaler()

# Step 3: Fit and transform the selected features
scaled_features = scaler.fit_transform(df_healthcare_predictors[features])

# Step 4: Convert back to DataFrame and include 'Healthcare Access Score'
df_scaled = pd.DataFrame(scaled_features, columns=features)
df_scaled['Healthcare Access Score'] = df_healthcare_predictors['Healthcare Access Score'].values

# df_hps (short for df_healthcare_predictors_scaled) combines the scaled predictors in df_scaled
# with the existing "Healthcare Access Score" column from the df_healthcare_predictors DataFrame
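
`StandardScaler` performs the same transformation as the z-score step used earlier for the healthcare indicators. The sketch below illustrates this on a small hypothetical feature matrix (income in dollars next to a percentage); the numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: income in dollars next to a percentage
X = np.array([
    [42000.0, 12.5],
    [55000.0, 9.0],
    [61000.0, 7.5],
    [38000.0, 15.0],
])

# StandardScaler centers each column and divides by the population std (ddof=0)
scaled = StandardScaler().fit_transform(X)

# scipy's zscore uses the same definition by default
z = zscore(X, axis=0)

same_result = np.allclose(scaled, z)
print("StandardScaler matches scipy zscore:", same_result)
```

After scaling, both columns have unit variance, so neither dominates the Euclidean distances that k-means relies on.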

In [None]:
df_hps = df_scaled
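
Plain assignment does not copy a DataFrame; it binds a second name to the same object. In this notebook that means columns added later through `df_hps` also appear in `df_scaled`. A minimal sketch with a hypothetical toy frame shows the difference between an alias and a copy:

```python
import pandas as pd

base = pd.DataFrame({"a": [1, 2, 3]})

alias = base                # second name for the same object
independent = base.copy()   # separate object with its own data

alias["b"] = [10, 20, 30]   # adding a column through the alias...

print(list(base.columns))         # ...also changes base: ['a', 'b']
print(list(independent.columns))  # the copy is unaffected: ['a']
```

Whether sharing is desirable depends on the step: it keeps both names in sync, while `.copy()` would be the choice if the two frames were meant to diverge.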

In [None]:
df_hps.head()

In [None]:
df_hps.corr().round(3)

In [None]:
# Step 1: Manually input correlation values (from df_hps.corr().round(3))
corr_income = 0.419
corr_exercise = 0.553

# Step 2: Normalize the correlation values to get weights
total = corr_income + corr_exercise
w_income = corr_income / total     # ≈ 0.431
w_exercise = corr_exercise / total # ≈ 0.569

# Step 3: Create the weighted combination of the two predictors
df_hps["Combined Score: Income + Exercise"] = (
    w_income * df_hps['Median Income'] +
    w_exercise * df_hps['Exercise Access']
)
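
The manually entered correlations could instead be read from the data, which avoids transcription errors if the dataset changes. The sketch below uses synthetic stand-in columns (generated with a fixed seed, not the project data) and assumes both correlations are positive, as they are in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the scaled columns (not the project data)
income = rng.normal(size=n)
exercise = 0.5 * income + rng.normal(size=n)
access = 0.4 * income + 0.6 * exercise + rng.normal(size=n)
demo = pd.DataFrame({
    "Median Income": income,
    "Exercise Access": exercise,
    "Healthcare Access Score": access,
})

predictors = ["Median Income", "Exercise Access"]

# Correlation of each predictor with the target, computed from the data
corrs = demo[predictors].corrwith(demo["Healthcare Access Score"])

# Normalize so the weights sum to 1 (assumes both correlations are positive)
weights = corrs / corrs.sum()

# Weighted combination of the two predictors
demo["Combined Score"] = demo[predictors].mul(weights, axis=1).sum(axis=1)
print(weights.round(3))
```

With the values used in the notebook, the same normalization gives 0.419 / 0.972 ≈ 0.431 and 0.553 / 0.972 ≈ 0.569.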


### Part 2 
#### Elbow curve 

In [None]:
from sklearn.cluster import KMeans

# Step 1: Define the range of cluster numbers to try
num_cl = range(1, 10)

# Step 2: Create a list of KMeans models, one for each k
kmeans = [KMeans(n_clusters=i, random_state=42) for i in num_cl]

# Step 3: Fit each model to your scaled healthcare data
score = [kmeans[i].fit(df_scaled).score(df_scaled) for i in range(len(kmeans))]


In [None]:
score

In [None]:
import matplotlib.pyplot as plt

plt.plot(num_cl, score)
plt.xlabel("Number of Clusters")
plt.ylabel("Score")
plt.title("Elbow Curve")
plt.show()
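
The "Score" plotted on the elbow curve is the negative of the k-means objective (the within-cluster sum of squared distances, exposed as `inertia_`). The sketch below, run on synthetic blobs rather than the project data, collects both quantities and prints the relative drop in inertia for each additional cluster as one way to read the elbow numerically. The data and the `n_init=10` setting are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups (an assumption for illustration, not project data)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)

ks = list(range(1, 10))
inertias, scores = [], []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances
    scores.append(model.score(X))    # negative of the same objective

# Relative drop in inertia when moving from k-1 to k clusters;
# the elbow is the k after which these drops become small
for i in range(1, len(ks)):
    drop = (inertias[i - 1] - inertias[i]) / inertias[i - 1]
    print(f"k={ks[i]}: inertia falls by {drop:.1%}")
```

Reading the drops as percentages can make the choice between neighboring values of k less dependent on the visual scale of the plot.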


### Part 3 
#### Clustering

In [None]:
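# Create the k-means object with k = 5, the value suggested by the elbow curve (k = 6 is also plausible)
# random_state is fixed so that the cluster labels are reproducible between runs

kmeans = KMeans(n_clusters=5, random_state=42)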

In [None]:
# fitting the object to df

kmeans.fit(df_hps)

In [None]:
df_hps["Clusters"] = kmeans.labels_  # labels from the fit above; avoids fitting the model a second time
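
K-means labels depend on the random initialization, so the same data can receive differently numbered clusters on different runs. The sketch below, on synthetic blobs (not the project data), shows that a single fit already exposes the labels through `labels_`, and that refitting with the same `random_state` reproduces them. The seed value and `n_init=10` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

# One fit is enough: labels_ holds the assignment for the training data
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
labels_from_fit = model.labels_

# fit_predict refits from scratch; with the same random_state it reproduces the labels
labels_refit = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

print("Identical labels:", np.array_equal(labels_from_fit, labels_refit))
```

Stable labels matter here because the later steps attach fixed color names to specific cluster numbers.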

In [None]:
df_hps.head(10)

In [None]:
df_hps["Clusters"].value_counts()

In [None]:
# Plotting different combinations

# Healthcare Access Score vs Median Income

plt.figure(figsize=(12, 8))

# Scatterplot for Healthcare Access vs Median Income
ax = sns.scatterplot(
    y=df_hps['Healthcare Access Score'],
    x=df_hps['Median Income'],
    hue=df_hps['Clusters'],  # uses the cluster labels from KMeans
    palette='Set1',         # optional: improves color distinction
    s=40                   # size of the points
)

ax.grid(False)  # Removes the background grid

# Axis labels
plt.ylabel('Healthcare Access Score')
plt.xlabel('Median Income')
plt.title('Clusters: Healthcare Access vs. Median Income')

plt.show()


In [None]:
plt.figure(figsize=(12, 8))

# Scatterplot for Healthcare Access vs Exercise Access
ax = sns.scatterplot(
    y=df_hps['Healthcare Access Score'],
    x=df_hps['Exercise Access'],
    hue=df_hps['Clusters'],  # cluster labels
    palette='Set1',
    s=40
)

ax.grid(False)

# Axis labels
plt.ylabel('Healthcare Access Score')
plt.xlabel('Exercise Access')
plt.title('Clusters: Healthcare Access vs. Exercise Access')

plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))

# Scatterplot of Combined Score vs. Healthcare Access Score
sns.scatterplot(
    x=df_hps['Combined Score: Income + Exercise'],
    y=df_hps['Healthcare Access Score'],
    hue=df_hps['Clusters'],  # Optional: only if clustering labels are available
    palette='Set1',
    s=30,
    alpha=0.5  # Lower opacity
)

plt.title("Healthcare Access vs. Combined Income & Exercise Score")
plt.xlabel("Combined Score (Income + Exercise Access)")
plt.ylabel("Healthcare Access Score")
plt.grid(False)
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(10, 6))

# Scatterplot with regression line
sns.regplot(
    x=df_hps['Combined Score: Income + Exercise'],
    y=df_hps['Healthcare Access Score'],
    scatter_kws={'s': 40, 'color': 'gray'},
    line_kws={'color': 'blue', 'linewidth': 2}
)

plt.title("Linear Relationship: Combined Score vs. Healthcare Access")
plt.xlabel("Combined Score (Income + Exercise Access)")
plt.ylabel("Healthcare Access Score")
plt.grid(False)
plt.tight_layout()
plt.show()


In [None]:
df_hps.loc[df_hps['Clusters'] == 4, 'Cluster_color'] = 'Orange'
df_hps.loc[df_hps['Clusters'] == 3, 'Cluster_color'] = 'Purple'
df_hps.loc[df_hps['Clusters'] == 2, 'Cluster_color'] = 'Green'
df_hps.loc[df_hps['Clusters'] == 1, 'Cluster_color'] = 'Blue'
df_hps.loc[df_hps['Clusters'] == 0, 'Cluster_color'] = 'Red'
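
The five assignments above can be expressed as a single dictionary lookup. A minimal sketch, using a hypothetical six-row frame with the same label-to-color pairing:

```python
import pandas as pd

# Hypothetical cluster labels for six rows
demo = pd.DataFrame({"Clusters": [0, 3, 1, 4, 2, 0]})

# Same label-to-color pairing as the assignments above
cluster_colors = {0: "Red", 1: "Blue", 2: "Green", 3: "Purple", 4: "Orange"}

# Series.map looks each label up in the dictionary; unknown labels become NaN
demo["Cluster_color"] = demo["Clusters"].map(cluster_colors)
print(demo)
```

Keeping the pairing in one dictionary also makes it easier to reuse the same mapping when choosing plot colors.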

In [None]:
grouped_df = df_hps.groupby('Cluster_color').agg({
    'Uninsured': ['mean'],
    'Median Income': ['mean'],
    'Child Poverty': ['mean'],
    'Unemployment': ['mean'],
    'Income Inequality': ['mean'],
    'Diversity Index': ['mean'],
    'Exercise Access': ['mean'],
    'Food Insecurity': ['mean'],
    'Healthcare Access Score': ['mean']
})


In [None]:
grouped_df = grouped_df.round(2)
grouped_df
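
When only the mean is needed, `groupby(...).mean()` gives the same numbers with single-level column names, so no later step is required to drop the extra `mean` level. The sketch below compares both forms on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical scaled values for two clusters
demo = pd.DataFrame({
    "Cluster_color": ["Red", "Red", "Blue", "Blue"],
    "Uninsured": [0.4, 0.6, -0.5, -0.3],
    "Median Income": [-1.0, -1.2, 0.2, 0.4],
})

# agg with lists produces two-level columns: (indicator, "mean")
nested = demo.groupby("Cluster_color").agg(
    {"Uninsured": ["mean"], "Median Income": ["mean"]}
)

# Plain .mean() returns the same numbers with single-level column names
flat = demo.groupby("Cluster_color").mean().round(2)

print(nested.columns.nlevels, flat.columns.nlevels)
print(flat)
```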

In [None]:
# Only keep the 'mean' level from the multi-level columns
mean_df = grouped_df.xs('mean', axis=1, level=1)

# Calculate correlation matrix
corr_matrix = mean_df.corr()

# Plot the heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    square=True,
    cbar_kws={"shrink": 0.8}
)

plt.title("Correlation Matrix of Cluster Means", fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


In [None]:
# Define the updated grouped_df using actual values
grouped_df = pd.DataFrame({
    'Uninsured': [-0.397902, -0.543269, 1.203235, -0.439616, 0.419255],
    'Median Income': [0.260081, 1.539922, -0.448048, -0.372245, -1.150279],
    'Child Poverty': [-0.682097, -0.985769, 0.488263, 0.167991, 1.769116],
    'Unemployment': [-0.730302, -0.196805, -0.225307, 0.584774, 1.276333],
    'Income Inequality': [-0.679115, -0.277183, 0.189672, 0.049841, 1.582620],
    'Diversity Index': [-0.754979, 0.659012, 0.623806, -0.455556, 0.745499],
    'Exercise Access': [-0.186464, 1.142738, -0.431512, 0.173656, -0.649840],
    'Food Insecurity': [-0.773205, -0.895123, 0.476061, 0.314980, 1.598664],
    'Healthcare Access Score': [-0.054132, 0.760178, -0.349304, 0.040969, -0.342689]
}, index=['Blue', 'Green', 'Orange', 'Purple', 'Red'])

# Define cluster color map
color_map = {
    'Red': '#e41a1c',
    'Green': '#4daf4a',
    'Orange': '#ff7f00',
    'Purple': '#984ea3',
    'Blue': '#377eb8'
}

# Plot radar chart with fixed color legend
def plot_radar(df, title='Radar Chart of Cluster Means'):
    labels = df.columns
    num_vars = len(labels)
    angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
    angles += angles[:1]

    fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))

    for index, row in df.iterrows():
        values = row.tolist() + [row.tolist()[0]]
        ax.plot(angles, values, label=index, color=color_map[index])
        ax.fill(angles, values, color=color_map[index], alpha=0.1)

    ax.set_title(title, size=14)
    ax.set_theta_offset(np.pi / 2)
    ax.set_theta_direction(-1)
    ax.set_thetagrids(np.degrees(angles[:-1]), labels)
    ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.1))
    plt.tight_layout()
    plt.show()

# Plot bar chart with matching colors
def plot_bar(df, title='Bar Chart of Cluster Means'):
    df.T.plot(kind='bar', figsize=(12, 6), color=[color_map[c] for c in df.index])
    plt.title(title)
    plt.ylabel('Mean Z-score')
    plt.xlabel('Indicators')
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Cluster Color')
    plt.tight_layout()
    plt.show()

# Run both plots
plot_radar(grouped_df)
plot_bar(grouped_df)


In [None]:
grouped_df.head()

#### Observations from the K-Means Unsupervised Method
The elbow method indicated that the optimal number of clusters is k = 5. However, for simplicity and easier interpretation of the cluster plots, using four clusters could also be considered. While this reduces complexity, it may slightly decrease the analytical precision.

Two sets of cluster plots were created to explore the relationships across the five clusters. These visualizations show patterns between Healthcare Access vs. Median Income (which reveals a clearer linear trend) and Healthcare Access vs. Exercise Access.

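Given that the dataset is clean and reliable, mean values were used to construct the cluster profile table. This table summarizes the dimensions of each cluster. In addition, a correlation matrix was generated based on cluster means, revealing consistent relationships between several indicators. Because these correlations are computed across only five cluster means, averaging removes most county-level variation, so they are generally larger in magnitude than the county-level correlations and describe the cluster profiles rather than a stronger underlying relationship.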

To visually characterize each cluster, radar and bar charts were extremely effective and intuitive. The radar charts illustrated the overall shape and strengths of each cluster across all indicators, while the bar charts offered a clear, side-by-side comparison. Together, these visual tools made the cluster profiles easy to understand and communicate.

| **Variables (correlation with Healthcare Access Score)** | **Corr. (Supervised Regression)** | **Corr. (Unsupervised K-Means)** |
| -------------------------------------------------------- | -------------------------------- | --------------------------------- |
| Median Income                                            | 0.42                             | 0.85                              |
| Exercise Access                                          | 0.55                             | 0.99                              |
| Food Insecurity                                          | -0.31                            | -0.69                             |
| Uninsured                                                | -0.38                            | -0.65                             |
| Child Poverty                                            | -0.37                            | -0.71                             |
| Unemployment                                             | -0.07                            | -0.27                             |
| Income Inequality                                        | -0.02                            | -0.46                             |
| Diversity Index                                          | 0.09                             | 0.09                              |
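
The gap between the two columns of the table is largely a consequence of aggregation: correlating a handful of group means removes most of the row-level noise. The sketch below illustrates this on synthetic data with a deliberately moderate relationship. Quintiles of `x` stand in for clusters here, which is an assumption made to keep the example short; it is not the k-means grouping used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

# Synthetic row-level data with a deliberately moderate relationship
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)
demo = pd.DataFrame({"x": x, "y": y})

# Correlation across all individual rows
row_level_r = demo["x"].corr(demo["y"])

# Five groups (quintiles of x stand in for clusters), then correlate the group means
demo["group"] = pd.qcut(demo["x"], q=5, labels=False)
group_means = demo.groupby("group")[["x", "y"]].mean()
group_level_r = group_means["x"].corr(group_means["y"])

print(f"row-level r = {row_level_r:.2f}, r across 5 group means = {group_level_r:.2f}")
```

For this reason the two columns answer different questions and are not directly comparable as measures of strength.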



#### Clusters' characteristics

🔴 Red Cluster – High Risk, Underserved
The Red Cluster represents the most vulnerable counties in the dataset. These areas are marked by deep socioeconomic challenges, including widespread child poverty, high unemployment, and elevated levels of food insecurity and income inequality. Residents in these counties typically experience the lowest median incomes and severely limited access to healthcare services. This cluster reflects communities in urgent need of targeted support and policy intervention.

🟩 Green Cluster – Prosperous & Well-Served
The Green Cluster represents high-performing counties across the United States. These areas stand out for their strong socioeconomic and health profiles, with notably high median incomes, excellent healthcare access, and widespread exercise opportunities. Additionally, they exhibit low rates of food insecurity, child poverty, and uninsured individuals. Overall, this cluster reflects well-resourced communities with strong infrastructure and access to essential services—making them clear leaders in terms of community well-being.

🟧 Orange Cluster – Underinsured but Not Poor
The Orange Cluster includes counties where residents may not be severely deprived economically but still face serious gaps in healthcare access. These areas are defined by high uninsured rates and below-average healthcare access, with income levels that are only moderately below average. The population may include working individuals without employer-sponsored insurance, possibly reflecting communities with large informal labor sectors or high immigrant populations.

🟣 Purple Cluster – Working-Age Unstable
The Purple Cluster is made up of counties with a mixed profile. While not facing the extremes seen in the Red Cluster, these areas still struggle with pockets of economic instability. They tend to have higher-than-average unemployment and moderately low income levels. These communities may represent younger or more diverse working-class populations navigating financial uncertainty, with uneven access to resources.

🔵 Blue Cluster – Balanced, Slightly Struggling
The Blue Cluster captures counties that fall near the average across most indicators. These areas show slightly lower income levels and healthcare access, but don’t display severe disadvantages. They may represent transitional or mid-tier communities—neither thriving nor in crisis—making them a useful reference point for evaluating broader trends.


#### Conclusion

Based on the analysis using both supervised regression and unsupervised K-means clustering, the findings suggest meaningful patterns in the relationship between key socioeconomic factors and healthcare access across U.S. counties.

However, to strengthen the validity of the patterns observed through unsupervised machine learning, additional statistical methods may be necessary. While the current results suggest a weak to moderate positive relationship between key variables and healthcare access, further testing could help raise the confidence level in the alternative hypothesis and potentially reveal a strong positive relationship. Incorporating additional predictors—such as educational attainment, healthcare provider density, or regional policy factors—could also enhance the model's explanatory power and provide a more comprehensive understanding of the determinants of healthcare access.

#### Current Hypothesis
#### Null Hypothesis (H₀):
There is no statistically significant relationship between median income, exercise access, or food insecurity and regional healthcare access.

#### Alternative Hypothesis (H₁):
Median income and exercise access each have a statistically significant, weak to moderate positive relationship with overall healthcare access across all U.S. counties.
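
One way to test hypotheses of this form is a Pearson correlation test, which returns both the strength of the linear relationship and a p-value. The sketch below runs it on synthetic stand-in variables, not the project data; the significance level of 0.05 is a conventional choice assumed here.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n = 1000

# Synthetic stand-ins for a predictor and the access score (not project data)
predictor = rng.normal(size=n)
access_score = 0.4 * predictor + rng.normal(size=n)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(predictor, access_score)

alpha = 0.05  # conventional significance level, chosen here as an assumption
reject_null = p_value < alpha
print(f"r = {r:.3f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

With several predictors tested at once, a correction for multiple comparisons would usually be applied before drawing conclusions.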