# Cluster Census Block Groups

This notebook shows how to use unsupervised machine learning to cluster census block groups in California using American Community Survey (ACS) 2017-2021 5-year data.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
census_data = pd.read_csv('census_data.csv')
census_data.head(3)

## Define Features

In [None]:
features = [
    'B01002_001E_Estimate_Median_age_--_Total',
    'B01001_026E_Estimate_Total_Female_P',
    'B03002_003E_Estimate_Total_Not_Hispanic_or_Latino_White_alone_P',
    'B03002_004E_Estimate_Total_Not_Hispanic_or_Latino_Black_or_African_American_alone_P',
    'B03002_006E_Estimate_Total_Not_Hispanic_or_Latino_Asian_alone_P',
    'B03002_012E_Estimate_Total_Hispanic_or_Latino_P',
    'B99162_002E_Estimate_Total_Speak_only_English_P',
    'C17002_002E_Estimate_Total_Under_.50_P',
    'B19013_001E_Estimate_Median_household_income_in_the_past_12_months_(in_2021_inflation-adjusted_dollars)',
    'B25003_003E_Estimate_Total_Renter_occupied_P',
]

## Explore the Data

In [None]:
census_data[features].describe().apply(lambda s: s.apply('{0:.2f}'.format))

In [None]:
census_data[census_data['B01002_001E_Estimate_Median_age_--_Total'] == -666666666]['B01002_001E_Estimate_Median_age_--_Total'].value_counts()

In [None]:
census_data[census_data['B19013_001E_Estimate_Median_household_income_in_the_past_12_months_(in_2021_inflation-adjusted_dollars)'] == -666666666]['B19013_001E_Estimate_Median_household_income_in_the_past_12_months_(in_2021_inflation-adjusted_dollars)'].value_counts()

In [None]:
census_data[features].isna().sum()

Median income and median age both use the value -666666666 to flag missing estimates. The other features also contain null values, which make up a small percentage of the data. We replace the -666666666 flag with null in every feature so that these rows are handled appropriately when we impute missing values later.

In [None]:
census_data_cleaned = census_data.copy()
for i in features:
    census_data_cleaned.loc[census_data_cleaned[i] == -666666666, i] = np.nan
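
The sentinel replacement above can be checked on a small toy frame. This is a minimal sketch: the column names `age` and `income` and their values are made up for illustration and are not taken from `census_data.csv`. It uses `DataFrame.replace`, which is an alternative to the `.loc` loop rather than the notebook's own method.

```python
import numpy as np
import pandas as pd

SENTINEL = -666666666  # missing-value code seen in the age and income columns above

# Hypothetical toy data, not real ACS values
toy = pd.DataFrame({
    'age': [34.5, SENTINEL, 41.2],
    'income': [SENTINEL, 72000.0, 58000.0],
})

# DataFrame.replace swaps the sentinel for NaN in every column at once
toy_cleaned = toy.replace(SENTINEL, np.nan)

# Each toy column should now report exactly one missing value
missing_counts = toy_cleaned.isna().sum()
print(missing_counts)
```

Running `isna().sum()` after the replacement is a quick way to confirm that no sentinel values were left behind.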

To get a sense of how many clusters might be in the data, we will use principal component analysis (PCA) to summarize the data in two dimensions. We will use a random subset of the rows with complete data for this data visualization. First, we will normalize the features to be on the same scale. Then, we will use PCA to reduce the data to two principal components, and visualize the two components in a scatterplot.

In [None]:
# Keep only rows with complete data for the features used in the analysis
census_data_pca = census_data_cleaned.dropna(subset=features)
census_data_pca = census_data_pca.sample(n=1000, random_state=1)

In [None]:
scaled_features = MinMaxScaler().fit_transform(census_data_pca[features])

In [None]:
pca = PCA(n_components=2).fit(scaled_features)
features_2d = pca.transform(scaled_features)

In [None]:
plt.scatter(features_2d[:,0],features_2d[:,1])
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Data')
plt.show()

There are potentially multiple clusters in the data, but it is difficult to tell based on visual inspection alone. Cluster analysis will help identify any potential clusters.
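
One way to judge how much a two-dimensional scatterplot can reveal is to check the share of variance the two components retain. The sketch below illustrates this on random stand-in data, since the real feature matrix is not available here; `demo_features` and the other `demo_` names are hypothetical. The `explained_variance_ratio_` attribute is part of scikit-learn's fitted `PCA` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)

# Hypothetical stand-in for the sampled feature matrix: 1000 rows, 10 features
demo_features = rng.normal(size=(1000, 10))
demo_scaled = MinMaxScaler().fit_transform(demo_features)

demo_pca = PCA(n_components=2).fit(demo_scaled)

# Fraction of the total variance captured by each retained component
explained = demo_pca.explained_variance_ratio_
print(explained, explained.sum())
```

In the notebook itself, the same attribute is available on the fitted `pca` object. A low total would suggest that the scatterplot hides much of the structure in the data.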

## Preprocess the Data

We will fill missing values with each feature's median. To prevent features with larger scales (such as median household income) from dominating the cluster results, we will standardize every feature to zero mean and unit variance.

In [None]:
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(census_data_cleaned[features])

In [None]:
scaler = StandardScaler()
X_imputed_scaled = scaler.fit_transform(X_imputed)
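
The two preprocessing steps can also be chained into a single object. The following is a minimal sketch, assuming scikit-learn's `make_pipeline` helper; the toy matrix `X_toy` is invented for illustration, with one column on an age-like scale and one on an income-like scale.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: column 0 resembles an age, column 1 an income
X_toy = np.array([
    [30.0, 50000.0],
    [40.0, np.nan],
    [np.nan, 90000.0],
    [50.0, 70000.0],
])

# Median imputation followed by standardization, applied in one call
preprocess = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
X_toy_scaled = preprocess.fit_transform(X_toy)

# After scaling, both columns have mean 0 and standard deviation 1
print(X_toy_scaled.mean(axis=0))
print(X_toy_scaled.std(axis=0))
```

The notebook keeps the steps separate because the imputed but unscaled values in `X_imputed` are reused later to describe the clusters in their original units.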

## Run a Clustering Algorithm

We will run K-means models with 1 to 10 clusters and plot the within-cluster sum of squares (WCSS) for each to identify a suitable number of clusters.

In [None]:
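wcss = []
for k in range(1, 11):
    # Fix n_init and random_state so the WCSS curve is reproducible across runs
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=1)
    kmeans.fit(X_imputed_scaled)
    wcss.append(kmeans.inertia_)
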
plt.plot(range(1, 11), wcss)
plt.title('WCSS by Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

It looks like there is an "elbow" inflection point around 2 or 3 clusters. We will proceed with 3 clusters.
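
When the elbow is ambiguous, the silhouette score can serve as a second opinion. This is not part of the original analysis; it is a sketch on synthetic data, assuming scikit-learn's `silhouette_score` and `make_blobs`. The blob centers and the `X_demo` name are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs in 10 dimensions
centers = [[0.0] * 10, [10.0] * 10, [-10.0] * 10]
X_demo, _ = make_blobs(n_samples=600, centers=centers,
                       cluster_std=1.0, random_state=1)

silhouette_by_k = {}
for k in range(2, 7):  # the silhouette score is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

# The candidate with the highest mean silhouette score
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k, best_k)
```

The toy data was built with three blobs, so a peak at three clusters is by design. On the census features, the same loop could be run on `X_imputed_scaled`, ideally on a sample, because the score compares every pair of rows.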

In [None]:
model = KMeans(n_clusters=3, init='k-means++', n_init=100, max_iter=1000)
km_clusters = model.fit_predict(X_imputed_scaled)
pd.DataFrame(km_clusters).value_counts()

## Describe the Clusters

Now we will visualize the features by cluster. This will help with describing the characteristics of each cluster.

In [None]:
census_data_clustered = pd.concat([pd.DataFrame(X_imputed), pd.DataFrame(km_clusters)],axis=1)
census_data_clustered.columns = features + ['Cluster']

In [None]:
for feature in features:
    means = census_data_clustered.groupby('Cluster')[feature].mean()
    ax = means.plot(kind='bar', figsize=(3,2), color='lightgray')
    plt.xlabel('Cluster')
    plt.ylabel('Mean')
    plt.title(feature)
    # Label each bar with its mean value
    for position, value in enumerate(means):
        ax.text(position, value*.9, f"{value:.2f}", ha='center')
    plt.show()

- Cluster 0 has the highest average age, percent White, and percent speaking only English, and the lowest average percent renter-occupied housing units.

- Cluster 1 has the highest average percent Black, percent Hispanic, percent below 50% of the poverty level, and percent renter-occupied housing units, and the lowest average age, percent White, percent speaking only English, and median household income.

- Cluster 2 has the highest average percent Asian and median household income, and relatively low average percent White, percent Black, percent Hispanic, and percent speaking only English.
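
Comparisons like these can also be read from a single table of cluster means. The sketch below uses a small invented frame, `toy_clustered`, as a stand-in for `census_data_clustered`; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy stand-in for census_data_clustered
toy_clustered = pd.DataFrame({
    'median_age': [45.0, 47.0, 30.0, 32.0, 38.0, 40.0],
    'pct_renter': [0.20, 0.30, 0.70, 0.80, 0.40, 0.50],
    'Cluster': [0, 0, 1, 1, 2, 2],
})

# One row per cluster, one column per feature
profile = toy_clustered.groupby('Cluster').mean()

# Cluster label with the highest and lowest mean for each feature
highest = profile.idxmax()
lowest = profile.idxmin()

print(profile.round(2))
print(highest)
print(lowest)
```

Applied to the real data, `idxmax` and `idxmin` would list which cluster ranks highest and lowest on each feature, which can be checked against the bar charts.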

## Export the Data

Save the clusters for use in other analyses.

In [None]:
census_data_clustered_id = pd.concat([census_data_cleaned['GEO_ID_Geography'], census_data_clustered], axis = 1)

In [None]:
census_data_clustered_id.to_csv('census_data_clustered.csv', index=False)

## Useful Resources

https://learn.microsoft.com/en-us/training/modules/train-evaluate-cluster-models/3-exercise-model

https://learn.microsoft.com/en-us/training/modules/train-evaluate-cluster-models/5-exercise-new-models