# Clustering Census Data with Amazon DenseClus

---


In this demo notebook, we demonstrate how to use [Amazon DenseClus](https://github.com/awslabs/amazon-denseclus) for mixed-type clustering. Mixed-type clustering is the task of grouping unlabeled data that contains both numerical and categorical features into clusters, so that points within a cluster are similar to each other and dissimilar from points in other clusters.

1. [Initial Setup](#Initial-Setup)
2. [Load Data](#load-data)
3. [Why DenseClus?](#build-clusters-with-numeric-and-categorical-features-separately)
4. [DenseClus in Action](#denseclus-on-all-features-numerical--categorical)
5. [Analysis](#Analysis)
6. [Conclusion](#Conclusion)

## Initial Setup

Before running the notebook, there is some initial setup to do. Let's make sure all of the necessary libraries are installed.

In [None]:
! pip install amazon-denseclus==0.2.2 --upgrade --quiet
! pip install sagemaker seaborn ipywidgets --upgrade --quiet

In [None]:
import sagemaker
import hdbscan
import numpy as np
import pandas as pd
import seaborn as sns
import umap.umap_ as umap

from denseclus import DenseClus
from denseclus.categorical import extract_categorical
from denseclus.numerical import extract_numerical

SEED = 42  # random seed to set reproducibility as best we can

sns.set_style("darkgrid")
sns.set_context("notebook")

%matplotlib inline

## Load Data
Let's start by downloading the publicly available *Census Income dataset* from https://archive.ics.uci.edu/ml/datasets/Adult. For each person, the dataset records attributes such as age, work class, education, country, and race. It also includes an indicator of whether the person's income exceeds $50K a year. The original prediction task is to determine whether a person makes over $50K a year.


### Data Description
Let's talk about the data. At a high level, we can see:

- There are 15 columns and around 32K rows in the training data
- 8 of the 14 features are categorical and the remaining 6 are numeric; the 15th column is the income indicator

Now let's read this into a pandas DataFrame and take a look.

In [None]:
s3_downloader = sagemaker.s3.S3Downloader()
region = sagemaker.Session().boto_region_name

## read the data
s3_downloader.download(
    s3_uri=f"s3://sagemaker-example-files-prod-{region}/datasets/tabular/uci_adult/adult.data",
    local_path='.'
)

df = pd.read_csv("adult.data", header=None)

## set column names
df.columns = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "IncomeGroup",
]

df.head()

As you can see, the data consists of both categorical and numeric features.
Generally speaking, this is problematic for traditional dimension reduction and clustering methods such as [K-Means](https://en.wikipedia.org/wiki/K-means_clustering), because they require numeric input features and assume that clusters are roughly spherical.
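
To make the problem concrete, here is a minimal sketch of what goes wrong when a centroid-based method averages integer-encoded categories. The three `workclass` values and their integer codes are arbitrary choices made for illustration; the notebook itself does not compute anything like this.

```python
from statistics import mean

# Arbitrary integer codes for three categories (an assumption for illustration).
codes = {"Private": 0, "State-gov": 1, "Self-emp-inc": 2}
labels_by_code = {code: name for name, code in codes.items()}

# A hypothetical cluster that contains one "Private" and one "Self-emp-inc" worker.
cluster_members = ["Private", "Self-emp-inc"]

# A centroid-based method averages the encoded values of its members.
centroid_code = mean(codes[member] for member in cluster_members)
centroid_label = labels_by_code[centroid_code]

# The "average" of the two categories decodes to a third, unrelated category.
print(centroid_code, centroid_label)
```

The centroid lands on a category that no member of the cluster has, purely because of how the codes happened to be assigned.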

## Build clusters with numeric and categorical features separately

What happens if we treat the numerical and categorical features as separate datasets and cluster them individually? Let's explore.

We can use UMAP, a popular dimensionality reduction algorithm, coupled with HDBSCAN, a density-based clustering algorithm.

In [None]:
default_umap_params = {
    "categorical": {
        "metric": "hamming",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "numerical": {
        "metric": "l2",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
}

hdbscan_params = {
    "min_cluster_size": 100,
    "min_samples": 15,
    "gen_min_span_tree": True,
    "metric": "euclidean",
}
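
The two UMAP configurations above differ only in their `metric`. As a rough guide to what those metric names mean, the sketch below implements the usual textbook definitions of normalized Hamming distance and L2 (Euclidean) distance in plain Python. The example vectors are hypothetical, and UMAP's own compiled implementations may differ in detail.

```python
import math


def hamming(u, v):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def l2(u, v):
    """Euclidean (L2) distance between two equal-length numeric sequences."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical one-hot encoded categorical rows: they differ in 2 of 4 positions.
cat_a = [1, 0, 0, 1]
cat_b = [1, 0, 1, 0]
hamming_ab = hamming(cat_a, cat_b)

# Hypothetical scaled numerical rows forming a 3-4-5 right triangle.
num_a = [0.0, 3.0]
num_b = [4.0, 0.0]
l2_ab = l2(num_a, num_b)

print(hamming_ab, l2_ab)
```

Hamming distance only asks whether values match, which suits encoded categories, while L2 distance uses the magnitude of the differences, which suits numeric features.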

### Numerical features

In [None]:
numerical_df = extract_numerical(df)

numerical_umap = umap.UMAP(
    random_state=SEED,
    n_jobs=1,
    verbose=False,
    low_memory=True,
    **default_umap_params["numerical"],
).fit(numerical_df)

numerical_hdb = hdbscan.HDBSCAN(**hdbscan_params).fit(numerical_umap.embedding_)

joint_plot = sns.jointplot(
    x=numerical_umap.embedding_[:, 0],
    y=numerical_umap.embedding_[:, -1],
    hue=numerical_hdb.labels_,
    kind="kde",
    marginal_ticks=True,
)

In [None]:
# Copy so that adding the segment column does not write to a view of df
numerical_df = df.select_dtypes(include=[int, float]).copy()
numerical_df["segment"] = numerical_hdb.labels_

numerical_df.groupby(["segment"]).agg(["mean", "median"])

From the above plot and descriptive stats generated from the clusters using only numerical features, we can see that:

- Cluster 1 seems to represent a younger demographic with lower education, `fnlwgt` (the census sampling weight, which is not a measure of income), and working hours. This could be students or entry-level workers.
- Clusters 0, 1, 2, 4, 7, and 9 seem to represent middle-aged working professionals with higher education, income, and working hours.
- Cluster 5 clearly captures a very young demographic, likely teenagers/students based on the low age and hours worked. 
- Cluster 6 seems to be early career adults with moderate education and income.
- Cluster 8 represents older, retired people with lower education, income, and hours worked. 

Some limitations of using just these numerical features:

- We lack contextual categorical features like occupation, marital status, etc that would help distinguish the clusters more clearly.
- The clusters are formed solely based on central tendencies of numerical features. But subgroups within a cluster can exhibit very different characteristics. 
- Important outliers in the data may not be well-represented in the clusters.
- Numerical features alone may not fully capture more complex behavioral segments.

So while these numeric features give some interesting initial profiles, including categorical and behavioral features would likely result in more nuanced and meaningful segments.
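
The point about central tendencies can be illustrated with a small sketch. The two lists of `hours-per-week` values below are made up for the example; they share the same mean and median but have very different spreads.

```python
from statistics import mean, median, pstdev

# Hypothetical hours-per-week values for two segments (illustrative only).
segments = {
    "segment_a": [40, 40, 40, 40],
    "segment_b": [10, 20, 60, 70],
}

# Summarize each segment with its mean, median, and population standard deviation.
summary = {
    name: {"mean": mean(values), "median": median(values), "std": pstdev(values)}
    for name, values in segments.items()
}

for name, stats in summary.items():
    print(name, stats)
```

Both segments look identical on mean and median alone; only the spread reveals that the second one mixes part-time and long-hours workers. In the notebook, adding `"std"` to the list passed to `.agg(...)` would surface the same information.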

### Categorical features

In [None]:
categorical_df = extract_categorical(df)

categorical_umap = umap.UMAP(
    random_state=SEED,
    n_jobs=1,
    verbose=False,
    low_memory=True,
    **default_umap_params["categorical"],
).fit(categorical_df)

categorical_hdb = hdbscan.HDBSCAN(**hdbscan_params).fit(categorical_umap.embedding_)
n_clusters = len(np.unique(categorical_hdb.labels_))
print(f"Number of clusters: {n_clusters}")

In [None]:
joint_plot = sns.jointplot(
    x=categorical_umap.embedding_[:, 0],
    y=categorical_umap.embedding_[:, -1],
    hue=categorical_hdb.labels_,
    kind="kde",
    marginal_ticks=True,
)
joint_plot.ax_joint.legend_.remove()
joint_plot.set_axis_labels("umap_embeddings: 0", "umap_embeddings: -1")

In [None]:
# Copy so that adding the segment column does not write to a view of df
categorical_df = df.select_dtypes(include=[object]).copy()
categorical_df["segment"] = categorical_hdb.labels_

categorical_df.groupby("segment").describe(include="object").tail(10)

From the above plot generated from the clusters using only categorical features, we can see that:

- Too many clusters. The categorical features produce 114 clusters, many of which are very similar to each other.
- High sparsity. Categorical features often lead to high-dimensional, sparse feature spaces, which makes clustering statistically less meaningful. With 114 clusters in a large sparse space, most of the clusters have very few data points.
- Difficult interpretation. Clusters formed on categorical features alone are difficult to characterize and interpret.

In summary, clustering on categorical features alone often lacks clear semantic meaning and differentiation between clusters. It tends to work better when combined with numeric features that provide more inherent structure.
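
One way to quantify how fragmented a clustering is would be to summarize the label array directly. The helper below is a hypothetical sketch (the function name, threshold, and toy labels are not part of the notebook). Unlike `len(np.unique(labels))`, it excludes the HDBSCAN noise label `-1` from the cluster count and reports it separately.

```python
from collections import Counter


def summarize_labels(labels, small_threshold):
    """Count clusters, the noise fraction, and clusters smaller than a threshold."""
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    noise = counts.pop(-1, 0)  # HDBSCAN marks noise points with -1
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / len(labels),
        "n_small_clusters": sum(size < small_threshold for size in counts.values()),
    }


# Toy labels: four clusters (0-3) plus two noise points.
toy_labels = [0, 0, 0, 0, 1, 1, 2, -1, -1, 3]
report = summarize_labels(toy_labels, small_threshold=2)
print(report)
```

With the notebook's objects, a call such as `summarize_labels(categorical_hdb.labels_.tolist(), small_threshold=200)` would give a comparable summary; the threshold is a judgment call.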

## DenseClus on all features (numerical + categorical)

DenseClus addresses these issues by creating separate UMAP embeddings for the categorical and numerical features and then combining them into a single, dense embedding space. HDBSCAN is then run on the combined space to group dense regions into clusters, resulting in groups of mixed-type data.

All of this is done under the hood and just requires a `fit` call like below.

### There are 5 methods for combining embedding spaces (param: `umap_combine_method`, default: `intersection`)

- 'intersection'
- 'union'
- 'contrast'
- 'intersection_union_mapper'
- 'ensemble'
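
The names `intersection` and `union` come from fuzzy set theory: UMAP represents each dataset as a graph whose edges carry a membership strength between 0 and 1. The sketch below is a simplified illustration of that idea only, using the product for intersection and the probabilistic sum for union on two tiny hand-written graphs. It is not UMAP's or DenseClus's actual implementation, which applies additional weighting and normalization, and it does not cover the other three methods.

```python
def fuzzy_intersection(graph_a, graph_b):
    """Keep edges present in both graphs; strength is the product of the two."""
    shared = graph_a.keys() & graph_b.keys()
    return {edge: graph_a[edge] * graph_b[edge] for edge in shared}


def fuzzy_union(graph_a, graph_b):
    """Keep edges present in either graph; strength is a + b - a * b."""
    combined = {}
    for edge in graph_a.keys() | graph_b.keys():
        a = graph_a.get(edge, 0.0)
        b = graph_b.get(edge, 0.0)
        combined[edge] = a + b - a * b
    return combined


# Hypothetical neighbor graphs over three points p1, p2, p3.
numerical_graph = {("p1", "p2"): 0.5, ("p2", "p3"): 0.5}
categorical_graph = {("p1", "p2"): 0.5, ("p1", "p3"): 1.0}

both = fuzzy_intersection(numerical_graph, categorical_graph)
either = fuzzy_union(numerical_graph, categorical_graph)

print(both)
print(sorted(either.items()))
```

Intersection keeps only the neighbor relationships that both feature types agree on, while union keeps any relationship that either one supports.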

----------

Let's implement each one of the above methods and visualize their respective umap embeddings. 

In [None]:
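def plot_clusters(method, embeddings, clusters):
    """Display a KDE joint plot of the first and last embedding dimensions, colored by cluster."""

    joint_plot = sns.jointplot(
        x=embeddings[:, 0],
        y=embeddings[:, -1],
        hue=clusters,
        kind="kde",
        marginal_ticks=True,
    )
    joint_plot.set_axis_labels("umap_embeddings: 0", "umap_embeddings: -1")

    joint_plot.figure.suptitle(f"Method: {method}")

    # Count clusters from the argument instead of relying on a global variable
    n_clusters = len(np.unique(clusters))
    legend = joint_plot.ax_joint.get_legend()
    # Hide the legend when there are too many clusters to list legibly
    if n_clusters > 50 and legend is not None:
        legend.remove()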

In [None]:
methods = ["intersection", "union", "contrast", "intersection_union_mapper", "ensemble"]

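# NOTE: this dictionary is defined here but is not passed to DenseClus in the loop below,
# so the clusters in this cell come from the library's default HDBSCAN settings.
hdbscan_params = {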
    "min_cluster_size": 200,
    "min_samples": 30,
    "gen_min_span_tree": True,
    "metric": "manhattan",
    "cluster_selection_method": "eom",
}

for method in methods:
    print(f"Running combination method: {method}")
    clf = DenseClus(random_state=SEED, umap_combine_method=method)
    clf.fit(df)
    clusters = clf.evaluate()
    n_clusters = len(np.unique(clusters))
    print(f"Number of clusters: {n_clusters}")

    if hasattr(clf, "mapper_"):
        embeddings = clf.mapper_.embedding_
    else:
        embeddings = clf.numerical_umap_.embedding_

    plot_clusters(method, embeddings, clusters)

    print("-" * 30)

----------
As a recap, the steps that happened are:

1) Numerical features were extracted and reduced into a *dense* UMAP embedding

2) Categorical features were extracted and reduced into a separate *dense* UMAP embedding

3) The two embeddings were then combined with one of the available operations ("intersection", "union", "contrast", "intersection_union_mapper", "ensemble")

4) HDBSCAN, a hierarchical density-based clustering algorithm, extracted clusters from the combined space

The features and embeddings are now exposed on the `DenseClus` object. 

----------
## Analysis

Based on the above results for each method, `intersection_union_mapper` seems to be the most suitable method for clustering our dataset. As you can see, 5 distinct islands have formed within this slice of the data. Clusters have formed around these densities, which is exactly the behavior we expect from DenseClus.

`intersection_union_mapper` is a hybrid method that combines the strengths of both 'intersection' and 'union'. It first applies the 'intersection' method to preserve the numerical embeddings, then applies the 'union' method to preserve the categorical embeddings. This method is useful when both numerical and categorical data are important, but one type of data is not necessarily more important than the other.

Let's take a closer look at the embedding results for the `intersection_union_mapper` method:

In [None]:
ium_clf = DenseClus(random_state=SEED, umap_combine_method="intersection_union_mapper")
ium_clf.fit(df)

labels = ium_clf.evaluate()

### Checking Embedding Results

Verify the embeddings are now densely shaped.

In [None]:
# Plot the density of each embedding dimension
for i in range(ium_clf.mapper_.embedding_.shape[1]):
    sns.kdeplot(ium_clf.mapper_.embedding_[:, i], fill=True)

### Inspection of Cluster Results

Under the hood, among other steps, DenseClus uses HDBSCAN to cluster the data.

Let's look at how the data got split.

In [None]:
cnts = pd.DataFrame(labels)[0].value_counts()
cnts = cnts.reset_index()
cnts.columns = ["cluster", "count"]
print(cnts.sort_values(["cluster"]))

Upon examination, there are exactly 5 clusters, with -1 representing the noise found in the data.

### Profiling the Clusters

Finally, once clusters are formed, it's common practice to describe what each one means.

Here, descriptive statistics are a very powerful (and efficient) tool.

Let's ignore group `-1`, since those data points did not fit into any of the clusters.

In [None]:
df["segment"] = labels
clustered_results = df[df.segment > -1]

In [None]:
numerics = clustered_results.select_dtypes(include=[int, float])
numerics.groupby("segment").mean()


A similar type of analysis is possible with categorical features:

In [None]:
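# Start from the noise-free rows so the frame and its segment labels stay aligned
categorical = clustered_results.select_dtypes(include=["object"]).copy()
categorical["segment"] = clustered_results["segment"]
categorical.groupby("segment").describe(include=["object"]).T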
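
The profiling idea can also be shown end to end on a handful of rows. The sketch below uses made-up `(segment, age, occupation)` tuples and only the standard library; it drops the noise group, then reports the mean age and the most common occupation, with its share, for each segment.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical rows: (segment, age, occupation). Segment -1 marks noise.
rows = [
    (0, 25, "Sales"),
    (0, 31, "Sales"),
    (0, 28, "Tech-support"),
    (1, 52, "Exec-managerial"),
    (1, 48, "Exec-managerial"),
    (-1, 90, "Sales"),
]

# Group the rows by segment, skipping the noise group.
by_segment = defaultdict(list)
for segment, age, occupation in rows:
    if segment != -1:
        by_segment[segment].append((age, occupation))

# Profile each segment with one numeric and one categorical summary.
profile = {}
for segment, members in sorted(by_segment.items()):
    occupations = Counter(occupation for _, occupation in members)
    top_occupation, top_count = occupations.most_common(1)[0]
    profile[segment] = {
        "mean_age": mean(age for age, _ in members),
        "top_occupation": top_occupation,
        "top_share": top_count / len(members),
    }

for segment, stats in profile.items():
    print(segment, stats)
```

Reporting the share of the top category, rather than only its name, helps show whether a segment is dominated by one value or merely led by it.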

## Conclusion

This notebook walked through how and why to use DenseClus to build high-quality clusters from mixed-type data. We explored 5 methods for combining information from mixed-type data ("intersection", "union", "contrast", "intersection_union_mapper", and "ensemble") and compared their results.

When applying DenseClus to your own datasets, you may wish to experiment with different combine methods and hyperparameter combinations to see what works best on your data.
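
As a starting point for that kind of experimentation, the sketch below enumerates a small grid of configurations. The candidate values for `min_cluster_size` and `min_samples` are simply the two settings that appear in this notebook, and the dictionary layout is a hypothetical bookkeeping format; how your installed DenseClus version accepts HDBSCAN settings should be checked against its documentation.

```python
from itertools import product

combine_methods = [
    "intersection",
    "union",
    "contrast",
    "intersection_union_mapper",
    "ensemble",
]
min_cluster_sizes = [100, 200]  # illustrative candidates
min_samples_options = [15, 30]  # illustrative candidates

# One configuration per combination of method and HDBSCAN settings.
experiments = [
    {
        "umap_combine_method": method,
        "hdbscan_params": {"min_cluster_size": size, "min_samples": samples},
    }
    for method, size, samples in product(
        combine_methods, min_cluster_sizes, min_samples_options
    )
]

print(len(experiments))
print(experiments[0])
```

Each configuration can then be fitted and compared on the number of clusters and the share of points labeled as noise.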