# K-Means clustering technique to find PSD families within the CAMP2Ex field campaign dataset

---

## Imports

In [None]:
import xarray as xr
import datatree
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.transforms as mtransforms
from xhistogram.xarray import histogram
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from scipy.special import gamma
from dask.distributed import Client, LocalCluster
from matplotlib.colors import ListedColormap

In [None]:
# setting up the Seaborn style, including figure dpi
sns.set(rc={"figure.dpi":150, 'savefig.dpi':150})
sns.set(style='white', font_scale=0.9)
sns.set_style("ticks")

### Local Cluster

Let's spin up our `Dask` local cluster

In [None]:
cluster = LocalCluster()
# connect a client so that Dask computations run on the local cluster
client = Client(cluster)
# display(client)

## Data

The CAMP2Ex dataset is stored in Analysis-Ready Cloud-Optimized (ARCO) format ([Abernathey et al. 2021](https://ieeexplore.ieee.org/document/9354557)) using the [Xarray-Datatree](https://xarray-datatree.readthedocs.io/en/latest/) data model, which allows us to hold both the Learjet and P3B datasets in one `datatree`.

In [None]:
path_data = '../data/camp2ex_dtree.zarr'
dt_camp2ex = datatree.open_datatree(path_data, engine='zarr', consolidated=True)

In [None]:
display(dt_camp2ex['Lear'].ds)

In [None]:
display(dt_camp2ex['P3B'].ds)

Let's select the fields we will use in the K-means clustering analysis, along with other variables we will use later when training our deep neural network.

In [None]:
cols = ['sigma', 'dm', 'log10_nw', 'r', 'nt', 'lwc_cum', 'dbz_t_ku', 'dbz_t_ka', 'mu', 'new_mu', 'Att_ka', 'temp']

Now we can merge both datasets into a single `Xarray.Dataset`

In [None]:
ds = xr.concat([dt_camp2ex['Lear'].ds[cols], dt_camp2ex['P3B'].ds[cols]], dim='time')

We discard data with liquid water content $LWC \leq 0.01 \ g \, m^{-3}$ (Lance et al., 2010; Gupta et al., 2021) and take the $log_{10}$ of the rainfall rate (r), total number concentration (nt), and liquid water content (lwc_cum).

In [None]:
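# keep only samples with LWC above 0.01 g m^-3
ds = ds.where(ds.lwc_cum > 0.01, drop=True)
# log10-transform rainfall rate, total number concentration and LWC
ds['logr'] = np.log10(ds.r)
ds['lognt'] = np.log10(ds.nt)
ds['loglwc'] = np.log10(ds.lwc_cum)
ds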

Now we convert our `xarray.Dataset` into a `pandas.DataFrame`.

In [None]:
df = ds.to_dataframe().reset_index()
display(df.head(5))

## K-means

To apply the cluster analysis, we standardize our input features by removing the mean and scaling to unit variance using the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from the `scikit-learn` Python package.

In [None]:
scaler = StandardScaler()
features = ['sigma', 'dm', 'log10_nw', 'logr', 'lognt', 'loglwc']
# standardized copies of each feature are stored with a "_T" suffix
scaled_cols = [f'{name}_T' for name in features]
df[scaled_cols] = scaler.fit_transform(df[features])
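
As a quick sanity check of what standardization does, the sketch below applies `StandardScaler` to a small synthetic table and confirms that every scaled column ends up with zero mean and unit variance. The `toy` dataframe and its values are hypothetical and are not taken from CAMP2Ex.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical synthetic features (not CAMP2Ex data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "dm": rng.normal(1.2, 0.4, 500),
    "log10_nw": rng.normal(4.0, 1.0, 500),
})

toy_scaled = StandardScaler().fit_transform(toy)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of numpy's std
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```

Because K-means relies on Euclidean distances, putting all features on the same scale prevents the variable with the largest numeric range from dominating the clustering.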

### K-means clustering benchmarking

K-means is an unsupervised machine learning technique that requires the number of clusters (k) to be defined beforehand. To determine the optimal k for the PSDs, we ran the algorithm for k values ranging from 2 to 14 and evaluated each solution with three criteria: the within-cluster sum of squares (WCSS, inspected with the elbow method), the Davies-Bouldin index (Davies & Bouldin, 1979), and the Silhouette score (Rousseeuw, 1987). A lower Davies-Bouldin index and a higher Silhouette score both indicate better-separated clusters.

In [None]:
def get_kmeans_score(df, center):
    '''
    Fits K-means and returns the Davies-Bouldin index, the Silhouette score
    and the elbow inertia (WCSS)
    INPUT:
        df - the (scaled) dataset you want to fit kmeans to
        center - the number of centers you want (the k value)
    OUTPUT:
        Davies-Bouldin index, Silhouette score and inertia, in that order
    '''
    kmeans = KMeans(n_clusters=center, random_state=10)
    # fit once and reuse the labels for both scores
    cluster_labels = kmeans.fit_predict(df)

    dav = davies_bouldin_score(df, cluster_labels)
    sil = silhouette_score(df, cluster_labels)
    elbow = kmeans.inertia_
    return dav, sil, elbow

We define lists to store the value of each score for every number of clusters, and then test each value of k.

In [None]:
dav = []
sil = []
elbow = []

for k in range(2,15):
    _dav, _sil, _el = get_kmeans_score(df[['sigma_T', 'dm_T', 'log10_nw_T', 'logr_T', 'lognt_T', "loglwc_T"]], k)
    dav.append(_dav)
    sil.append(_sil)
    elbow.append(_el)
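
To build intuition for how these scores behave, the same benchmarking loop can be run on synthetic data where the true number of clusters is known. This is only a sketch: the blob centers, `X_toy`, and the explicit `n_init=10` are assumptions made for the illustration and are not part of the CAMP2Ex analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# synthetic data: three well-separated blobs (hypothetical, not CAMP2Ex data)
blob_centers = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])
X_toy, _ = make_blobs(n_samples=600, centers=blob_centers,
                      cluster_std=0.6, random_state=0)

ks = list(range(2, 7))
dav_toy, sil_toy, wcss_toy = [], [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=10)
    labels = km.fit_predict(X_toy)
    dav_toy.append(davies_bouldin_score(X_toy, labels))  # lower is better
    sil_toy.append(silhouette_score(X_toy, labels))      # higher is better
    wcss_toy.append(km.inertia_)                         # look for the elbow

best_k_dav = ks[int(np.argmin(dav_toy))]
best_k_sil = ks[int(np.argmax(sil_toy))]
print(best_k_dav, best_k_sil)
```

On real data the three criteria rarely agree this cleanly, which is why we inspect all of them together rather than relying on a single score.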

Now we can inspect how each score varies with the number of clusters.


In [None]:
centers = range(2,15)
fig, (ax, ax1, ax2) = plt.subplots(1, 3, figsize=(12, 4), dpi=100)
ax.plot(centers, dav, linestyle='--', marker='o', color='b');
ax.set_xlabel('K');
ax.set_ylabel('Score');
ax.set_title('Davies Bouldin method');

ax1.plot(centers, sil, linestyle='--', marker='o', color='b');
ax1.set_xlabel('K');
ax1.set_ylabel('Score');
ax1.set_title('Silhouette method');

ax2.plot(centers, elbow, linestyle='--', marker='o', color='b');
ax2.set_xlabel('K');
ax2.set_ylabel('Score');
ax2.set_title('Elbow method');
fig.tight_layout()

### K-means clustering with 6 PSD families


Based on the cluster benchmarking, we concluded that k=6 is the most suitable number of clusters to represent the data.

In [None]:
# select scaled/transformed data
X = df[['sigma_T', 'dm_T', 'log10_nw_T', 'logr_T', 'lognt_T', "loglwc_T"]]


We can now apply the K-means cluster technique using these (X) features

In [None]:
kmeans = KMeans(n_clusters=6, random_state=10)
kmeans.fit(X)

Create a new column with the K-means labels. We add one to have labels from 1 to 6 (instead of 0 to 5).

In [None]:
df['kmeans_6'] = kmeans.labels_ + 1

In [None]:
# Reorder and replace some labels to make them equal when plotting mean PSDs
df['kmeans'] = df['kmeans_6'].replace([6, 1, 5, 2, 4, 3], 
                                      [1, 2, 3, 4, 5, 6])

# computing the Dual Frequency Ratio (Ku-band minus Ka-band reflectivity)
df['dfr'] = df['dbz_t_ku'] - df['dbz_t_ka']
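
The relabelling step above can also be expressed with an explicit dictionary, which makes the old-to-new correspondence easy to read and guarantees that all labels are swapped simultaneously. The sketch below uses a hypothetical `raw` series of labels and `Series.map` as an alternative to `replace`.

```python
import pandas as pd

# hypothetical raw K-means labels (1 to 6)
raw = pd.Series([6, 1, 5, 2, 4, 3, 6])

# old labels listed in the order in which they should become 1, 2, ..., 6
order = [6, 1, 5, 2, 4, 3]
mapping = {old: new for new, old in enumerate(order, start=1)}

relabelled = raw.map(mapping)
print(mapping)
print(relabelled.tolist())
```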

In [None]:
# function that computes the normalized-gamma size distribution
def norm_gamma(d, nw, mu, dm):
    """
    Function that computes the normalized-gamma size distribution (Testud et al., 2002)
    Param d: diameter in mm
    Param nw: Normalized intercept parameter
    Param mu: Shape parameter
    Param dm: Mass-weighted mean diameter
    """
    f_mu = (6 * (4 + mu) ** (mu + 4)) / (4 ** 4 * gamma(mu + 4) )
    slope = (4 + mu) / dm
    return nw * f_mu * (d / dm) ** mu * np.exp(-slope * d)
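
A useful property of the normalized-gamma form is that its parameters can be recovered from the moments of the distribution: the ratio of the fourth to the third moment gives $D_m$, and $N_w = 4^4 M_3 / (6 D_m^4)$. The sketch below checks this numerically. The parameter values (`nw`, `mu`, `dm`) and the diameter grid are arbitrary choices made for the illustration.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.special import gamma

def norm_gamma(d, nw, mu, dm):
    # same normalized-gamma expression as in the cell above
    f_mu = (6 * (4 + mu) ** (mu + 4)) / (4 ** 4 * gamma(mu + 4))
    slope = (4 + mu) / dm
    return nw * f_mu * (d / dm) ** mu * np.exp(-slope * d)

# arbitrary illustrative parameters (not fitted to CAMP2Ex data)
nw, mu, dm = 8000.0, 3.0, 1.5
d = np.linspace(1e-4, 30.0, 300_000)  # diameter grid in mm
n_d = norm_gamma(d, nw=nw, mu=mu, dm=dm)

# third and fourth moments of the distribution
m3 = trapezoid(n_d * d ** 3, d)
m4 = trapezoid(n_d * d ** 4, d)

dm_recovered = m4 / m3
nw_recovered = 4 ** 4 * m3 / (6 * dm_recovered ** 4)
print(dm_recovered, nw_recovered)
```

Both recovered values should match the inputs to within the accuracy of the numerical integration.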

### K-means results 

The scatter plot of $D_m$ and $N_w$ below is colored by PSD family. The representative PSD of each group, computed from the median $\mu$, $D_m$, and $N_w$ of that group, is also displayed.

In [None]:
# number of clusters
n_c = 6
# defining the Colormap for each cluster identified
my_cmap6 = ListedColormap(sns.color_palette('deep', n_c))
colors6 = my_cmap6(np.linspace(0,1, n_c))

In [None]:
# Plotting results 
fig, axs = plt.subplot_mosaic([['a)', 'b)']], figsize=(8,4))

# left panel
ax = axs['a)']
# Scatter plot of Dm and Nw
ax = sns.scatterplot(data=df, x=df['dm'], y=df['log10_nw'], hue=df['kmeans'], s=3, ax=ax, 
                          palette=sns.color_palette('deep', 6), legend=False, edgecolor=None)

ax.set_xlabel(r"$D_m \ [mm]$")
ax.set_ylabel(r"$Log_{10}(N_w) \ [Log_{10}(m^{-3}mm^{-1})]$")
ax.grid(True, linestyle='--', lw=0.5, dashes=[7,7])

dms = np.linspace(df['dm'].min(), df['dm'].max(), 100)

# Plotting Bringi et al (2009) convective-stratiform separation
s_c = -1.6 * dms + 6.3
ax.plot(dms, s_c, c='k', ls='-.', lw=0.8, label=r"$Bringi \ et \ al. \ (2009)$")
ax.legend()

# right panel
ax1 = axs['b)']
ax1.set_yscale('log')
ax1.set_ylim(1e-3, 1e9)
d = dt_camp2ex['Lear'].ds.diameter/1000
ax1.set_ylabel(r"$N(D) \  [mm^{-1}m^{-3}]$")
ax1.set_xlabel(r"$D\ [mm]$")
ax1.grid(True, linestyle='--', lw=0.5, dashes=[7,7])
ax1.set_xlim(-0.2, 3)

# computing the representative PSD of each group from the median mu, dm and nw
for i in range(1, n_c + 1):
    df_sub = df[df['kmeans'] == i]
    mu = df_sub['mu'].quantile(0.5)
    dm = df_sub['dm'].quantile(0.5)
    nw = (10 ** (df_sub['log10_nw'])).quantile(0.5)
    gm = norm_gamma(d, nw=nw, mu=mu, dm=dm)
    ax1.plot(d, gm, c=colors6[i-1], label=f"Group {i}")


lines_labels = [ax.get_legend_handles_labels() for ax in fig.axes]
lines, labels = [sum(lol, []) for lol in zip(*lines_labels)]


fig.legend(lines[1:], labels[1:], loc='upper center', ncol=6, bbox_to_anchor=[0.5, 1.025])
for label, ax in axs.items():
    # label physical distance in and down:
    trans = mtransforms.ScaledTranslation(-45/72, -1/72, fig.dpi_scale_trans)
    ax.text(0.0, 1.0, label, transform=ax.transAxes + trans,
            fontsize='medium', verticalalignment='top')
fig.tight_layout()

## Dataset Imbalance

It is good practice to check for data imbalance before training a machine learning algorithm. We set a $1 mm$ threshold in $D_m$ and count the number of PSDs within each category.

In [None]:
# creating a categorical variable to split data into greater and smaller Dm
df['dm_class'] = (df.dm >= 1.0).astype(int)
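
The imbalance can be quantified with simple class counts and fractions. The sketch below uses a hypothetical `toy` dataframe with a handful of $D_m$ values, and `imbalance_ratio` is a name introduced here only for the illustration.

```python
import numpy as np
import pandas as pd

# hypothetical Dm values in mm (not CAMP2Ex data)
toy = pd.DataFrame({"dm": [0.3, 0.5, 0.8, 0.9, 1.0, 1.7]})
toy["dm_class"] = (toy.dm >= 1.0).astype(int)

# counts[0] -> Dm < 1 mm, counts[1] -> Dm >= 1 mm
counts = np.bincount(toy["dm_class"], minlength=2)
fractions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()
print(counts, fractions.round(3), imbalance_ratio)
```

A ratio far from one signals that the minority class may need special handling, for example resampling or class weights, in later modelling steps.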

Then, we can create a two-dimensional histogram to see the density distribution of our dataset. We can also include a bar chart with the two classes we previously defined ($D_m \geq 1 mm$ and $D_m < 1 mm$).

In [None]:
# Creating the 2D-histogram inputs
xbins = np.linspace(ds.dm.min(), ds.dm.max(), 50)
ybins = np.linspace(ds.log10_nw.min(), ds.log10_nw.max(), 50)
psd = histogram(ds.dm, ds.log10_nw, bins=[xbins, ybins])
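
Note that the `bins` arrays are bin edges, so 50 edges produce 49 bins along each axis. The sketch below illustrates this with `numpy.histogram2d` on synthetic data, assuming `xhistogram` follows the same edge convention as NumPy. The `dm_toy` and `log10_nw_toy` arrays are hypothetical.

```python
import numpy as np

# hypothetical synthetic samples (not CAMP2Ex data)
rng = np.random.default_rng(1)
dm_toy = rng.uniform(0.1, 2.5, 1000)
log10_nw_toy = rng.normal(4.0, 1.0, 1000)

# 50 edges per axis, spanning the full data range
xbins = np.linspace(dm_toy.min(), dm_toy.max(), 50)
ybins = np.linspace(log10_nw_toy.min(), log10_nw_toy.max(), 50)

counts2d, xedges, yedges = np.histogram2d(dm_toy, log10_nw_toy,
                                          bins=[xbins, ybins])
print(counts2d.shape, int(counts2d.sum()))
```

Since the edges span the minimum and maximum of the data and the last bin is closed on the right, every sample falls inside the histogram.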

In [None]:
# Plotting data imbalance results
fig, ax = plt.subplots(figsize=(5.5,4.5))

# 2D-histogram
im = psd.T.where(psd.T > 0, np.nan).plot(add_colorbar=False, ax=ax, cmap='magma_r', vmin=0, vmax=100)
fig.colorbar(im , ax=ax, label=r"$Counts$")
ax.set_xlabel(r"$D_m \ [mm]$")
ax.set_ylabel(r"$Log_{10}(N_w) \ [Log_{10}(m^{-3}mm^{-1})]$")
sns.despine()
ax.grid(True, linestyle='--', lw=0.5, dashes=[7,7])
ax.set_xlim(-0.1, 2.8)
ax.set_ylim(2, 11)
ax.vlines(x=1, ymin=2, ymax=11, lw=0.5, linestyle='--', color='k')

# Bar plot
l, b, h, w = .45, .60, .15, .3
ax2 = fig.add_axes([l, b, w, h])
bar_colors = ['tab:red', 'tab:blue']
ax2.bar([r'$D_m < 1.0$', r"$D_m \geq 1.0$"], np.bincount(df['dm_class']), color=bar_colors)
ax2.set_title("Counts")

### Saving dataframe

We save the K-means output and the dataset imbalance results for further analysis.

In [None]:
df.to_parquet('../data/df_cluster.parquet')