# Databricks notebook source
# MAGIC %md
# MAGIC You may find this series of notebooks at https://github.com/databricks-industry-solutions/segmentation.git. For more information about this solution accelerator, visit https://www.databricks.com/solutions/accelerators/customer-segmentation.
# COMMAND ----------
# MAGIC %md The purpose of this notebook is to identify potential segments for our households using a clustering technique.
# COMMAND ----------
# DBTITLE 1,Import Required Libraries
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.model_selection import train_test_split
from scipy.cluster.hierarchy import dendrogram, set_link_color_palette
import numpy as np
import pandas as pd
import mlflow
import os
from delta.tables import DeltaTable
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors
import seaborn as sns
# COMMAND ----------
# MAGIC %md ## Step 1: Retrieve Features
# MAGIC
# MAGIC Following the work performed in our last notebook, our households are now identified by a limited number of features that capture the variation found in our original feature set. We can retrieve these features as follows:
# COMMAND ----------
# DBTITLE 1,Initialize the gold table paths
dbutils.fs.rm("/tmp/completejourney/gold/", True)
dbutils.fs.mkdirs("/tmp/completejourney/gold/")
# COMMAND ----------
# DBTITLE 1,Retrieve Transformed Features
# retrieve household (transformed) features
household_X_pd = spark.table('DELTA.`/tmp/completejourney/silver/features_finalized/`').toPandas()
# remove household ids from dataframe
X = household_X_pd.drop(['household_id'], axis=1)
household_X_pd
# COMMAND ----------
# MAGIC %md The exact meaning of each feature is very difficult to articulate given the complex transformations used in their engineering. Still, they can be used to perform clustering. (Through profiling which we will perform in our next notebook, we can then retrieve insight into the nature of each cluster.)
# MAGIC
# MAGIC As a first step, let's visualize our data to see if any natural groupings stand out. Because we are working with a high-dimensional space, we cannot perfectly visualize our data, but with a 2-D representation (using our first two principal component features), we can see there is one sizeable cluster in our data and potentially a few additional, more loosely organized clusters:
# COMMAND ----------
# DBTITLE 1,Plot Households
fig, ax = plt.subplots(figsize=(10,8))
_ = sns.scatterplot(
  data=X,
  x='Dim_1',
  y='Dim_2',
  alpha=0.5,
  ax=ax
  )
# COMMAND ----------
# MAGIC %md ## Step 2: K-Means Clustering
# MAGIC
# MAGIC Our first attempt at clustering will make use of the K-means algorithm. K-means is a simple, popular algorithm for dividing instances into clusters around a pre-defined number of *centroids* (cluster centers). The algorithm works by generating an initial set of points within the space to serve as cluster centers. Instances are then associated with the nearest of these points to form a cluster, and the true center of the resulting cluster is re-calculated. The new centroids are then used to re-assign cluster members, and the process is repeated until a stable solution is generated (or until the maximum number of iterations is exhausted). A quick demonstration run of the algorithm may produce a result as follows:
# COMMAND ----------
# DBTITLE 1,Demonstrate Cluster Assignment
# set up the experiment that mlflow logs runs to: an experiment in the user's personal workspace folder
useremail = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
experiment_name = f"/Users/{useremail}/segmentation"
mlflow.set_experiment(experiment_name)
# initial cluster count
initial_n = 4
# train the model
initial_model = KMeans(
  n_clusters=initial_n,
  max_iter=1000
  )
# fit and predict per-household cluster assignment
init_clusters = initial_model.fit_predict(X)
# combine households with cluster assignments
labeled_X_pd = (
  pd.concat(
    [X, pd.DataFrame(init_clusters,columns=['cluster'])],
    axis=1
    )
  )
# visualize cluster assignments
fig, ax = plt.subplots(figsize=(10,8))
sns.scatterplot(
  data=labeled_X_pd,
  x='Dim_1',
  y='Dim_2',
  hue='cluster',
  palette=[cm.nipy_spectral(float(i) / initial_n) for i in range(initial_n)],
  legend='brief',
  alpha=0.5,
  ax = ax
  )
_ = ax.legend(loc='lower right', ncol=1, fancybox=True)
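# COMMAND ----------
# MAGIC %md The assign-then-recompute loop described at the start of this step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the helper name *lloyd_kmeans_sketch*, the toy points and the starting centroids are all hypothetical, and the initial centroids are supplied by the caller rather than chosen by K-means++:
# COMMAND ----------
# DBTITLE 1,Sketch: The K-Means Update Loop
# illustrative only: a hand-rolled version of the K-means loop (hypothetical helper, not part of scikit-learn)
def lloyd_kmeans_sketch(points, centroids, max_iter=100):
  labels = []
  for _ in range(max_iter):
    # assignment step: attach each point to its nearest centroid (squared Euclidean distance)
    labels = [
      min(range(len(centroids)), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
      for point in points
      ]
    # update step: move each centroid to the mean of its current members
    new_centroids = []
    for c in range(len(centroids)):
      members = [point for point, label in zip(points, labels) if label == c]
      if members:
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
      else:
        new_centroids.append(tuple(centroids[c])) # an empty cluster keeps its previous centroid
    # stop once the centroids no longer move, i.e. a stable solution has been reached
    if new_centroids == [tuple(c) for c in centroids]:
      break
    centroids = new_centroids
  return labels, centroids

# toy data: two visibly separated groups of 2-D points (made up for this sketch)
sketch_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# deliberately poor starting centroids, both placed inside the first group
sketch_labels, sketch_centroids = lloyd_kmeans_sketch(sketch_points, centroids=[(0.0, 0.0), (1.0, 0.0)])
print(sketch_labels, sketch_centroids)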
# COMMAND ----------
# MAGIC %md Our initial model run demonstrates the mechanics of generating a K-means clustering solution, but it also demonstrates some of the shortcomings of the approach. First, we need to specify the number of clusters. Setting the value incorrectly can force the creation of numerous smaller clusters or just a few larger clusters, neither of which may reflect what we may observe to be the more immediate and natural structure inherent to the data.
# MAGIC
# MAGIC Second, the results of the algorithm are highly dependent on the centroids with which it is initialized. The use of the K-means++ initialization algorithm addresses some of these problems by better ensuring that initial centroids are dispersed throughout the populated space, but there is still an element of randomness at play in these selections that can have big consequences for our results.
# MAGIC
# MAGIC To begin working through these challenges, we will generate a large number of model runs over a range of potential cluster counts. For each run, we will calculate the sum of squared distances between members and assigned cluster centroids (*inertia*) as well as a secondary metric (*silhouette score*) which provides a combined measure of intra-cluster cohesion and inter-cluster separation (ranging between -1 and 1, with higher values being better). Because of the large number of iterations we will perform, we will distribute this work across our Databricks cluster so that it can be concluded in a timely manner:
# MAGIC
# MAGIC **NOTE** We are using a Spark RDD as a crude means of exhaustively searching our parameter space in a distributed manner. This is a simple technique frequently used for efficient searches over a defined range of values.
# COMMAND ----------
# DBTITLE 1,Iterate over Potential Values of K
# broadcast features so that workers can access efficiently
X_broadcast = sc.broadcast(X)
# function to train model and return metrics
def evaluate_model(n):
  model = KMeans( n_clusters=n, init='k-means++', n_init=1, max_iter=10000)
  clusters = model.fit(X_broadcast.value).labels_
  return n, float(model.inertia_), float(silhouette_score(X_broadcast.value, clusters))
# define number of iterations for each value of k being considered
iterations = (
  spark
    .range(100) # iterations per value of k
    .crossJoin( spark.range(2,21).withColumnRenamed('id','n')) # cluster counts
    .repartition(sc.defaultParallelism)
    .select('n')
    .rdd
  )
# train and evaluate model for each iteration
results_pd = (
  spark
    .createDataFrame(
      iterations.map(lambda n: evaluate_model(n[0])), # iterate over each value of n
      schema=['n', 'inertia', 'silhouette']
      ).toPandas()
  )
# remove broadcast set from workers
X_broadcast.unpersist()
display(results_pd)
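# COMMAND ----------
# MAGIC %md To make the two metrics collected above more concrete, the cell below computes inertia and the average silhouette score by hand for a tiny, made-up 1-D dataset. It is a sketch only: the helpers *inertia_sketch* and *silhouette_sketch* are hypothetical names, they assume Euclidean distance and at least two clusters, and in practice we rely on *KMeans.inertia_* and *silhouette_score* as in the cell above:
# COMMAND ----------
# DBTITLE 1,Sketch: Inertia & Silhouette by Hand
# illustrative only: plain-Python versions of the two metrics (hypothetical helpers)
def inertia_sketch(points, labels):
  # sum of squared distances between each point and the mean (centroid) of its assigned cluster
  total = 0.0
  for c in set(labels):
    members = [p for p, l in zip(points, labels) if l == c]
    centroid = [sum(dim) / len(members) for dim in zip(*members)]
    total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members)
  return total

def silhouette_sketch(points, labels):
  # mean over all points of (b - a) / max(a, b), where
  #   a = mean distance to the other members of the point's own cluster (cohesion)
  #   b = smallest mean distance to the members of any other cluster (separation)
  def dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
  scores = []
  for i, (p, own) in enumerate(zip(points, labels)):
    same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels)) if l == own and j != i]
    if not same:
      scores.append(0.0) # convention: the sole member of a cluster scores 0
      continue
    a = sum(same) / len(same)
    b = min(
      sum(dist(p, q) for q, l in zip(points, labels) if l == other) / labels.count(other)
      for other in set(labels) if other != own
      )
    denom = max(a, b)
    scores.append((b - a) / denom if denom > 0 else 0.0)
  return sum(scores) / len(scores)

# toy data (made up): two tight pairs of points far apart on a line
sketch_metric_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
sketch_good_labels = [0, 0, 1, 1] # assignment that matches the visible structure
sketch_poor_labels = [0, 1, 0, 1] # assignment that splits each pair across clusters
for sketch_name, sketch_assignment in [('good', sketch_good_labels), ('poor', sketch_poor_labels)]:
  print(sketch_name, inertia_sketch(sketch_metric_points, sketch_assignment), silhouette_sketch(sketch_metric_points, sketch_assignment))
# the poor assignment produces both a much larger inertia and a negative silhouette score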
# COMMAND ----------
# MAGIC %md Plotting inertia relative to n, *i.e.* the target number of clusters, we can see that the total sum of squared distances between cluster members and cluster centers decreases as we increase the number of clusters in our solution. Our goal is not to drive inertia to zero (which would be achieved if we made each member the center of its own, 1-member cluster) but instead to identify the point in the curve where the incremental drop in inertia is diminished. In our plot, we might identify this point as occurring somewhere between 2 and 6 clusters:
# COMMAND ----------
# DBTITLE 1,Inertia over Cluster Count
display(results_pd)
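# COMMAND ----------
# MAGIC %md To supplement the chart, we can tabulate how much inertia falls each time a cluster is added and look for the point where the drops flatten out. The cell below is a small sketch of that arithmetic in plain Python. The helper name *inertia_drops_sketch* is hypothetical and the inertia values are made up for illustration; they are not results from this dataset:
# COMMAND ----------
# DBTITLE 1,Sketch: Incremental Drop in Inertia
# illustrative only: inertia lost by moving from each cluster count to the next (hypothetical helper)
def inertia_drops_sketch(inertia_by_n):
  ns = sorted(inertia_by_n)
  return {n: inertia_by_n[prev] - inertia_by_n[n] for prev, n in zip(ns, ns[1:])}

# made-up inertia values keyed by cluster count
sketch_inertia = {2: 100.0, 3: 60.0, 4: 50.0, 5: 45.0}
sketch_drops = inertia_drops_sketch(sketch_inertia)
print(sketch_drops)
# with the real results, one option (an assumption about how you might summarize the 100 runs per n)
# would be to pass in results_pd.groupby('n')['inertia'].median().to_dict()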
# COMMAND ----------
# MAGIC %md Interpreting the *elbow chart*/*scree plot* of inertia *vs.* n is fairly subjective, and as such, it can be helpful to examine how another metric behaves relative to our cluster count. Plotting silhouette score relative to n provides us the opportunity to identify a peak (*knee*) beyond which the score declines. The challenge, as before, is exactly determining the location of that peak, especially given that the silhouette scores for our iterations vary much more than our inertia scores:
# COMMAND ----------
# DBTITLE 1,Silhouette Score over Cluster Count
display(results_pd)
# COMMAND ----------
# MAGIC %md While providing a second perspective, the plot of silhouette scores reinforces the notion that selecting a number of clusters for K-means is a bit subjective. Domain knowledge coupled with inputs from these and similar charts (such as a chart of the [Gap statistic](https://towardsdatascience.com/k-means-clustering-and-the-gap-statistics-4c5d414acd29)) may help point you towards an optimal cluster count but there are no widely-accepted, objective means of determining this value to date.
# MAGIC
# MAGIC **NOTE** We need to be careful to avoid chasing the highest value for the silhouette score in the knee chart. Higher scores can be obtained with higher values of n by simply pushing outliers into trivially small clusters.
# MAGIC
# MAGIC For our model, we'll go with a value of 2. Looking at the plot of inertia, there appears to be evidence supporting this value. Examining the silhouette scores, the clustering solution appears to be much more stable at this value than at values further down the range. To obtain domain knowledge, we might speak with our promotions experts and gain their perspective on not only how different households respond to promotions but what might be a workable number of clusters from this exercise. But most importantly, from our visualization, the presence of 2 well-separated clusters seems to naturally jump out at us.
# MAGIC
# MAGIC With a value for n identified, we now need to generate a final cluster design. Given the randomness of the results we obtain from a K-means run (as captured in the widely variable silhouette scores), we might take a *best-of-k* approach to defining our cluster model. In such an approach, we run through some number of K-means model runs and select the run that delivers the best result as measured by a metric such as silhouette score. To distribute this work, we'll implement a custom function that will allow us to task each worker with finding a best-of-k solution and then take the overall best solution from the results of that work:
# MAGIC
# MAGIC **NOTE** We are again using an RDD to allow us to distribute the work across our cluster. The *iterations* RDD will hold a value for each iteration to perform. Using *mapPartitions()* we will determine how many iterations are assigned to a given partition and then force that worker to perform an appropriately configured best-of-k evaluation. Each partition will send back the best model it could discover and then we will take the best from these.
# COMMAND ----------
# DBTITLE 1,Identify Best of K Model
total_iterations = 50000
n_for_bestofk = 2
X_broadcast = sc.broadcast(X)
def find_bestofk_for_partition(partition):
  # count iterations in this partition
  n_init = sum(1 for i in partition)
  # perform iterations to get best of k
  model = KMeans( n_clusters=n_for_bestofk, n_init=n_init, init='k-means++', max_iter=10000)
  model.fit(X_broadcast.value)
  # score model
  score = float(silhouette_score(X_broadcast.value, model.labels_))
  # return (score, model)
  yield (score, model)
# build RDD for distributed iteration
iterations = sc.range(
  total_iterations,
  numSlices= sc.defaultParallelism * 4
  ) # distribute work into fairly even number of partitions that allow us to track progress
# retrieve best of distributed iterations
bestofk_results = (
  iterations
    .mapPartitions(find_bestofk_for_partition)
    .sortByKey(ascending=False)
    .take(1)
  )[0]
# get score and model
bestofk_score = bestofk_results[0]
bestofk_model = bestofk_results[1]
bestofk_clusters = bestofk_model.labels_
# print best score obtained
print('Silhouette Score: {0:.6f}'.format(bestofk_score))
# combine households with cluster assignments
bestofk_labeled_X_pd = (
  pd.concat(
    [X, pd.DataFrame(bestofk_clusters,columns=['cluster'])],
    axis=1
    )
  )
# clean up
X_broadcast.unpersist()
# COMMAND ----------
# MAGIC %md We can now visualize our results to get a sense of how the clusters align with the structure of our data:
# COMMAND ----------
# DBTITLE 1,Visualize Best of K Clusters
# visualize cluster assignments
fig, ax = plt.subplots(figsize=(10,8))
sns.scatterplot(
  data=bestofk_labeled_X_pd,
  x='Dim_1',
  y='Dim_2',
  hue='cluster',
  palette=[cm.nipy_spectral(float(i) / n_for_bestofk) for i in range(n_for_bestofk)], # align colors with those used in silhouette plots
  legend='brief',
  alpha=0.5,
  ax = ax
  )
_ = ax.legend(loc='lower right', ncol=1, fancybox=True)
# COMMAND ----------
# MAGIC %md The results of our analysis are not earth-shattering but they don't need to be. Our data would indicate that for these features we could very reasonably consider our customer households as existing in two fairly distinct groups. That said, we might want to look at how well individual households sit within these groups, which we can do through a per-instance silhouette chart:
# MAGIC
# MAGIC **NOTE** This code represents a modified version of the [silhouette charts](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) provided in the scikit-learn documentation.
# COMMAND ----------
# DBTITLE 1,Examine Per-Member Silhouette Scores
# modified from https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
def plot_silhouette_chart(features, labels):
  n = len(np.unique(labels))
  # configure plot area
  fig, ax = plt.subplots(1, 1)
  fig.set_size_inches(8, 5)
  # silhouette scores can range from -1 to 1, but the x-axis is limited to [-0.1, 1] here
  ax.set_xlim([-0.1, 1])
  ax.set_ylim([0, len(features) + (n + 1) * 10])
  # avg silhouette score
  score = silhouette_score(features, labels)
  # compute the silhouette scores for each sample
  sample_silhouette_values = silhouette_samples(features, labels)
  y_lower = 10
  for i in range(n):
    # get and sort members by cluster and score
    ith_cluster_silhouette_values = sample_silhouette_values[labels == i]
    ith_cluster_silhouette_values.sort()
    # size y based on sample count
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    # pretty up the charts
    color = cm.nipy_spectral(float(i) / n)
    ax.fill_betweenx(np.arange(y_lower, y_upper),
                     0, ith_cluster_silhouette_values,
                     facecolor=color, edgecolor=color, alpha=0.7)
    # label the silhouette plots with their cluster numbers at the middle
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    # compute the new y_lower for next plot
    y_lower = y_upper + 10 # 10 for the 0 samples
  ax.set_title("Average silhouette of {0:.3f} with {1} clusters".format(score, n))
  ax.set_xlabel("The silhouette coefficient values")
  ax.set_ylabel("Cluster label")
  # vertical line for average silhouette score of all the values
  ax.axvline(x=score, color="red", linestyle="--")
  ax.set_yticks([]) # clear the yaxis labels / ticks
  ax.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
  return fig, ax
_ = plot_silhouette_chart(X, bestofk_clusters)
# COMMAND ----------
# MAGIC %md From the silhouette chart, we appear to have one cluster a bit larger than the other. That cluster appears to be reasonably coherent. Our other cluster appears to be a bit more dispersed, with a more rapid decline in silhouette score values that ultimately leaves a few members with negative silhouette scores (indicating overlap with the other cluster).
# MAGIC
# MAGIC This solution may be useful for better understanding customer behavior relative to promotional offers. We'll persist our cluster assignments before examining other clustering techniques:
# COMMAND ----------
# DBTITLE 1,Persist Cluster Assignments
# persist household id and cluster assignment
(
  spark # bring together household and cluster ids
    .createDataFrame(
      pd.concat(
        [household_X_pd, pd.DataFrame(bestofk_clusters,columns=['bestofk_cluster'])],
        axis=1
        )[['household_id','bestofk_cluster']]
      )
    .write # write data to delta
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema','true')
    .save('/tmp/completejourney/gold/household_clusters/')
  )
# COMMAND ----------
# MAGIC %md ## Step 3: Hierarchical Clustering
# MAGIC
# MAGIC In addition to K-means, hierarchical clustering techniques are frequently used in customer segmentation exercises. With the agglomerative variants of these techniques, clusters are formed by linking the members closest to one another and then linking those clusters to form higher-level clusters until a single cluster encompassing all the members of the set is formed.
# MAGIC
# MAGIC Unlike K-means, the agglomerative process is deterministic so that repeated runs on the same dataset lead to the same clustering outcome. So while the hierarchical clustering techniques are frequently criticized for being slower than K-means, the overall processing time to arrive at a particular result may be lessened as no repeat executions of the algorithm are required to arrive at a *best-of* outcome.
# MAGIC
# MAGIC To get a better sense of how this technique works, let's train a hierarchical clustering solution and visualize its output:
# COMMAND ----------
# DBTITLE 1,Function to Plot Dendrogram
# modified from https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py
# function to generate dendrogram
def plot_dendrogram(model, **kwargs):
  # create the counts of samples under each node
  counts = np.zeros(model.children_.shape[0])
  n_samples = len(model.labels_)
  for i, merge in enumerate(model.children_):
    current_count = 0
    for child_idx in merge:
      if child_idx < n_samples:
        current_count += 1 # leaf node
      else:
        current_count += counts[child_idx - n_samples]
    counts[i] = current_count
  linkage_matrix = np.column_stack(
    [model.children_,
     model.distances_,
     counts]
    ).astype(float)
  # Plot the corresponding dendrogram
  j = 5
  set_link_color_palette(
    [matplotlib.colors.rgb2hex(cm.nipy_spectral(float(i) / j)) for i in range(j)]
    )
  dendrogram(linkage_matrix, **kwargs)
# COMMAND ----------
# DBTITLE 1,Train & Visualize Hierarchical Model
# train cluster model
inithc_model = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='ward')
inithc_model.fit(X)
# generate visualization
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(15, 8)
plot_dendrogram(inithc_model, truncate_mode='level', p=6) # 6 levels max
plt.title('Hierarchical Clustering Dendrogram')
_ = plt.xlabel('Number of points in node (or index of point if no parenthesis)')
# COMMAND ----------
# MAGIC %md The dendrogram is read from the bottom up. Each initial point represents a cluster consisting of some number of members. The entire process by which those members come together to form those specific clusters is not visualized (though you can adjust the *p* argument passed to the *plot_dendrogram* function to see further down into the process).
# MAGIC
# MAGIC As you move up the dendrogram, clusters converge to form new clusters. The vertical length traversed to reach that point of convergence tells us something about the distance between these clusters. The longer the length, the wider the gap between the converging clusters.
# MAGIC
# MAGIC The dendrogram gives us a sense of how the overall structure of the dataset comes together but it doesn't steer us towards a specific number of clusters for our ultimate clustering solution. For that, we need to revert to the plotting of a metric, such as silhouette scores, to identify the appropriate number of clusters for our solution.
# MAGIC
# MAGIC Before plotting silhouette scores against various numbers of clusters, it's important to examine the means by which clusters are combined to form new clusters. There are many algorithms (*linkages*) for this. The scikit-learn library supports four of them. These are:
# MAGIC <p>
# MAGIC * *ward* - link clusters such that the sum of squared distances within the newly formed clusters is minimized
# MAGIC * *average* - link clusters based on the average distance between all points in the clusters
# MAGIC * *single* - link clusters based on the minimum distance between any two points in the clusters
# MAGIC * *complete* - link clusters based on the maximum distance between any two points in the clusters
# MAGIC
# MAGIC Different linkage mechanisms can result in very different clustering outcomes. Ward's method (denoted by the *ward* linkage mechanism) is considered the go-to for most clustering exercises unless domain knowledge dictates the use of an alternative method:
# COMMAND ----------
# DBTITLE 1,Identify Number of Clusters
results = []
# train models with n number of clusters * linkages
for a in ['ward']: # linkages
  for n in range(2,21): # evaluate 2 to 20 clusters
    # fit the algorithm with n clusters
    model = AgglomerativeClustering(n_clusters=n, linkage=a)
    clusters = model.fit(X).labels_
    # capture the silhouette score for this value of n
    results += [ (n, a, silhouette_score(X, clusters)) ]
results_pd = pd.DataFrame(results, columns=['n', 'linkage', 'silhouette'])
display(results_pd)
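# COMMAND ----------
# MAGIC %md The four linkage definitions listed earlier in this step can be compared on a toy example. The cell below is a plain-Python sketch, assuming Euclidean distance; the helper name *linkage_distances_sketch* and the two small clusters are hypothetical. For *ward*, it reports the increase in the within-cluster sum of squares that a merge would cause, which is the quantity Ward's criterion seeks to keep small and is not necessarily the value scikit-learn stores in *distances_*:
# COMMAND ----------
# DBTITLE 1,Sketch: Comparing Linkage Criteria
# illustrative only: the merge cost between two clusters under each linkage (hypothetical helper)
def linkage_distances_sketch(cluster_a, cluster_b):
  def dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
  def sse(members):
    # sum of squared distances between members and their centroid
    centroid = [sum(dim) / len(members) for dim in zip(*members)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members)
  # distances between every cross-cluster pair of points
  pairwise = [dist(p, q) for p in cluster_a for q in cluster_b]
  return {
    'single': min(pairwise), # closest pair
    'complete': max(pairwise), # farthest pair
    'average': sum(pairwise) / len(pairwise), # mean over all pairs
    'ward_sse_increase': sse(cluster_a + cluster_b) - sse(cluster_a) - sse(cluster_b)
    }

# toy clusters (made up): two vertical pairs of points, 3 units apart horizontally
sketch_cluster_a = [(0.0, 0.0), (0.0, 2.0)]
sketch_cluster_b = [(3.0, 0.0), (3.0, 2.0)]
sketch_linkages = linkage_distances_sketch(sketch_cluster_a, sketch_cluster_b)
print(sketch_linkages)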
# COMMAND ----------
# MAGIC %md The results would indicate our best results may be found using 5 clusters:
# COMMAND ----------
# DBTITLE 1,Train & Evaluate Model
n_for_besthc = 5
linkage_for_besthc = 'ward'
# configure model
besthc_model = AgglomerativeClustering( n_clusters=n_for_besthc, linkage=linkage_for_besthc)
# train and predict clusters
besthc_clusters = besthc_model.fit(X).labels_
# score results
besthc_score = silhouette_score(X, besthc_clusters)
# print best score obtained
print('Silhouette Score: {0:.6f}'.format(besthc_score))
# combine households with cluster assignments
besthc_labeled_X_pd = (
  pd.concat(
    [X, pd.DataFrame(besthc_clusters,columns=['cluster'])],
    axis=1
    )
  )
# COMMAND ----------
# MAGIC %md Visualizing these clusters, we can see how groupings reside within the data structure. In our initial visualization of the features, we argued that there were two high-level clusters that stood out (and our K-means algorithm seemed to pick this up very well). Here, our hierarchical clustering algorithm seems to have picked up on the looser subclusters a bit better, though it also seems to have picked up on some loosely organized households for one very small cluster:
# COMMAND ----------
# DBTITLE 1,Visualize Clusters
# visualize cluster assignments
fig, ax = plt.subplots(figsize=(10,8))
sns.scatterplot(
  data=besthc_labeled_X_pd,
  x='Dim_1',
  y='Dim_2',
  hue='cluster',
  palette=[cm.nipy_spectral(float(i) / n_for_besthc) for i in range(n_for_besthc)], # align colors with those used in silhouette plots
  legend='brief',
  alpha=0.5,
  ax = ax
  )
_ = ax.legend(loc='lower right', ncol=1, fancybox=True)
# COMMAND ----------
# MAGIC %md Our per-instance silhouette scores show us we have a bit more overlap between clusters when examined at this level. One of the clusters has so few members it doesn't seem worth keeping it, especially when we review the 2-D visualization and see that these points seem to be highly intermixed with other clusters (at least when viewed from this perspective):
# COMMAND ----------
# DBTITLE 1,Examine Per-Member Silhouette Scores
_ = plot_silhouette_chart(X, besthc_clusters)
# COMMAND ----------
# MAGIC %md With that in mind, we'll retrain our model with a cluster count of 4 and then persist those results:
# COMMAND ----------
# DBTITLE 1,ReTrain & Evaluate Model
n_for_besthc = 4
linkage_for_besthc = 'ward'
# configure model
besthc_model = AgglomerativeClustering( n_clusters=n_for_besthc, linkage=linkage_for_besthc)
# train and predict clusters
besthc_clusters = besthc_model.fit(X).labels_
# score results
besthc_score = silhouette_score(X, besthc_clusters)
# print best score obtained
print('Silhouette Score: {0:.6f}'.format(besthc_score))
# combine households with cluster assignments
besthc_labeled_X_pd = (
  pd.concat(
    [X, pd.DataFrame(besthc_clusters,columns=['cluster'])],
    axis=1
    )
  )
# COMMAND ----------
# DBTITLE 1,Visualize Clusters
# visualize cluster assignments
fig, ax = plt.subplots(figsize=(10,8))
sns.scatterplot(
  data=besthc_labeled_X_pd,
  x='Dim_1',
  y='Dim_2',
  hue='cluster',
  palette=[cm.nipy_spectral(float(i) / n_for_besthc) for i in range(n_for_besthc)], # align colors with those used in silhouette plots
  legend='brief',
  alpha=0.5,
  ax = ax
  )
_ = ax.legend(loc='lower right', ncol=1, fancybox=True)
# COMMAND ----------
# DBTITLE 1,Examine Per-Member Silhouette Scores
_ = plot_silhouette_chart(X, besthc_clusters)
# COMMAND ----------
# DBTITLE 1,Add Field to Hold Hierarchical Cluster Assignment
# add column to previously created table to allow assignment of cluster ids
# try/except used here in case this statement is being rerun against a table with field already in place
try:
  spark.sql('ALTER TABLE DELTA.`/tmp/completejourney/gold/household_clusters/` ADD COLUMN (hc_cluster integer)')
except Exception:
  pass
# COMMAND ----------
# DBTITLE 1,Update Persisted Data to Hold Hierarchical Cluster Assignment
# assemble household IDs and new cluster IDs
updates = (
  spark
    .createDataFrame(
      pd.concat(
        [household_X_pd, pd.DataFrame(besthc_clusters,columns=['hc_cluster'])],
        axis=1
        )[['household_id','hc_cluster']]
      )
  )
# merge new cluster ID data with existing table
deltaTable = DeltaTable.forPath(spark, '/tmp/completejourney/gold/household_clusters/')
(
  deltaTable.alias('target')
    .merge(
      updates.alias('source'),
      'target.household_id=source.household_id'
      )
    .whenMatchedUpdate(set = { 'hc_cluster' : 'source.hc_cluster' } )
    .execute()
  )
# COMMAND ----------
# MAGIC %md ## Step 4: Other Techniques
# MAGIC
# MAGIC We have only begun to scratch the surface on the clustering techniques available to us. [K-Medoids](https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html), a variation of K-means which centers clusters on actual members in the dataset, allows for alternative methods (other than just Euclidean distance) of considering member similarities and may be more robust to noise and outliers in a dataset. [Density-Based Spatial Clustering of Applications with Noise (DBSCAN)](https://scikit-learn.org/stable/modules/clustering.html#dbscan) is another interesting clustering technique which identifies clusters in areas of high member density while ignoring dispersed members in lower-density regions. This would seem to be a good technique for this dataset but in our examination of DBSCAN (not shown), we had difficulty tuning the *epsilon* and *minimum sample count* parameters (that control how high-density regions are identified) to produce a high-quality clustering solution. And [Gaussian Mixture Models](https://scikit-learn.org/stable/modules/mixture.html#gaussian-mixture-models) offer still another approach popular in segmentation exercises which allows clusters with non-spherical shapes to be more easily formed.
# MAGIC
# MAGIC In addition to alternative algorithms, there is emerging work in the development of cluster ensemble models (aka *consensus clustering*). First introduced by [Monti *et al.*](https://link.springer.com/article/10.1023/A:1023949509487) for application in genomics research, consensus clustering has found popularity in a broad range of life science applications though there appears to be little adoption to date in the area of customer segmentation. Support for consensus clustering through the [OpenEnsembles](https://www.jmlr.org/papers/v19/18-100.html) and [kemlglearn](https://nbviewer.jupyter.org/github/bejar/URLNotebooks/blob/master/Notebooks/12ConsensusClustering.ipynb) packages is available in Python though much more robust support for consensus clustering can be found in R libraries such as [diceR](https://cran.r-project.org/web/packages/diceR/index.html). A limited exploration of these packages and libraries (not shown) produced mixed results though we suspect this has more to do with our own challenges with hyperparameter tuning and less to do with the algorithms themselves.