# Preliminaries

In [1]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import umap
import hdbscan
from sklearn.preprocessing import StandardScaler



# Data

In [2]:
data = pd.read_parquet('../data/marketing_sample_walmart.parq.gzip')

In [3]:
data.head()

Unnamed: 0,Uniq Id,Crawl Timestamp,Product Url,Product Name,Description,List Price,Sale Price,Brand,Item Number,Gtin,Package Size,Category,Postal Code,Available
0,51b010b871cde349bd32159a1cc1a15f,2020-01-24 16:08:36 +0000,https://www.walmart.com/ip/Allegiance-Economy-...,Allegiance Economy Dual-scale Digital Thermometer,We aim to show you accurate product informati...,11.11,11.11,Cardinal Health,,707389636164,,Health | Medicine Cabinet | Thermometers | Dig...,,True
1,d6a7f100e44a626a3701804e99236ad6,2020-01-24 15:54:21 +0000,https://www.walmart.com/ip/Kenneth-Cole-Reacti...,Kenneth Cole Reaction Eau De Parfum Spray For ...,We aim to show you accurate product informati...,23.99,23.99,Kenneth Cole,,191565696101,,Premium Beauty | Premium Fragrance | Premium P...,,True
2,99d2b7da7e3e427a942f864937dacd9d,2020-01-24 18:34:28 +0000,https://www.walmart.com/ip/Kid-Tough-Fitness-I...,Kid Tough Fitness Inflatable Free-Standing Pun...,We aim to show you accurate product informati...,30.76,30.76,BONK FIT,563852139.0,855523007070,,Sports & Outdoors | Outdoor Sports | Hunting |...,,True
3,4c76d170c2c6a759cbce812d790a0b88,2020-01-24 11:08:53 +0000,https://www.walmart.com/ip/THE-FIRST-YEARS/167...,THE FIRST YEARS,We aim to show you accurate product informati...,6.99,6.99,The First Years,553299941.0,71463046263,,Baby | Diapering | Baby Wipes,,True
4,8ac95837dc8baa01e504fd8f633ffaf2,2020-03-10 07:37:21 +0000,https://www.walmart.com/ip/4-Pack-MD-USA-Seaml...,4 Pack - MD USA Seamless Toe-Wave-In Mesh Diab...,We aim to show you accurate product informatio...,28.27,28.27,MD USA,,191897514500,,Health | Diabetes Care | Diabetic Socks,,True


In [4]:
data.shape

(30000, 14)

Many of these URLs are no longer valid (the crawl is two years old), so I'm going to treat the `Product Name` as the title that would have been retrieved from each URL's HTML. Otherwise, we would fetch the titles and/or the actual HTML content.
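If the pages were still reachable, pulling the title out of fetched HTML might look roughly like the sketch below. It uses only the standard library and works on an HTML string; the `TitleParser` class, the `extract_title` helper, and the sample markup are illustrative assumptions, not part of this notebook's pipeline.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()


def extract_title(html):
    """Return the page title, or None if the document has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


# Hypothetical page content standing in for a fetched product page
sample_html = "<html><head><title> Mikasa Rubber Basketball </title></head><body></body></html>"
page_title = extract_title(sample_html)
```

The extracted titles could then replace `Product Name` as the input to the embedding step.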

In [5]:
products = data['Product Name'].to_list()

In [6]:
products[:10]

['Allegiance Economy Dual-scale Digital Thermometer',
 'Kenneth Cole Reaction Eau De Parfum Spray For Women 3.40 Oz',
 'Kid Tough Fitness Inflatable Free-Standing Punching Bag + Machine Washable Fabric Cover South Carolina Gamecocks Kids Workout Buddy by Bonk Fit',
 'THE FIRST YEARS',
 '4 Pack - MD USA Seamless Toe-Wave-In Mesh Diabetic Crew Socks, Black, Medium, 1 Pair',
 'Gerber 2nd Foods Apple Baby Food 4 oz. Tubs 2 Count',
 'Kushies Ultra-Lite All-In-One Form-Fitted Washable Cloth Diapers (Blue Whales, Infant)',
 'sunmark Stop Smoking Aid 14 mg Strength Transdermal Patch, 70677003101 - Box of 14',
 'Berkley PowerBait Glitter Chroma-Glow Dough Fishing Bait',
 'Mikasa Rubber Basketball, Intermediate, 28.5']

# Embed Product Names

In [7]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(products, show_progress_bar=True)

Batches: 100%|███████████████████████████████████████████████████████████████████████████████| 938/938 [00:12<00:00, 76.52it/s]


In [8]:
embeddings.shape

(30000, 384)

In [9]:
pd.concat([
    pd.DataFrame({'product': products}),
    pd.DataFrame(embeddings, columns=[f'd_{i}' for i in range(embeddings.shape[1])]),
], axis=1).to_parquet('../data/out/product-embeddings.parq.gzip', compression='gzip')


# Dimensionality Reduction and Clustering

In [10]:
red = umap.UMAP(n_components=int(embeddings.shape[1]*.2), metric='cosine')
red_embed = red.fit_transform(embeddings)

In [11]:
sc = StandardScaler()
red_embed = sc.fit_transform(red_embed)

In [12]:
clust = hdbscan.HDBSCAN(min_cluster_size=10, cluster_selection_epsilon=.25)
clust.fit(red_embed)

In [13]:
res = pd.DataFrame({
    'product': products,
    'cluster': clust.labels_
})

In [14]:
res.groupby('cluster').count().sort_values('product', ascending=False)[:20]

cluster,product
-1,9309
468,263
134,247
9,224
252,214
100,188
503,185
480,182
254,168
504,160


In [15]:
pd.concat([
    res,
    pd.DataFrame(red_embed, columns=[f'red_{i}' for i in range(red_embed.shape[1])])
], axis=1).to_parquet('../data/out/product-embeddings-reduced.parq.gzip', compression='gzip')

# Explore an Example Cluster

In [16]:
mask = res['cluster'] == 165
res[mask]

Unnamed: 0,product,cluster
105,Hub Front Mx 36h 105g Chrome. Bike wheel part ...,165
166,"Sledge Hub, New, Case IH, 87398962, New Hollan...",165
209,Origin8 Front Hub Mountain 6 Bolt Disc 36H Silver,165
229,Wheel Masters MT - 5000 MTB Hubs - KT - A12R,165
233,Wheel Master MT-5000 Front Hub Ft Wm Mt5000 Bo...,165
...,...,...
27777,Black Ops MX-1100 Rear Flip-Flop Hub Rr Bk-ops...,165
28222,ACHR73-7050M Aluminum Wheel Hub Centric Rings ...,165
28849,Precision 513033 Hub Assembly,165
29175,Integy RC Toy Model Hop-ups C26208BLUE Billet ...,165


A list makes it easier to read the full product names:

In [17]:
[p for p in res.loc[mask, 'product']]

['Hub Front Mx 36h 105g Chrome. Bike wheel part bicycle hub, bike hub, lowrider, beach cruiser, chopper, mountain limo, stretch',
 'Sledge Hub, New, Case IH, 87398962, New Holland, 87030468',
 'Origin8 Front Hub Mountain 6 Bolt Disc 36H Silver',
 'Wheel Masters MT - 5000 MTB Hubs - KT - A12R',
 'Wheel Master MT-5000 Front Hub Ft Wm Mt5000 Bo Sf 36x3/8 Sl Parralax',
 'Pyramid Hub 10 Holes 6mm RPHUBE085X278F',
 'HUB RR BK-OPS MX1100 BO LF 1s 36x3/8 SB BU',
 'RACO CRG050 Conduit Body Gasket,1/2" Hub Size,Rubber',
 'Lowrider Bicycle COTTERLESS AXLE with Bolt 126. Bike Part, Bicycle Part, Bike Accessory, Bicycle Accessory',
 'Integy RC Toy Model Hop-ups SAK-D4818/BK Aluminum Rear Hub Carrier For SAKURA D4',
 'WHEEL MASTER WHL FT 27.5 584x19 WEI 519 BK MSW 36 WM MT3000 6B BO 3/8 BK 100mm 14gBK',
 'Cyclists Choice D851Se Front Q/R Disc Hub 36H Loose Ball Black',
 'Wheelset 24X1.75 Alloy Silver 36 Trike 15mm w/Bearings',
 'WHEEL MASTER 26" Alloy Mountain Single Wall',
 'Wheel Master AQ-1010 Al

# Find Cluster Centroids

These can be used to map new data points to the appropriate cluster and topic.

In [18]:
# Drop the text column first so that mean() only sees numeric dimensions
centroids = pd.concat([res.drop(columns='product'), pd.DataFrame(red_embed)], axis=1).groupby('cluster').mean()



In [19]:
centroids.head()

cluster,0,1,2,3,4,5,6,7,8,9,...,66,67,68,69,70,71,72,73,74,75
-1,0.106051,0.107311,0.043348,0.051499,-0.078944,0.00505,-0.041569,0.008215,0.013175,-0.026252,...,0.108624,-0.077639,0.010537,-0.044047,-0.017574,-0.120348,0.039975,-0.059154,0.04105,0.110776
0,0.420124,0.144985,-0.028709,-0.69027,0.480761,0.90355,0.33721,-3.64937,-9.346144,40.813873,...,-3.42579,2.846882,5.574514,-2.22837,-0.051961,2.078664,0.357924,-0.1159,-3.534917,0.103758
1,-0.847952,-0.768998,0.120298,-0.463403,-1.327125,-44.16407,2.150733,-0.657084,-0.254208,0.015604,...,-1.098996,2.789847,-0.615899,1.422473,-1.945414,1.993151,0.572707,0.356856,-0.012678,1.227436
2,-0.509961,-0.721823,0.384099,-0.181862,-0.401151,0.323895,-0.409206,-0.098629,-0.848808,-0.074759,...,3.416388,3.417498,-0.701673,-0.852953,-0.427358,-0.058166,0.224601,-0.702159,-0.001691,-0.059847
3,-0.441115,-0.861851,0.577421,-0.036988,-0.845486,0.36788,-0.70289,0.167336,-2.747917,-0.1167,...,-2.135881,-0.397104,0.161348,0.312923,0.341742,0.498299,0.414945,0.58111,1.534165,-0.002215


In [20]:
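import pickle

# Note: to_dict('list') keys the dict by dimension and drops the cluster index;
# list position i corresponds to the i-th row of `centroids` (clusters sorted ascending, starting at -1).
with open('../data/out/centroid-embeddings.p', 'wb') as fp:
    pickle.dump(centroids.to_dict('list'), fp)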
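As a rough illustration of how centroids could map a new observation to a cluster, here is a minimal nearest-centroid sketch on toy data. The `nearest_centroid` helper and the toy arrays are assumptions for illustration. A real new product name would first have to go through the same embedding, UMAP, and scaling steps as the training data, and the noise centroid (`-1`) would probably be excluded.

```python
import numpy as np


def nearest_centroid(points, centroid_matrix, centroid_labels):
    """Return, for each point, the label of the closest centroid (Euclidean)."""
    points = np.atleast_2d(points)
    # Pairwise distances, shape (n_points, n_centroids)
    diffs = points[:, None, :] - centroid_matrix[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return centroid_labels[dists.argmin(axis=1)]


# Toy stand-ins for `centroids.values` and `centroids.index`
toy_centroids = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
toy_labels = np.array([0, 1, 2])

# Toy stand-ins for already-reduced, already-scaled new observations
new_points = np.array([[0.2, -0.1], [4.0, 6.0]])
assigned = nearest_centroid(new_points, toy_centroids, toy_labels)
```

Density-based clusters are not necessarily compact around their mean, so a nearest-centroid rule is only an approximation of HDBSCAN's own notion of membership.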

# Find Similar Clusters

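Here we could merge clusters whose centroids lie within some distance threshold of each other. Note that, despite the variable name `similarity`, the matrix below holds pairwise distances, so smaller values mean more similar clusters.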

In [21]:
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

In [22]:
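# pairwise_distances defaults to Euclidean distance, so smaller values mean closer centroids.
# np.triu zeroes the lower triangle so that each pair of clusters appears only once.
similarity = pd.DataFrame(
    np.triu(
        pairwise_distances(centroids, centroids)
    ),
    index=centroids.index,
    columns=centroids.index
)
similarity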

cluster,-1,0,1,2,3,4,5,6,7,8,...,495,496,497,498,499,500,501,502,503,504
-1,0.0,47.018692,45.061020,45.878841,38.986359,29.289154,19.118876,15.033669,16.776175,14.947124,...,10.632122,9.617723,4.906797,5.205814,4.821348,6.825796,7.595772,8.474318,6.204001,6.171934
0,0.0,0.000000,65.851013,66.228813,61.072876,55.171204,54.020645,49.487289,49.760963,49.091125,...,48.656487,48.408516,47.252991,47.253914,47.233345,47.290234,47.525803,47.637516,47.342278,47.368412
1,0.0,0.000000,0.000000,64.923355,59.623203,53.451691,48.861607,47.648586,49.818737,47.433201,...,46.148251,45.915802,45.236015,45.255272,45.214890,45.386528,45.751770,45.895618,45.438686,45.438587
2,0.0,0.000000,0.000000,0.000000,60.357738,54.649384,48.678108,47.450127,49.094311,48.489635,...,47.417397,47.166229,46.138180,46.115257,46.083237,46.185715,46.383156,46.635914,46.214706,46.214279
3,0.0,0.000000,0.000000,0.000000,0.000000,48.788837,43.112762,41.754436,41.920837,41.663597,...,40.475346,40.195820,39.430695,39.367855,39.349255,39.590607,39.667458,39.875336,39.481728,39.493618
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
500,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,4.039701,4.733706,2.621529,2.748765
501,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.371987,2.049970,2.762351
502,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.845912,3.362164
503,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.435003


In [23]:
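# Keep pairs with 0 < distance < 2 (the diagonal and the zeroed lower triangle are excluded).
# np.argwhere returns row/column positions, not cluster labels; map back via centroids.index.
z = (similarity - np.eye(similarity.shape[0])).values
similar_clusters = np.argwhere((0 < z) & (z < 2))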

In [24]:
len(similar_clusters)

307

In [25]:
similar_clusters[:10, :]

array([[19, 20],
       [33, 36],
       [33, 37],
       [36, 37],
       [36, 38],
       [37, 38],
       [37, 39],
       [38, 39],
       [44, 45],
       [44, 46]])
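The pairs above are row/column positions in the distance matrix. Because the index starts at the noise label `-1`, position `i` corresponds to label `centroids.index[i]`, so position 19 would be cluster 18 rather than cluster 19. A minimal sketch of mapping positions back to labels and then merging close clusters with a small union-find is shown below. The helper names and the toy matrix are assumptions for illustration, and transitive merging is just one possible policy.

```python
import numpy as np


def close_label_pairs(dist, labels, threshold):
    """Return (label_a, label_b) pairs whose distance lies in (0, threshold)."""
    upper = np.triu(dist, k=1)  # keep each unordered pair once, zero the diagonal
    rows, cols = np.nonzero((upper > 0) & (upper < threshold))
    return [(int(labels[r]), int(labels[c])) for r, c in zip(rows, cols)]


def merge_groups(pairs):
    """Group labels that are connected through any chain of close pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for label in list(parent):
        groups.setdefault(find(label), set()).add(label)
    return sorted(sorted(g) for g in groups.values())


# Toy stand-ins for `centroids.index` and the centroid distance matrix
toy_labels = np.array([-1, 0, 1, 2])
toy_dist = np.array([
    [0.0, 9.0, 9.0, 9.0],
    [9.0, 0.0, 1.0, 9.0],
    [9.0, 1.0, 0.0, 1.5],
    [9.0, 9.0, 1.5, 0.0],
])

pairs = close_label_pairs(toy_dist, toy_labels, threshold=2.0)
groups = merge_groups(pairs)
```

In the toy data, clusters 0 and 2 end up in the same group only because each is close to cluster 1. Chaining of this kind may or may not be desirable for topic merging.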

In [26]:
mask = res['cluster'] == 17
[x for x in res.loc[mask, 'product']]

['Second Generation FIT At Home Colon Cancer Test 1 Pack',
 'Wondfo Combo 40 Ovulation and 10 Pregnancy Urine Test Strips',
 '2 Pack Quality Choice PREGNANCY TEST KIT $7.99PP 1 Count Each',
 '3 Pack - FIRST RESPONSE Gold Digital Early Result Pregnancy Tests 2 Each',
 'AfterPill Emergency Contraceptive Triple Pack',
 'Easy@Home 50 Ovulation Test Strips Kit - the Reliable Ovulation Predictor Kit (50 LH Test)',
 'Digital Ovulation Test to Accurately Track 2 Key Fertility Hormones (10 Test)',
 'Accu-Clear Early Pregnancy Test Sticks 2 Each (Pack of 4)',
 'Clearblue Easy Pregnancy Earliest Results Test - 2 Ea, 2 Pack',
 'Accu-Clear Early Pregnancy Test Sticks 2 Each (Pack of 3)',
 'Clearblue Digital and Plus Pregnancy Test, 2 Ea',
 'Easy@Home 5 Pregnancy Test Sticks - hCG Midstream Tests, Powered by Premom Ovulation Predictor iOS and Android App, 5 hCG Test',
 'Fugacal 5pcs HCG Early Pregnancy Testing Pen Adult Female Pregnant Rapid Test Tool , Fast Pregnancy Test , Pregnancy Testing Pen',


In [27]:
mask = res['cluster'] == 18
[x for x in res.loc[mask, 'product']]

['YLSHRF Portable Breath LCD Digital Display Alcohol Tester/Analyzer with Backlight ,Alcohol Tester, Backlight Alcohol Analyzer',
 'OralScreen 5 Panel Oral Saliva Drug Test (15 Pack)',
 'PrimeScreen Dip Card Drug Test for Methamphetamines (5 pack)',
 '(4 Pack) AllSource MD Cocaine Dip Test',
 'At Home 6-Panel Drug Test',
 '(15 Pack) 6 Panel Urine Drug Test Dip Tamper Proof - Amp (Amphetamine), Bzo (Benzodiazepine), Coc (Cocaine), Mamp (Methamphetamine, Opi (Opiates), Thc (Marijuana)',
 'At Home Drug Test Marijuana - Each',
 '(50 PACK) QTEST 18 Panel Drug Test Cup - INCLUDES KRATOM TEST Most complete cup available on the market . AMP/BAR/BUP/BZO/COC/MET/MDMA/MTD/OPI/OXY/PCP/TCA/THC/ETG (Alcohol) /K2/KRA /TRA',
 '(10 pack) Oxycontin (OXY) Single Panel Drug Test Dip Card',
 '(1 pack) PCP Phencyclidine Surface Drug Detection Kit with Mobile APP for easy results and reports',
 'Policeman Breathalyzer Digital Alcohol Breath Tester Breathalyzer 5 Disposable Mouth',
 'Drug test, 12 Panel Drug 

# Find Most Common Words in Cluster

These would be topics. We're doing a simple frequency analysis (rather than TF-IDF) because we expect the documents in a cluster to be similar; we aren't interested in words that distinguish them from one another, but in words that are common within the cluster.

Intuitively, these results make sense.

In [28]:
from collections import Counter
import re

In [29]:
mask = res['cluster'] == 17
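# NOTE: ''.join adds no separator, so the last word of one name fuses with the first word
# of the next; ' '.join would avoid this. The counts below were produced with ''.join.
bow = ''.join([p for p in res.loc[mask, 'product']]).split(' ')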
c = Counter(bow)
c.most_common()[:5]

[('Test', 31), ('Pregnancy', 31), ('Ovulation', 17), ('and', 12), ('-', 11)]

In [30]:
mask = res['cluster'] == 17
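# NOTE: names are joined without a separator here, so words at name boundaries fuse into one token.
bow = re.findall(r'\w+', ''.join([p for p in res.loc[mask, 'product']]))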
c = Counter(bow)
c.most_common()[:5]

[('Test', 37),
 ('Pregnancy', 31),
 ('Ovulation', 17),
 ('Pack', 13),
 ('Strips', 13)]

In [31]:
mask = res['cluster'] == 18
bow = ' '.join([p for p in res.loc[mask, 'product']]).split(' ')
c = Counter(bow)
c.most_common()[:5]

[('Drug', 32), ('Test', 29), ('Panel', 17), ('Dip', 13), ('pack)', 11)]

In [32]:
mask = res['cluster'] == 18
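# NOTE: names are joined without a separator here, so words at name boundaries fuse into one token.
bow = re.findall(r'\w+', ''.join([p for p in res.loc[mask, 'product']]))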
c = Counter(bow)
c.most_common()[:5]

[('Test', 32), ('Drug', 31), ('Panel', 18), ('5', 15), ('Dip', 13)]
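The counts above are sensitive to casing, punctuation, and filler words such as `and` or bare numbers. A slightly more careful variant, sketched under assumptions, joins names with a space, lowercases them, keeps alphabetic tokens only, and drops a small stop-word list. The `STOP_WORDS` set, the `top_words` helper, and the three sample names (taken from the cluster 18 listing) are illustrative choices, not something tuned for this data.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only
STOP_WORDS = {"and", "the", "of", "for", "with", "to", "a", "in", "at"}


def top_words(names, n=5):
    """Return the n most common lowercase alphabetic tokens across the names."""
    # Join with a space so words at product-name boundaries do not fuse
    text = " ".join(names).lower()
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOP_WORDS]
    return Counter(tokens).most_common(n)


sample = [
    "At Home Drug Test Marijuana - Each",
    "At Home 6-Panel Drug Test",
    "(4 Pack) AllSource MD Cocaine Dip Test",
]
top = top_words(sample, n=1)
```

Dropping digits is a judgment call: it removes noise like `5`, but it would also discard tokens such as `12` in "12 Panel" if those turn out to be meaningful.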

## Evaluating Results

I'm not sure that a similarity/distance metric between cluster centroids will produce the desired results for measuring cluster similarity. HDBSCAN builds a hierarchy and merges clusters under a given distance threshold. Perhaps we should embed the cluster topics so that we can better search for semantic similarity between topics?

The list of "similar" clusters (both cosine and Euclidean) is pretty off.

Alternatively, HDBSCAN can export its condensed tree to a NetworkX graph (`condensed_tree_.to_networkx()`), so we could extract parent-child relationships amongst clusters.

The `hdbscan` library provides an `approximate_predict` function (for a model fit with `prediction_data=True`, which the model above was not), so we don't really need to map distance/similarity to centroids manually. We really only care about which cluster an observation might belong to, and then which topic that cluster maps to.