# Preliminaries

In [1]:
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import umap
import hdbscan
from sklearn.preprocessing import StandardScaler



# Data

In [2]:
data = pd.read_parquet('../data/marketing_sample_walmart.parq.gzip')

In [3]:
data.head()

Unnamed: 0,Uniq Id,Crawl Timestamp,Product Url,Product Name,Description,List Price,Sale Price,Brand,Item Number,Gtin,Package Size,Category,Postal Code,Available
0,51b010b871cde349bd32159a1cc1a15f,2020-01-24 16:08:36 +0000,https://www.walmart.com/ip/Allegiance-Economy-...,Allegiance Economy Dual-scale Digital Thermometer,We aim to show you accurate product informati...,11.11,11.11,Cardinal Health,,707389636164,,Health | Medicine Cabinet | Thermometers | Dig...,,True
1,d6a7f100e44a626a3701804e99236ad6,2020-01-24 15:54:21 +0000,https://www.walmart.com/ip/Kenneth-Cole-Reacti...,Kenneth Cole Reaction Eau De Parfum Spray For ...,We aim to show you accurate product informati...,23.99,23.99,Kenneth Cole,,191565696101,,Premium Beauty | Premium Fragrance | Premium P...,,True
2,99d2b7da7e3e427a942f864937dacd9d,2020-01-24 18:34:28 +0000,https://www.walmart.com/ip/Kid-Tough-Fitness-I...,Kid Tough Fitness Inflatable Free-Standing Pun...,We aim to show you accurate product informati...,30.76,30.76,BONK FIT,563852139.0,855523007070,,Sports & Outdoors | Outdoor Sports | Hunting |...,,True
3,4c76d170c2c6a759cbce812d790a0b88,2020-01-24 11:08:53 +0000,https://www.walmart.com/ip/THE-FIRST-YEARS/167...,THE FIRST YEARS,We aim to show you accurate product informati...,6.99,6.99,The First Years,553299941.0,71463046263,,Baby | Diapering | Baby Wipes,,True
4,8ac95837dc8baa01e504fd8f633ffaf2,2020-03-10 07:37:21 +0000,https://www.walmart.com/ip/4-Pack-MD-USA-Seaml...,4 Pack - MD USA Seamless Toe-Wave-In Mesh Diab...,We aim to show you accurate product informatio...,28.27,28.27,MD USA,,191897514500,,Health | Diabetes Care | Diabetic Socks,,True


In [4]:
data.shape

(30000, 14)

Many of these URLs are no longer valid (the crawl is two years old), so I'm going to treat the `Product Name` as a stand-in for the title that would have been retrieved from each URL's HTML. With live URLs, we would instead fetch the titles and/or the actual HTML content.
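If the pages were still reachable, pulling the title out of the fetched HTML might look roughly like the sketch below. It uses only the standard library; `TitleParser`, `extract_title`, and `sample_html` are hypothetical names introduced for illustration, and the HTML string stands in for a real response body.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


# Hypothetical page content standing in for a fetched product page.
sample_html = "<html><head><title> Example Product Name </title></head><body></body></html>"
page_title = extract_title(sample_html)
```

The resulting titles could then be embedded in place of `products`.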

In [5]:
products = data['Product Name'].to_list()

In [6]:
products[:10]

['Allegiance Economy Dual-scale Digital Thermometer',
 'Kenneth Cole Reaction Eau De Parfum Spray For Women 3.40 Oz',
 'Kid Tough Fitness Inflatable Free-Standing Punching Bag + Machine Washable Fabric Cover South Carolina Gamecocks Kids Workout Buddy by Bonk Fit',
 'THE FIRST YEARS',
 '4 Pack - MD USA Seamless Toe-Wave-In Mesh Diabetic Crew Socks, Black, Medium, 1 Pair',
 'Gerber 2nd Foods Apple Baby Food 4 oz. Tubs 2 Count',
 'Kushies Ultra-Lite All-In-One Form-Fitted Washable Cloth Diapers (Blue Whales, Infant)',
 'sunmark Stop Smoking Aid 14 mg Strength Transdermal Patch, 70677003101 - Box of 14',
 'Berkley PowerBait Glitter Chroma-Glow Dough Fishing Bait',
 'Mikasa Rubber Basketball, Intermediate, 28.5']

# Embed Product Names

In [7]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(products, show_progress_bar=True)

Batches: 100%|███████████████████████████████████████████████████████████████████████████████| 938/938 [02:50<00:00,  5.51it/s]


In [8]:
embeddings.shape

(30000, 384)

In [20]:
pd.concat([
    pd.DataFrame({'product': products}),
    pd.DataFrame(embeddings, columns=[f'd_{i}' for i in range(embeddings.shape[1])]),
], axis=1).to_parquet('../data/out/product-embeddings.parq.gzip', compression='gzip')


# Dimensionality Reduction and Clustering

In [12]:
red = umap.UMAP(n_components=int(embeddings.shape[1]*.2), metric='cosine')
red_embed = red.fit_transform(embeddings)

In [13]:
sc = StandardScaler()
red_embed = sc.fit_transform(red_embed)

In [109]:
clust = hdbscan.HDBSCAN(min_cluster_size=10, cluster_selection_epsilon=.25)
clust.fit(red_embed)

In [110]:
res = pd.DataFrame({
    'product': products,
    'cluster': clust.labels_
})

In [115]:
res.groupby('cluster').count().sort_values('product', ascending=False)[:20]

cluster,product
-1,9507
281,361
467,252
95,246
6,221
353,192
62,182
128,182
464,166
453,164


In [22]:
pd.concat([
    res,
    pd.DataFrame(red_embed, columns=[f'red_{i}' for i in range(red_embed.shape[1])])
], axis=1).to_parquet('../data/out/product-embeddings-reduced.parq.gzip', compression='gzip')

# Explore an Example Cluster

In [116]:
mask = res['cluster'] == 165
res[mask]

Unnamed: 0,product,cluster
220,New Hunting Optics Rifle Scope Gun Scope 6-24x...,165
591,gamo 632270254 blue flame .177 cal qty of 100 ...,165
765,"Athlon Optics Argos BTR Riflescope 6-24x50mm, ...",165
846,Hawke Vantage Rifle Scope 4x32 Airsoft,165
855,2-Pack 45 Degree Angle Offset Picatinny Weaver...,165
...,...,...
29358,"AR500 Steel Target 12"" x 3/8"" Target Gong Tact...",165
29405,PHYSOSTIGMA VENENOSUM 30C - 400 Pellets (4dm),165
29408,Nikon Prostaff P3 4-12x40 Riflescope w/ Nikopl...,165
29520,Vortex 1-Inch Medium Riflescope Rings (Set of 2),165


A list makes it easier to read the full product names.

In [117]:
[p for p in res.loc[mask, 'product']]

['New Hunting Optics Rifle Scope Gun Scope 6-24x50 AOE Red and Green Illuminated Scope with Free Mount',
 'gamo 632270254 blue flame .177 cal qty of 100 - blister',
 'Athlon Optics Argos BTR Riflescope 6-24x50mm, 30mm Main Tube, ATMR FFP IR MOA, Glass Etched Reticle, Black',
 'Hawke Vantage Rifle Scope 4x32 Airsoft',
 '2-Pack 45 Degree Angle Offset Picatinny Weaver Tactical Accessory Rail Mount',
 '30mm w 1 inch Black Scope mount Dual Rings Flat Top Rifle Scope Picatiny Rail tactical',
 'UTG Accu-Sync Offset Picatinny Rings',
 'Remington Express Air Rifle, .22 cal, black',
 'FIREFIELD IMPACT XL REFLEX SIGHT',
 'Tippmann Straight Shot Squeegee - 18 Overall / Fits up to 14" Barrel',
 'Pair of Tactical 30mm Low Profile 6 Bolt Scope Mount Ring Fits 20mm Weaver Picatinny Rail',
 'Crosman Quest SBD .22 Caliber NP2 Break Barrel Air Rifle with Scope, 1100fps',
 'Benjamin Marauder BP2564S PCP Air Rifle .25 Cal with All-Weather Stock',
 '15 RD Shotgun 20 Gauge Bandolier Ammo Shell Tactical Hunti

# Find Cluster Centroids

These can be used to map new data points to the appropriate cluster and topic.

In [118]:
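# Keep only the numeric columns: averaging the string `product` column triggers a warning (an error in newer pandas)
centroids = pd.concat([res[['cluster']], pd.DataFrame(red_embed)], axis=1).groupby('cluster').mean()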



In [119]:
centroids.head()

cluster,0,1,2,3,4,5,6,7,8,9,...,66,67,68,69,70,71,72,73,74,75
-1,0.147585,0.139953,0.023633,-0.033596,0.056738,-0.037523,0.044599,0.044808,-0.014327,-0.039666,...,0.051258,-0.074684,-0.031309,0.051296,-0.139123,-0.090236,-0.052674,-0.133452,0.002834,-0.033352
0,0.381828,0.299048,-0.170124,-0.085946,-0.836977,0.337676,-1.613332,0.888362,0.223226,-0.462619,...,3.485204,-9.042322,0.277707,2.532179,3.299984,1.438267,1.254455,2.129905,-5.614459,-1.633856
1,-0.327185,-0.764371,0.509366,-0.955349,0.588277,-3.822357,-2.646991,7.25293,-32.730846,7.072433,...,2.08562,-1.21321,-1.060413,1.537021,1.271868,-0.452252,1.034945,0.223638,0.197163,-0.827515
2,-0.732102,-0.769812,0.312072,-0.183727,0.510707,-0.151482,0.838043,-0.075394,-0.217315,-0.086638,...,0.260863,-0.340385,-0.213838,0.428249,-0.205076,-0.593312,0.094182,0.601125,1.012561,0.682444
3,-0.859066,0.421199,-10.787209,0.356419,-0.095561,0.078927,-0.674393,0.299778,-0.589694,-0.75083,...,1.267379,-0.735662,0.204499,-0.246036,0.086134,0.934523,1.021445,0.723988,0.371036,-0.544935


In [49]:
import pickle

with open('../data/out/centroid-embeddings.p', 'wb') as fp:
    pickle.dump(centroids.to_dict('list'), fp)
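As a rough illustration of how saved centroids might be used to place a new point, here is a minimal nearest-centroid sketch. It assumes the centroids are available as a dict keyed by cluster label (for example via `centroids.T.to_dict('list')`, which is not what the cell above writes), and that the new point has already been passed through the same fitted UMAP reducer and scaler (`red.transform`, then `sc.transform`). `nearest_centroid` and `toy_centroids` are hypothetical names.

```python
import math


def nearest_centroid(point, centroids, noise_label=-1):
    """Return the label of the closest centroid (Euclidean), ignoring noise."""
    candidates = {
        label: vec for label, vec in centroids.items() if label != noise_label
    }
    return min(candidates, key=lambda label: math.dist(point, candidates[label]))


# Hypothetical 2-D centroids; real ones have one value per reduced dimension.
toy_centroids = {-1: [0.0, 0.0], 0: [1.0, 1.0], 1: [-2.0, 3.0]}
assigned = nearest_centroid([0.1, 0.2], toy_centroids)
```

The noise "centroid" (label -1) is skipped because it is the mean of unclustered points rather than a real cluster center.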

# Find Similar Clusters

Here we could merge clusters whose centroids are closer than some distance threshold (equivalently, whose similarity is >= some threshold).

In [128]:
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

In [135]:
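# Despite the name, this holds pairwise Euclidean distances between centroids
# (smaller = more similar); np.triu zeroes the lower triangle so each pair appears once.
similarity = pd.DataFrame(
    np.triu(
        pairwise_distances(centroids, centroids)
    ),
    index=centroids.index,
    columns=centroids.index
)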
similarity

cluster,-1,0,1,2,3,4,5,6,7,8,...,497,498,499,500,501,502,503,504,505,506
-1,0.0,36.93121,36.355030,28.705519,23.918032,17.919706,18.981153,11.048027,12.535349,16.007780,...,5.151122,4.589614,6.387964,8.176804,8.921467,5.343007,6.444138,5.247172,6.162733,6.638168
0,0.0,0.00000,53.190506,45.399200,43.412659,40.287041,40.809830,38.204563,37.465919,38.967403,...,37.279423,37.106831,38.031208,38.724789,38.818062,37.142166,38.049603,37.502785,38.000256,37.959042
1,0.0,0.00000,0.000000,45.909714,43.218044,41.027225,41.704586,38.426304,36.632164,41.280014,...,36.526176,36.469257,36.946331,37.600452,37.787182,36.784290,37.152824,36.890957,36.944038,37.022041
2,0.0,0.00000,0.000000,0.000000,38.299179,33.378529,34.640522,30.562750,31.759216,30.705700,...,29.392281,29.233475,30.279131,30.738728,30.961651,29.443916,30.138550,29.634518,30.102627,30.196043
3,0.0,0.00000,0.000000,0.000000,0.000000,29.430508,30.693285,25.380564,27.131254,29.074230,...,24.778847,24.614803,24.768507,25.356920,25.589378,24.501158,24.757166,24.455828,24.782135,24.936621
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
502,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,3.770355,1.842914,3.703002,4.540872
503,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.100099,1.804721,2.433176
504,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.487713,3.437826
505,0.0,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.659749


In [144]:
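z = (similarity - np.eye(similarity.shape[0])).values
# argwhere returns row/column positions, not labels: the index starts at -1, so position i is cluster i - 1
similar_clusters = np.argwhere((0 < z) & (z < 2))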

In [145]:
len(similar_clusters)

306

In [146]:
similar_clusters[:10, :]

array([[17, 18],
       [17, 20],
       [17, 22],
       [18, 20],
       [20, 22],
       [21, 22],
       [32, 62],
       [32, 66],
       [33, 34],
       [40, 41]])
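Two follow-up steps could be sketched from these pairs: translating the positions returned by `np.argwhere` back to cluster labels, and grouping the resulting pairs into merge candidates. The sketch below uses a small union-find; `group_pairs`, `index_labels`, and `merge_groups` are hypothetical names, and it assumes the centroid rows are ordered -1, 0, 1, ..., 506 as in the table above.

```python
def group_pairs(pairs):
    """Union-find: merge pairs of labels into connected groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Positional pairs copied from the first ten rows of the output above.
positional_pairs = [(17, 18), (17, 20), (17, 22), (18, 20), (20, 22),
                    (21, 22), (32, 62), (32, 66), (33, 34), (40, 41)]

# Assumption: position i in the matrix corresponds to cluster label i - 1.
index_labels = list(range(-1, 507))
label_pairs = [(index_labels[i], index_labels[j]) for i, j in positional_pairs]
merge_groups = group_pairs(label_pairs)
```

Grouping this way chains pairs together (A near B and B near C puts A and C in one group), so a group can contain clusters that are not themselves within the threshold of each other.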

In [147]:
mask = res['cluster'] == 17
[x for x in res.loc[mask, 'product']]

['MiCare [15pk] - 1-Panel Dip Card Instant Urine Drug Test - Meth/Methamphetamine (mAMP/MET) #MI-WDMA-114',
 'DISCOVER 1 PANEL DIP CARD - (mAMP)',
 'MiCare - [15 Pack] Single Panel Dip Card Instant Urine Drug Test - Cocaine (COC) (MI-WDCO-114)',
 'MiCare [5pk] - 5-Panel Dip Card Instant Urine Drug Test - (COC/mAMP/OPI/OXY/THC) #MI-WDOA-354',
 'MiCare [2pk] - 10-Panel Dip Card Instant Urine Drug Test - (AMP/BAR/BZO/COC/mAMP/MTD/OPI/PCP/TCA/THC) #MI-WDOA-1104',
 'MiCare [10pk] - 1-Panel Dip Card Instant Urine Drug Test - Barbiturates (BAR) #MI-WDBA-114',
 'REVEAL 10 PANEL CUP w/ads - (THC/COC/AMP/OPI/mAMP/PCP/BAR/BZO/MTD/MDMA) (PH/SG/OX)',
 'MiCare [75pk] - 12-Panel Dip Card Instant Urine Drug Test - (AMP/BAR/BZO/COC/mAMP/MDMA/MTD/OPI/OXY/PCP/PPX/THC) #MI-WDOA-1124',
 'MiCare [2pk] - 10-Panel T-Cup Instant Urine Drug Test - (AMP/BAR/BZO/COC/mAMP/MDMA/MTD/OPI/PCP/THC) #MI-TDOA-3104',
 'MiCare [10pk] - 1-Panel Dip Card Instant Urine Drug Test - Marijuana/Cannabinoids (THC) #MI-WDTH-114',
 

In [148]:
mask = res['cluster'] == 18
[x for x in res.loc[mask, 'product']]

['Upgraded 2018 Food Dehydrator Jerky Maker - Electric Multi-Tier Food Preserver, Meat or Beef J Fruit & Vegetable Dryer with 6 Stackable Trays, High-Heat Circulation - (PKFD16)',
 'Elgi Ultra Dura+ Table Top Wet Grinder, 1.25 Liter',
 'TECHTONGDA Commercial Electric Meat Cutting Machine Slicer Cutter Steak Beef Pork Cuts QE 5mm Blade',
 'Della 1200W 10-Tray Food Dehydrator Nut Durable Meat Fruit Sausage Jerky Dryer, Stainless Steel',
 'Zimtown 7.5" 150W Electric Semi-automatic Belt Meat Vegetable Cheese Bread Blade Stainless Steel Slicer Machine',
 'Olde Thompson Wild West Steak Grinder',
 'HERCHR Professional Meat Slicer Machine Electric Portable Hand-held Meat-cutting Cutter Tools Commercial Kebab Wheel Blade Disc Cutter Kebab Knife Sliced Meat Gyros Knife',
 'Gideon Hand Crank Manual Meat Grinder with Powerful Suction Base',
 'Nesco Professional 600W 5-Tray Food Dehydrator, FD-75PR',
 'Electric Food Dehydrator Machine Stackable Food Dehydrator Machine Electric Multi Tier Fruit Vege

# Find Most Common Words in Cluster

These would be topics. We're doing a simple frequency analysis (rather than TF-IDF) because we expect the documents within a cluster to be similar: we aren't interested in words that distinguish one document from the others in its cluster, but in words that are common across the cluster.

Intuitively, these results make sense.

In [178]:
from collections import Counter
import re

In [189]:
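mask = res['cluster'] == 17
# Caution: ''.join fuses the last word of each name with the first word of the next; ' '.join avoids this
bow = ''.join([p for p in res.loc[mask, 'product']]).split(' ')
c = Counter(bow)
c.most_common()[:5]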

[('-', 25), ('Instant', 12), ('Urine', 12), ('Drug', 12), ('Test', 12)]

In [193]:
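mask = res['cluster'] == 17
# \w+ drops punctuation-only tokens such as '-'; the ''.join still fuses words across adjacent names
bow = re.findall(r'\w+', ''.join([p for p in res.loc[mask, 'product']]))
c = Counter(bow)
c.most_common()[:5]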

[('Panel', 12), ('Instant', 12), ('Urine', 12), ('Drug', 12), ('Test', 12)]

In [150]:
mask = res['cluster'] == 18
bow = ' '.join([p for p in res.loc[mask, 'product']]).split(' ')
c = Counter(bow)
c.most_common()[:5]

[('Meat', 18),
 ('Food', 17),
 ('Grinder', 12),
 ('Electric', 10),
 ('Dehydrator', 9)]

In [188]:
mask = res['cluster'] == 18
bow = re.findall(r'\w+', ''.join([p for p in res.loc[mask, 'product']]))
c = Counter(bow)
c.most_common()[:5]

[('Meat', 19),
 ('Food', 17),
 ('Grinder', 11),
 ('Dehydrator', 10),
 ('Electric', 9)]
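The per-cluster counting above could be wrapped in a small helper. This is a sketch rather than the notebook's method: `top_words` is a hypothetical name, it lowercases tokens (the cells above do not), and it joins names with a space so that words from adjacent names are not fused.

```python
import re
from collections import Counter


def top_words(names, n=5):
    """Count word tokens across product names, case-insensitively."""
    # Join with a space so the last word of one name stays separate from the next.
    text = " ".join(names).lower()
    return Counter(re.findall(r"\w+", text)).most_common(n)


# Three product names taken from the cluster listing earlier in this notebook.
sample_names = [
    "Olde Thompson Wild West Steak Grinder",
    "Gideon Hand Crank Manual Meat Grinder with Powerful Suction Base",
    "Nesco Professional 600W 5-Tray Food Dehydrator, FD-75PR",
]
top = top_words(sample_names, n=1)
```

With the notebook's data this would be called as `top_words(res.loc[mask, 'product'])`.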

## Evaluating Results

I'm not sure that a similarity/distance metric between centroids will produce the desired results for measuring cluster similarity. HDBSCAN builds a hierarchy and merges clusters under a given distance threshold. Perhaps we should embed the cluster topics instead, so that we can search for semantically similar topics?
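One way the topic-embedding idea might be sketched: encode each cluster's topic string and compare the topic vectors by cosine similarity. The encoder is passed in as a function so the sketch runs without a model; `similar_topics`, `toy_encode`, and `toy_topics` are hypothetical, and the letter-count encoder is only a stand-in for a real sentence embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_topics(topics, encode, threshold=0.8):
    """Return (label_a, label_b, similarity) for topic pairs at or above threshold."""
    labels = sorted(topics)
    vectors = {label: encode(topics[label]) for label in labels}
    matches = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim >= threshold:
                matches.append((a, b, sim))
    return matches


def toy_encode(text):
    """Stand-in encoder (letter counts); not a semantic embedding."""
    text = text.lower()
    return [text.count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]


# Hypothetical topic strings keyed by cluster label.
toy_topics = {0: "meat grinder", 1: "meat grinders", 2: "urine drug test"}
matches = similar_topics(toy_topics, toy_encode, threshold=0.9)
```

In the notebook, `encode` could be `model.encode`, with each topic string built from the cluster's most common words; the threshold would need tuning against real output.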

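The pairing of "similar" clusters (under both cosine and Euclidean distance) is pretty off. Part of this may be an indexing artifact: `np.argwhere` returns row/column positions, and because the centroid index starts at -1 (noise), position 17 is cluster 16, not cluster 17.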

Alternatively, HDBSCAN can export its condensed cluster tree to NetworkX (`condensed_tree_.to_networkx()`), so we could extract parent-child relationships among clusters.

The `hdbscan` library provides `approximate_predict` (for a model fit with `prediction_data=True`) rather than a `predict` method, so we don't really need to map distance/similarity to centroids manually. We really only care about which cluster an observation might belong to, and then which topic that cluster maps to.