# Clustering with word embedding + K-means++

In this chapter we start by discussing potential options for feature selection, then run a thought experiment on how to troubleshoot a machine learning model with the given dataset, and finally present a minimal word embedding solution. Configuration details are skipped; what matters here are the high-level ideas.

# Overview

**Let's have a look at an overview of several feature columns.**

Note: in this document we refer to `e_matched_tokens_categories_formatted` as `category_token`.

| Column name      | Size of vocabulary | Average length of text | Quality of content    | Difficulty of interpretation |
|------------------|--------------------|------------------------|-----------------------|------------------------------|
| long_description | ~50k words         | ~40 words              | low                   | hard                         |
| product_name     | ~10k words         | ~5 words               | medium                | medium                       |
| category_token   | ~200 words         | ~2 words               | high                  | easy                         |

**Notes on quality**

 * **long_description**

`long_description` often contains irrelevant information.

For example, the description of a Polo shirt might contain statements about delivery delays caused by Covid 19.

At the same time, the keyword `Covid 19` also appears in descriptions of protective face masks.

This information can confuse a machine learning model: it might incorrectly link a Polo shirt with a protective face mask because they share the keyword `Covid 19`.

  * **product_name**

If we reflect on why TF-IDF is useful, the core idea is that it gives important keywords a higher weight while penalizing background-noise words with a lower weight.

Since `product_name` is manually written by humans, it naturally consists mostly of important keywords.
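
The TF-IDF weighting idea can be sketched in a few lines of plain Python. This is a minimal illustration using the textbook `tf * log(N / df)` formula, not the configuration used in the project; the toy descriptions and the helper name `tf_idf` are made up for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (plain tf * idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy descriptions (made up for illustration, not taken from the dataset)
docs = [
    "polo shirt delivery delayed",
    "face mask delivery delayed",
    "jumpsuit delivery delayed",
]
weights = tf_idf(docs)
```

Words that appear in every toy description (`delivery`, `delayed`) get weight 0, while a word that is specific to one description (`polo`) keeps a positive weight.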

However, `product_name` also has its own issue.

|                |            |
|----------------|------------|
| black jumpsuit | black Polo |
| white jumpsuit | white Polo |


A black jumpsuit and a black Polo both share the keyword `black`, while a black jumpsuit and a white jumpsuit both share the keyword `jumpsuit`.

Which of them should be clustered to the same class?

Humans often have a strong preference for grouping `black jumpsuit` and `white jumpsuit` together, but a machine does not necessarily share that bias.
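
To see why a machine has no built-in reason to prefer one grouping, here is a small sketch that compares the product names purely by word overlap. Jaccard similarity is our choice for illustration only; it is not the measure used by the models in this project.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two product names."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

# Same color, different garment
same_color = jaccard("black jumpsuit", "black Polo")
# Same garment, different color
same_garment = jaccard("black jumpsuit", "white jumpsuit")
```

Both pairs share exactly one of three distinct words, so the two similarities are equal. Without extra knowledge that the garment matters more than the color, the two groupings are equally plausible to the model.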

Similarly, brand and material information might also confuse a machine learning model.

  * **category_token**

Following this direction, we can see that `category_token` is a more abstract and distilled summary compared to `product_name`.

`category_token` seems to come from some sort of manual or semi-automatic labeling.

Here is a piece of anecdotal evidence:

If we go to the retailer website referred to in the dataset, we can see product pages like this:

[Retailer: Orchard Mile ](https://orchardmile.com/saylor/shelley-say5c90bc1d)

<img src="https://github.com/fracting/sku-clustering/raw/main/images/retailer_category.png" width="700" align="left">

In the given dataset, this product is labeled as `["Accessories Hair", "Headband"]`, which looks like the result of regular-expression matching against the page's navigation path `Home / Saylor / Accessories / Hair Accessories / Shelley Headband`.

Manually sampling 1% of the dataset (200 items) suggests that the `category_token` column is a very accurate (multi-label) classifier for the product.

This is not surprising, because the retailers have already done the homework for us.

# Thought experiment

**How do we troubleshoot a machine learning clustering model with the given dataset?**

Recall that in the README we mentioned that dimension reduction is a powerful technique for debugging machine learning models.

Examples include:

 * reduce number of features
 * reduce complexity of features
 * reduce size of input
 * reduce size of output
 * etc
 
The previous section showed the relation between the different feature columns:

 * number_of_features(category_token) < number_of_features(product_name) < number_of_features(long_description)

 * complexity_of_features(category_token) < complexity_of_features(product_name) < complexity_of_features(long_description)

 * size_of_input(category_token) < size_of_input(product_name) < size_of_input(long_description)
 
etc

If we want to make our life easier, when a model doesn't work well with complex input data, we should always try with simpler input data instead.

This methodology leads us to investigate a toy model trained merely with category_token.

If we are allowed to cheat a little bit more, we should also decrease size_of_output.

The product dataset contains about 20k rows after cleaning up. If we only care about category_token, then those 20k rows are just different combinations of category_tokens. Instead of building a clustering model to classify 20k combinations of 200 category_tokens, can we build a clustering model to classify the 200 category_tokens themselves?
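
The reduction from rows to distinct tokens can be illustrated on toy data. The rows below are made up; the point is only that the number of distinct token combinations can grow much faster than the number of distinct tokens.

```python
from collections import Counter
from itertools import chain

# Toy rows (made up); each row carries a list of category tokens
rows = [
    ["polo"],
    ["shirt"],
    ["tennis"],
    ["polo", "shirt"],
    ["polo", "tennis"],
    ["shirt", "tennis"],
    ["polo", "shirt"],
]

# Distinct combinations of tokens: what a row-level model has to deal with
row_combinations = {tuple(row) for row in rows}

# Distinct tokens with their frequencies: what the toy model has to deal with
token_counts = Counter(chain.from_iterable(rows))
```

Here 7 rows collapse to 6 distinct combinations but only 3 distinct tokens, which is the smaller problem the toy model works on.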

# Coding time

In fact, the above thought experiment is exactly what happened during the investigation of the project.

As we mentioned in the README, many models didn't work particularly well, so we reduced the complexity again and again.

Eventually we tried a word embedding feature extractor with a K-means++ clustering algorithm on the 200 category_tokens, and this toy model turned out to be the most important step we made.

We will skip the details of the other, larger models and go straight to the toy model.

The dataset file was not uploaded to this GitHub repo for information security reasons.

In [1]:
import pandas as pd

df_sku = pd.read_json('exercise2.jl', lines=True)

def word_count(text):
    return len(text.split())

# Clean corrupted data, refer back to Chapter 2 for reason
df_sku['long_desc_wc'] = df_sku['long_description'].apply(word_count)
df_sku_clean = df_sku[df_sku['long_desc_wc'] != 19]

In [2]:
# Extract all category_token arrays to a list of arrays
category_tokens_list = df_sku_clean['e_matched_tokens_categories_formatted'].values.tolist()

# Flatten the list
from itertools import chain
category_tokens = list(chain.from_iterable(category_tokens_list))

# Convert to dataframe
df_category_token = pd.DataFrame({'category_token': category_tokens})
df_category_token_wc = df_category_token[['category_token']].value_counts().reset_index(name='cnt')

In [3]:
import spacy

# The model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Convert plurals like `shirts` to the singular form `shirt`
# Convert past participles like `short sleeved` to the base form `short sleeve`
def text_normalization(text):
    text = text.lower()
    text = text.replace('-', ' ')
    # nlp() already runs the lemmatizer, so each token's lemma_ is ready to use
    doc = nlp(text)
    return ' '.join(token.lemma_ for token in doc)

df_category_token_wc['category_token_normalized'] = \
    df_category_token_wc['category_token'].apply(text_normalization)

In [4]:
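# Sum only the numeric `cnt` column; in recent pandas versions a bare .sum()
# would also try to aggregate the string column `category_token`
df_category_token_normalized_wc = \
    df_category_token_wc.groupby('category_token_normalized')['cnt'] \
                        .sum() \
                        .sort_values(ascending=False) \
                        .reset_index()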

**Now let's have a look at the normalized category tokens**

In [5]:
df_category_token_normalized_wc

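|     | category_token_normalized | cnt  |
|-----|---------------------------|------|
| 0   | polo                      | 5091 |
| 1   | jumpsuit                  | 4761 |
| 2   | polo shirt                | 3860 |
| 3   | romper                    | 2795 |
| 4   | shirt                     | 2267 |
| ... | ...                       | ...  |
| 142 | headwrap                  | 1    |
| 143 | hike                      | 1    |
| 144 | runner                    | 1    |
| 145 | padfolio                  | 1    |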


**Next we will use a pretrained fastText model to convert category tokens to embedding vectors**

In [6]:
import fasttext
import fasttext.util

# 6.7 GB, it might take 10 mins the first time downloading the pretrained model
fasttext.util.download_model('en', if_exists='ignore')

fasttext_model = fasttext.load_model('cc.en.300.bin')

df_category_token_normalized_wc['vector'] = \
    df_category_token_normalized_wc['category_token_normalized'].apply(fasttext_model.get_sentence_vector)
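
To our understanding, `get_sentence_vector` on a pretrained (unsupervised) model divides each word vector by its L2 norm and then averages the results. We state this as an assumption about fastText internals rather than a guarantee. The sketch below reimplements that idea on made-up 2-dimensional vectors; `sentence_vector` and `toy_vectors` are hypothetical names used only here.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (leave zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def sentence_vector(text, word_vectors):
    """Average of the L2-normalized vectors of the words in `text`."""
    vecs = [l2_normalize(word_vectors[w]) for w in text.split()]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy 2-dimensional vectors (made up; the real model uses 300 dimensions)
toy_vectors = {"hair": [3.0, 4.0], "accessory": [0.0, 2.0]}
v1 = sentence_vector("hair accessory", toy_vectors)
v2 = sentence_vector("accessory hair", toy_vectors)
```

Because an average ignores word order, `v1` and `v2` are identical. This is consistent with the identical vectors of `hair accessory` and `accessory hair` in the clustering output of this chapter.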



In [7]:
df_category_token_normalized_wc

|     | category_token_normalized | cnt  | vector |
|-----|---------------------------|------|--------|
| 0   | polo       | 5091 | [0.1369399, -0.0060291286, -0.095844276, 0.065... |
| 1   | jumpsuit   | 4761 | [-0.004546515, -0.022176884, -0.024613941, 0.0... |
| 2   | polo shirt | 3860 | [0.12262818, 0.0032589443, -0.050291095, 0.071... |
| 3   | romper     | 2795 | [0.077759564, -0.035508834, -0.053101204, 0.08... |
| 4   | shirt      | 2267 | [0.108316466, 0.012547017, -0.0047379117, 0.07... |
| ... | ...        | ...  | ... |
| 142 | headwrap   | 1    | [0.015347554, -0.08992355, 0.06988579, 0.05231... |
| 143 | hike       | 1    | [0.06327072, 0.03968756, -0.00056086, 0.048862... |
| 144 | runner     | 1    | [-0.0069131004, -0.017952213, -0.02319522, 0.0... |
| 145 | padfolio   | 1    | [0.10828615, -0.027963176, 0.032500546, 0.0402... |


**Finally we use k-means++ to cluster the tokens**

In [8]:
from sklearn.cluster import KMeans

num_clusters = 5
category_vectors = df_category_token_normalized_wc['vector'].values.tolist()
kmeans = KMeans(n_clusters=num_clusters,
                random_state=0,
                init='k-means++',
                n_init=100).fit(category_vectors)
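
For readers unfamiliar with `init='k-means++'`, the seeding step can be sketched as follows: the first centroid is picked at random, and every further centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far. This is a minimal sketch of the basic k-means++ idea on made-up points; scikit-learn's actual implementation may differ in details, and `kmeans_pp_init` is a hypothetical helper.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids from `points` using k-means++ seeding."""
    rng = random.Random(seed)
    # First centroid: uniformly at random
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to d2;
        # already chosen points have weight 0 and cannot be picked again
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Toy 2-dimensional points (made up)
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
centroids = kmeans_pp_init(points, k=3)
```

Far-away points are more likely to become centroids, which spreads the initial centroids out and usually gives K-means a better starting position than purely random seeding.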

In [9]:
df_category_token_normalized_wc['class_id'] = kmeans.labels_

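# KMeans.score returns the negative squared distance to the nearest centroid,
# so values closer to 0 mean the token sits nearer to its cluster center
df_category_token_normalized_wc['score'] = \
    df_category_token_normalized_wc['vector'].apply(lambda x: kmeans.score(x.astype(float).reshape(1,-1)))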
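
When `kmeans.score` is called on a single sample, it returns the negative of the K-means objective for that sample, that is, the negative squared distance to the nearest centroid. The sketch below reproduces this for one vector with plain Python; the centroids are made up and `score_one` is a hypothetical helper.

```python
def score_one(vector, centroids):
    """Negative squared distance from `vector` to its nearest centroid."""
    return -min(
        sum((x - c) ** 2 for x, c in zip(vector, centroid))
        for centroid in centroids
    )

# Toy centroids (made up)
toy_centroids = [[1.0, 0.0], [3.0, 4.0]]
score = score_one([0.0, 0.0], toy_centroids)
```

The nearest toy centroid is at squared distance 1, so the score is -1.0. Sorting by score in descending order therefore lists the tokens closest to their cluster center first.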

In [10]:
def show_class(df, class_id):
    return df[df['class_id']==class_id].sort_values('score', ascending=False)

# Output of clustering

Since we've greatly reduced the dimension, fewer than 200 category tokens remain.

Now we can even manually inspect all of them.

In [11]:
show_class(df_category_token_normalized_wc, 0)

|     | category_token_normalized | cnt  | vector | class_id | score |
|-----|---------------------------|------|--------|----------|-------|
| 3   | romper      | 2795 | [0.077759564, -0.035508834, -0.053101204, 0.08... | 0 | -0.374032 |
| 20  | onesie      | 225  | [0.034853995, 0.021573989, -0.03989922, 0.0640... | 0 | -0.404997 |
| 79  | pajama      | 18   | [0.029283147, 0.008668089, -0.052012667, 0.116... | 0 | -0.408194 |
| 1   | jumpsuit    | 4761 | [-0.004546515, -0.022176884, -0.024613941, 0.0... | 0 | -0.419991 |
| 122 | pyjama      | 4    | [0.051504638, 0.0064405226, -0.096497536, 0.12... | 0 | -0.421447 |
| 15  | playsuit    | 445  | [0.04652863, -0.021274956, -0.017050577, 0.038... | 0 | -0.437983 |
| 111 | nightgown   | 7    | [0.033360124, -0.023491126, -0.031441677, 0.04... | 0 | -0.455988 |
| 4   | shirt       | 2267 | [0.108316466, 0.012547017, -0.0047379117, 0.07... | 0 | -0.456333 |
| 124 | pinafore    | 4    | [0.06022062, 0.009270296, -0.02809044, 0.10194... | 0 | -0.507523 |
| 141 | shortsleeve | 1    | [0.08988393, -0.05184712, -0.03816318, 0.01907... | 0 | -0.511569 |


In [12]:
show_class(df_category_token_normalized_wc, 1)

|     | category_token_normalized | cnt | vector | class_id | score |
|-----|---------------------------|-----|--------|----------|-------|
| 9   | hair accessory  | 978 | [0.10925885, 0.016616622, 0.0046025505, 0.1052... | 1 | -0.233285 |
| 10  | accessory hair  | 951 | [0.10925885, 0.016616622, 0.0046025505, 0.1052... | 1 | -0.233285 |
| 18  | hair clip       | 273 | [0.112400554, 0.06426445, -0.0051781703, 0.082... | 1 | -0.239992 |
| 86  | hair pin        | 15  | [0.049266607, 0.0057037314, 0.031365875, 0.098... | 1 | -0.240514 |
| 106 | hair elastic    | 8   | [0.04642378, -0.023742367, 0.024565283, 0.0884... | 1 | -0.275058 |
| 45  | hair tie        | 56  | [0.028725307, -0.030694835, 0.03806801, 0.1022... | 1 | -0.278656 |
| 94  | hair wrap       | 12  | [-0.0035697967, -0.04907577, 0.013940261, 0.11... | 1 | -0.282579 |
| 96  | hair slide      | 11  | [0.072968, 0.012904251, 0.011043343, 0.1028609... | 1 | -0.307365 |
| 93  | hair band       | 13  | [0.036252037, -0.004781616, -0.030113734, 0.07... | 1 | -0.310957 |
| 66  | ponytail holder | 25  | [0.028387446, -0.019611627, -0.013860179, 0.05... | 1 | -0.328204 |


In [13]:
show_class(df_category_token_normalized_wc, 2)

|    | category_token_normalized | cnt  | vector | class_id | score |
|----|---------------------------|------|--------|----------|-------|
| 69 | lip conditioner | 24   | [-0.042885996, -0.013935762, 0.014158999, 0.07... | 2 | -0.315814 |
| 31 | face cream      | 141  | [0.03783966, 0.009735208, -0.068710856, 0.0884... | 2 | -0.324982 |
| 64 | face oil        | 26   | [-0.040884756, 0.03653248, -0.031320754, 0.076... | 2 | -0.369328 |
| 73 | lip care        | 22   | [-0.01084067, 0.00085377553, 0.05987587, 0.062... | 2 | -0.378483 |
| 75 | acne treatment  | 22   | [0.02090497, 0.03427422, 0.0024520592, 0.04104... | 2 | -0.384247 |
| 56 | skin age        | 33   | [0.08274053, 0.04169998, 0.030447911, 0.072600... | 2 | -0.396793 |
| 25 | lip balm        | 184  | [-0.017611254, -0.01346375, 0.018388636, 0.051... | 2 | -0.401134 |
| 37 | moisturizer     | 110  | [0.0046942267, 0.012510886, -0.047919646, 0.05... | 2 | -0.413482 |
| 91 | night treatment | 13   | [-0.062050186, 0.06539865, 0.0061826804, 0.051... | 2 | -0.447263 |
| 6  | face mask       | 1400 | [-0.038631476, 0.009795161, -0.03587358, 0.079... | 2 | -0.461839 |


In [14]:
show_class(df_category_token_normalized_wc, 3)

|     | category_token_normalized | cnt  | vector | class_id | score |
|-----|---------------------------|------|--------|----------|-------|
| 68  | sport polo | 24   | [0.09784104, 0.008656737, -0.061645824, 0.0738... | 3 | -0.266233 |
| 105 | golf shirt | 9    | [0.10114078, -0.016256353, -0.040513292, 0.074... | 3 | -0.316339 |
| 2   | polo shirt | 3860 | [0.12262818, 0.0032589443, -0.050291095, 0.071... | 3 | -0.367792 |
| 58  | polo top   | 31   | [0.0352564, 0.044426206, -0.091478795, 0.08856... | 3 | -0.389891 |
| 59  | soccer     | 30   | [0.029949285, 0.008232417, -0.049817447, 0.074... | 3 | -0.459266 |
| 108 | basketball | 8    | [0.062035482, 0.01703059, -0.05455231, 0.05537... | 3 | -0.508571 |
| 27  | tennis     | 179  | [0.05499815, 0.032198373, -0.09364502, 0.09065... | 3 | -0.514507 |
| 101 | football   | 10   | [-0.033197343, 0.012118939, -0.042949986, 0.05... | 3 | -0.533646 |
| 0   | polo       | 5091 | [0.1369399, -0.0060291286, -0.095844276, 0.065... | 3 | -0.545554 |
| 24  | athletic   | 191  | [0.06001214, -0.0881124, -0.031627744, 0.07229... | 3 | -0.54991 |


In [15]:
show_class(df_category_token_normalized_wc, 4)

|     | category_token_normalized | cnt | vector | class_id | score |
|-----|---------------------------|-----|--------|----------|-------|
| 103 | I pad case         | 9  | [0.0068292394, -0.00043960413, 0.060589682, 0.... | 4 | -0.251652 |
| 118 | bag with           | 5  | [0.05195016, -0.018661324, 0.0035954742, 0.049... | 4 | -0.312706 |
| 137 | full length sleeve | 1  | [0.038707234, 0.019720688, -0.05021169, 0.0517... | 4 | -0.325261 |
| 136 | envelope case      | 1  | [-0.020284563, 0.009617137, 0.09644465, 0.0782... | 4 | -0.338867 |
| 146 | zip around closure | 1  | [-0.010715639, -0.052582975, 0.025139077, 0.05... | 4 | -0.340095 |
| 112 | tablet case        | 6  | [-6.582588e-05, 0.04337097, 0.08402235, 0.0830... | 4 | -0.344318 |
| 82  | laptop case        | 17 | [-0.025039159, 0.03716725, 0.081087366, 0.0851... | 4 | -0.346056 |
| 54  | laptop sleeve      | 36 | [0.008130789, 0.01809328, 0.0105188405, 0.0592... | 4 | -0.347598 |
| 97  | tablet sleeve      | 11 | [0.03310412, 0.024296997, 0.013453822, 0.05715... | 4 | -0.359679 |
| 135 | tablet cover       | 1  | [0.0010241214, 0.050637156, 0.068282634, 0.069... | 4 | -0.359844 |


# Observations

After manually reviewing all category tokens, we made a few observations.

There are clearly 5 classes:
 * class_0: women's clothing like jumpsuit, romper
 * class_1: hair accessories and jewelry
 * class_2: beauty and skin care
 * class_3: polo, sportswear and ball sports
 * class_4: computer cases and ... hmm, something is wrong here

The clustering results are not bad, but far from perfect.

If we take a closer look, we find some very interesting things:

 * `shortsleeve` is misclassified into the `jumpsuit` class, even though a `jumpsuit` should be `long sleeve`. It makes more sense to put `shortsleeve` into the `polo` class
 * similarly, `shirt` is misclassified into the `jumpsuit` class. It makes more sense to put `shirt` and `polo` together
 * even funnier, `long sleeve` is misclassified into the `laptop sleeve` class; they are very different things despite sharing the word `sleeve`
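
One plausible contributor to the `sleeve` confusion is that a phrase vector built by averaging word vectors inherits half of its content from the shared word. We offer this as a hypothesis rather than a conclusion of this chapter. The sketch uses made-up unit vectors in which the three words are completely unrelated to each other.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Toy unit vectors (made up): three unrelated words on separate axes
toy = {
    "long":   [1.0, 0.0, 0.0],
    "laptop": [0.0, 1.0, 0.0],
    "sleeve": [0.0, 0.0, 1.0],
}
long_sleeve = average([toy["long"], toy["sleeve"]])
laptop_sleeve = average([toy["laptop"], toy["sleeve"]])

modifier_similarity = cosine(toy["long"], toy["laptop"])
phrase_similarity = cosine(long_sleeve, laptop_sleeve)
```

Even though `long` and `laptop` have similarity 0 in this toy setup, the two phrases end up with a cosine similarity of about 0.5, purely because of the shared word.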

**In fact, this type of error was very common when we were working on larger models with more complicated inputs.**

The dimension reduction strategy we applied greatly simplifies the output but still effectively reproduces the original defect.

In other words, it's a minimal test case to reproduce the original issue.

# Puzzles

Although the clustering results are not terrible, some puzzles remain.

 * Why did some misclassifications happen?

 * Can we use the output from the toy model directly on the original problem?

Remember that we spent a lot of effort reducing the dimension again and again, from the original problem down to a toy model. How does the toy model help diagnose the original problem?

Also, don't forget that the original challenge asked for a clustering of the product dataset, not of the category tokens.

We are very close. We will take a break and answer these questions in the last chapter.

**Before getting to the next chapter, let's cache the datasets to avoid duplicate work**

In [16]:
# The dataset file was not uploaded to this GitHub repo for information security reasons.
df_category_token_normalized_wc.to_csv('df_category_token_normalized_wc.csv', index=False)
df_category_token_wc.to_csv('df_category_token_wc.csv', index=False)
df_sku_clean.to_csv('df_sku_clean.csv', index=False)