# Problem Statement

Within each name group (product names that share a barcode), there may be outliers: names that look clean but clearly differ from the other names in the group.

Example given:
* SEAGULL NATPH 25g WRNA / PCS
* SEA GULL WARNA RENTENG
* MANGKOK SAMBALL ALL VAR
* SEAQULL NAPT WARNA 25GR
* \- SEA GULL NAPHT 25GR SG-519W 1PCSX 1.500,00:

The task is to formulate a solution to this problem using data science techniques.

# Assessment Plan

The product names are given as strings that may include redundant information, such as quantities or stray single letters.

1. Remove unnecessary information from product name using Named Entity Recognition: 
    * QTY (quantity)
    * PROD (product name)
    * ADJ (adjective/descriptors)
    * O (not important words)
2. Use an ensemble of embeddings to convert strings/texts into numbers
    * TF-IDF
    * BERT embeddings
3. Similarity score on the embeddings using various distance functions
4. Outlier detection using weighted average of the features

# Data preprocessing

Receipt Cord Data

Link: <https://huggingface.co/datasets/naver-clova-ix/cord-v2>

Dataset description: Collection of receipts (both image and text extracted) from Indonesian retailers and restaurants

Reasons for using this dataset:
- Indonesian receipts collected from shops and restaurants
- Collected from real retailers
- Matches problem domain where there is noise in the product names.

In [1]:
import pandas as pd
import os

In [2]:
# Download file is split into parquets, so we have to join them
df_list = []
for filepath in os.listdir('receipt_cord/'):
    df = pd.read_parquet(os.path.join('receipt_cord', filepath))
    df.drop(columns=['image'], inplace=True) # remove image column since this is OCR dataset
    df['split'] = 'train' if 'train' in filepath else 'test' if 'test' in filepath else 'val'
    df_list.append(df)

In [3]:
# create one dataframe
df = pd.concat(df_list, ignore_index=True)
df.shape

(1000, 2)

In [4]:
# parse nested json to get product name
import ast

def get_prod_names(nested_json):
    try:
        nested_json = ast.literal_eval(nested_json)
    except:
        return None
    
    if 'gt_parse' in nested_json:
        if 'menu' in nested_json['gt_parse']:
            if isinstance(nested_json['gt_parse']['menu'], list):
                try:
                    product_names = list(map(lambda x: x['nm'], nested_json['gt_parse']['menu']))
                    return product_names
                except:
                    return None
            else:
                try:
                    return [nested_json['gt_parse']['menu']['nm']]
                except:
                    return None
    else:
        return None

In [5]:
df['product_name'] = df['ground_truth'].apply(get_prod_names)
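
The parsing step above can also be sketched with `json.loads`, assuming `ground_truth` is stored as a JSON string. The function name `extract_names` and the toy records are made up for illustration; they are not taken from the dataset. Unlike the cell above, this sketch drops non-string `nm` values immediately instead of filtering them out later.

```python
import json

def extract_names(raw):
    """Return a flat list of product-name strings from one ground-truth record."""
    try:
        record = json.loads(raw)
    except (TypeError, ValueError):
        return None  # not a JSON string
    if not isinstance(record, dict):
        return None
    gt_parse = record.get("gt_parse")
    if not isinstance(gt_parse, dict) or "menu" not in gt_parse:
        return None
    menu = gt_parse["menu"]
    # A receipt with a single item stores a dict instead of a list of dicts
    items = menu if isinstance(menu, list) else [menu]
    names = []
    for item in items:
        nm = item.get("nm") if isinstance(item, dict) else None
        if isinstance(nm, str):  # skip missing or list-valued names
            names.append(nm)
    return names or None

# Toy records (hypothetical, for illustration only)
multi = json.dumps({"gt_parse": {"menu": [{"nm": "Lemon Tea (L)"}, {"nm": "PAHA BAWAH"}]}})
single = json.dumps({"gt_parse": {"menu": {"nm": "Choco Cheese"}}})

print(extract_names(multi))
print(extract_names(single))
print(extract_names("not json"))
```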

In [6]:
df.dropna(inplace=True)

In [7]:
df = df.explode('product_name').reset_index(drop=True)

In [8]:
df.drop(columns=['ground_truth'], inplace=True)

In [9]:
# remove rows where product names are list
df['type'] = df['product_name'].apply(lambda x: isinstance(x, list))

In [10]:
df = df[df.type==False]

In [11]:
df

| | split | product_name | type |
|---|---|---|---|
| 0 | test | -TICKET CP | False |
| 1 | test | J.STB PROMO | False |
| 2 | test | Y.B.BAT | False |
| 3 | test | Y.BASO PROM | False |
| 4 | test | JASMINE MT ( L ) | False |
| ... | ... | ... | ... |
| 2563 | val | PAHA BAWAH | False |
| 2564 | val | Choco Cheese | False |
| 2565 | val | Lemon Tea (L) | False |
| 2566 | val | Hulk Topper Package | False |


## For labelling

Save the product names to a newline-separated text file for labelling with the [NER Annotator for spaCy](https://tecoholic.github.io/ner-annotator/).

In [12]:
# text_file = open("for_labelling_2.txt", "w")
# product_list = df.product_name.unique().tolist()[1004:]
# product_list = list(set(product_list))
# print(len(product_list))
# n = text_file.write('\n'.join(product_list))
# text_file.close()

In [13]:
# stopped at 1004 row

In [14]:
# stopped at 1004 + 224

## Using labelled annotations

Preprocess the annotations into a usable format.

In [15]:
import pandas as pd
import json

In [16]:
with open('annotations.json') as json_file:
    data = json.load(json_file)

In [17]:
with open('annotations_2.json') as json_file:
    add_data = json.load(json_file)

In [18]:
annotations = data['annotations']

In [19]:
annotations += add_data['annotations']

In [20]:
annotations = [i for i in annotations if i is not None]

In [21]:
text = list(map(lambda x: x[0].replace('\r', ''), annotations))
entities = list(map(lambda x: x[1]['entities'], annotations))

In [22]:
df = pd.DataFrame({'text': text, 'entities': entities})
df

| | text | entities |
|---|---|---|
| 0 | J.STB PROMO | [[0, 5, PROD], [6, 11, ADJ]] |
| 1 | Y.B.BAT | [[0, 7, PROD]] |
| 2 | Y.BASO PROM | [[0, 6, PROD], [7, 11, ADJ]] |
| 3 | JASMINE MT ( L ) | [[0, 10, PROD], [11, 16, ADJ]] |
| 4 | DONAT GULA | [[0, 5, PROD], [6, 10, ADJ]] |
| ... | ... | ... |
| 1053 | DADAR PISANG | [[0, 12, PROD]] |
| 1054 | JAPANEESE GR TEA HOT | [[0, 16, PROD], [17, 20, ADJ]] |
| 1055 | <FC Winger HC | [[4, 10, PROD]] |
| 1056 | SUPER CHEESE | [[0, 12, PROD]] |


In [23]:
df = df.groupby('text')['entities'].first().reset_index()

In [24]:
def split_text_to_tags(row):
    text = row['text']
    entities = row['entities']
    
    all_tags = []
    left = 0
    while len(entities) > 0:
        if left!=entities[0][0]:
            all_tags.append([left, entities[0][0], 'O'])
        all_tags.append(entities[0])
        left = entities[0][1]
        entities = entities[1:]
    if left!=len(text):
        all_tags.append([left, len(text), 'O'])
    return all_tags

In [25]:
df['tags'] = df.apply(split_text_to_tags, axis=1)
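
To make the gap-filling step concrete, here is a standalone sketch of the same idea applied to two rows from the table above. The name `fill_gaps` is mine, and the `sorted` call is an added assumption (the cell above relies on the annotations already being in order).

```python
def fill_gaps(text, entities):
    """Return the entity spans plus 'O' spans covering every unlabelled stretch."""
    spans, cursor = [], 0
    for start, end, label in sorted(entities):
        if cursor != start:
            spans.append([cursor, start, "O"])  # unlabelled text before this entity
        spans.append([start, end, label])
        cursor = end
    if cursor != len(text):
        spans.append([cursor, len(text), "O"])  # unlabelled tail
    return spans

print(fill_gaps("J.STB PROMO", [[0, 5, "PROD"], [6, 11, "ADJ"]]))
print(fill_gaps("<FC Winger HC", [[4, 10, "PROD"]]))
```

In the first row the only gap is the single space between the two entities; in the second, both the `<FC ` prefix and the ` HC` suffix become `O` spans.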

In [26]:
df = df.explode('tags').reset_index().rename(columns={'index': 'text_index'})
df.dropna(inplace=True)
df['entity'] = df.apply(lambda row: row['text'][row['tags'][0]: row['tags'][1]], axis=1)
df = df[df.entity!=' '].reset_index(drop=True)

In [27]:
df['label'] = df['tags'].apply(lambda x: x[2])

In [28]:
import re
df['entity'] = df['entity'].apply(lambda x: re.split(r"([^a-zA-Z0-9])", x))

In [29]:
df = df.explode('entity').reset_index()
df = df[~df.entity.isin(['', ' '])].reset_index(drop=True)

In [30]:
df['pos'] = df.groupby(['index']).cumcount()+1

In [31]:
df['pos'] = df.apply(lambda row: '' if row['label']=='O' else 'B-' if row['pos']==1 else 'I-', axis=1)

In [32]:
df['pos_tag'] = df['pos'] + df['label']

In [33]:
df = df[['text_index', 'text', 'entity', 'pos_tag']]

In [34]:
data = df.groupby('text_index')['pos_tag'].apply(list).reset_index()

In [35]:
new_df = df.groupby('text_index')['text'].first().reset_index()

In [36]:
data = pd.merge(data, new_df, on='text_index')

In [37]:
data

| | text_index | pos_tag | text |
|---|---|---|---|
| 0 | 0 | [O, O, B-PROD, B-PROD, O] | #F11 CREAM HAMBURG 0 |
| 1 | 1 | [O, B-PROD] | #PKTPOLSBTSPON2S |
| 2 | 2 | [O, O, O, B-PROD, I-PROD, I-PROD] | (TA) BIHUN GORENG SEAFOOD |
| 3 | 3 | [O, O, O, B-PROD, I-PROD, I-PROD] | (TA) KWETIAW SEAFOOD SIRAM |
| 4 | 4 | [O, O, O, B-PROD, I-PROD] | (TA) NASI GORENG |
| ... | ... | ... | ... |
| 933 | 933 | [B-ADJ, B-PROD] | green tea |
| 934 | 934 | [B-PROD, I-PROD] | phad thai |
| 935 | 935 | [B-ADJ, B-PROD, I-PROD] | red curry beef |
| 936 | 936 | [B-PROD, I-PROD] | steamed rice |


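Final dataframe: `pos_tag` holds one BIO tag per token (`B-` marks the first token of an entity, `I-` a continuation, `O` everything else), which is the format used to train BERT for NER.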
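
The span-to-token conversion done by the cells above can be condensed into one function. This is a sketch rather than the exact notebook code: `spans_to_bio` is a name I made up, and it assumes character-offset entities of the form `[start, end, label]`. It uses the same tokenisation rule (split on every non-alphanumeric character, keep the delimiter, drop empty strings and single spaces).

```python
import re

def spans_to_bio(text, entities):
    """Convert character-level entity spans into per-token BIO tags."""
    tokens, tags = [], []
    cursor = 0
    # A zero-width sentinel span makes the loop also process any unlabelled tail
    sentinel = [[len(text), len(text), "O"]]
    for start, end, label in sorted(entities) + sentinel:
        # First the unlabelled gap before the entity, then the entity itself
        for chunk, chunk_label in [(text[cursor:start], "O"), (text[start:end], label)]:
            first = True
            for tok in re.split(r"([^a-zA-Z0-9])", chunk):
                if tok in ("", " "):
                    continue
                if chunk_label == "O":
                    tags.append("O")
                else:
                    tags.append(("B-" if first else "I-") + chunk_label)
                first = False
                tokens.append(tok)
        cursor = end
    return tokens, tags

print(spans_to_bio("(TA) NASI GORENG", [[5, 16, "PROD"]]))
print(spans_to_bio("green tea", [[0, 5, "ADJ"], [6, 9, "PROD"]]))
```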

In [38]:
# data.to_csv('labelled_data_new.csv', index=False)

## Split data and Define Unique Labels

In [39]:
data = pd.read_csv('labelled_data_new.csv')

In [40]:
import ast

In [41]:
data['pos_tag'] = data['pos_tag'].apply(ast.literal_eval)

In [42]:
data

| | text_index | pos_tag | text |
|---|---|---|---|
| 0 | 0 | [O, O, B-PROD, B-PROD, O] | #F11 CREAM HAMBURG 0 |
| 1 | 1 | [O, B-PROD] | #PKTPOLSBTSPON2S |
| 2 | 2 | [O, O, O, B-PROD, I-PROD, I-PROD] | (TA) BIHUN GORENG SEAFOOD |
| 3 | 3 | [O, O, O, B-PROD, I-PROD, I-PROD] | (TA) KWETIAW SEAFOOD SIRAM |
| 4 | 4 | [O, O, O, B-PROD, I-PROD] | (TA) NASI GORENG |
| ... | ... | ... | ... |
| 933 | 933 | [B-ADJ, B-PROD] | green tea |
| 934 | 934 | [B-PROD, I-PROD] | phad thai |
| 935 | 935 | [B-ADJ, B-PROD, I-PROD] | red curry beef |
| 936 | 936 | [B-PROD, I-PROD] | steamed rice |


In [43]:
data.rename(columns={'pos_tag': 'labels'}, inplace=True)

In [44]:
import numpy as np

# split labelled data into train, test, validation
df_train, df_val, df_test = np.split(data.sample(frac=1, random_state=42),
                            [int(.8 * len(data)), int(.9 * len(data))])
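
As far as I know, calling `np.split` on a DataFrame may emit a deprecation warning in recent pandas versions. An equivalent 80/10/10 split can be sketched with plain `iloc` slicing; `split_frame` and the toy frame are hypothetical names used only for this illustration.

```python
import pandas as pd

def split_frame(df, cuts=(0.8, 0.9), seed=42):
    """Shuffle df, then cut it into train / validation / test parts at the given fractions."""
    shuffled = df.sample(frac=1, random_state=seed)
    first = int(cuts[0] * len(shuffled))
    second = int(cuts[1] * len(shuffled))
    return shuffled.iloc[:first], shuffled.iloc[first:second], shuffled.iloc[second:]

# Toy frame standing in for the labelled data
toy = pd.DataFrame({"text": [f"item {i}" for i in range(20)]})
train_part, val_part, test_part = split_frame(toy)
print(len(train_part), len(val_part), len(test_part))
```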

In [45]:
# df_train.to_csv('df_train.csv', index=False)
# df_val.to_csv('df_val.csv', index=False)
# df_test.to_csv('df_test.csv', index=False)

# Named Entity Recognition

BERT is used only for the NER portion of the assessment. Training and inference are done in `train_script.py`.

Reasons for using BERT:
- Many pre-trained models are available
- It is still a strong baseline for NER
- Fine-tuning a pre-trained model works with a small dataset and a short training time

Alternative models:
- ELMo
- spaCy
- NLTK

In [46]:
# load preds
data_preds = pd.read_csv('data_with_preds.csv')
data_preds['labels'] = data_preds['labels'].apply(ast.literal_eval)
data_preds['preds'] = data_preds['preds'].apply(ast.literal_eval)

# Embeddings

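Using the predicted tags, keep only the tokens tagged as product (`PROD`) and, separately, those tagged as adjective (`ADJ`); all other tokens are dropped.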

In [47]:
import re
data_preds['text_tokenized'] = data_preds['text'].apply(lambda x: re.split(r"([^a-zA-Z0-9])", x))
data_preds['text_tokenized'] = data_preds['text_tokenized'].apply(lambda x: [i for i in x if i not in ['',' ']])

In [48]:
data_preds['match'] = data_preds.apply(lambda row: len(row['preds']) == len(row['text_tokenized']), axis=1)

In [49]:
data_preds = data_preds[data_preds.match==True].reset_index(drop=True)

In [50]:
data_preds

| | text_index | labels | text | preds | text_tokenized | match |
|---|---|---|---|---|---|---|
| 0 | 70 | [B-PROD, B-ADJ] | AYAM GORENG | [B-PROD, B-ADJ] | [AYAM, GORENG] | True |
| 1 | 331 | [B-PROD, I-PROD, I-PROD] | GRAINS PAN BREAD | [B-PROD, I-PROD, I-PROD] | [GRAINS, PAN, BREAD] | True |
| 2 | 858 | [B-PROD, I-PROD] | TWIST DONUT | [B-PROD, I-PROD] | [TWIST, DONUT] | True |
| 3 | 495 | [B-BRAND, B-PROD, I-PROD, O] | MAGNUM WHT ALMND 80 | [B-ADJ, B-PROD, I-PROD, O] | [MAGNUM, WHT, ALMND, 80] | True |
| 4 | 209 | [O, O, B-PROD, O, B-ADJ, O, O] | CT06.TEA 0 ICE R. | [O, O, B-PROD, O, B-ADJ, O, O] | [CT06, ., TEA, 0, ICE, R, .] | True |
| ... | ... | ... | ... | ... | ... | ... |
| 933 | 106 | [B-PROD, O, B-PROD] | BASO TAHU KWETIAU | [B-PROD, I-PROD, I-PROD] | [BASO, TAHU, KWETIAU] | True |
| 934 | 270 | [B-PROD, B-PROD] | DONAT AYAM | [B-PROD, B-PROD] | [DONAT, AYAM] | True |
| 935 | 860 | [B-BRAND, B-PROD, I-PROD] | TWIST STRAWBERRY DONUT | [O, B-ADJ, B-PROD] | [TWIST, STRAWBERRY, DONUT] | True |
| 936 | 435 | [B-PROD, I-PROD, B-ADJ, I-ADJ, I-ADJ] | KITSUNE UDON (KIDS) | [B-PROD, I-PROD, O, I-ADJ, O] | [KITSUNE, UDON, (, KIDS, )] | True |


In [51]:
def get_prod_names(row):
    keep = []
    for i in range(len(row['preds'])):
        if 'PROD' in row['preds'][i]:
            keep.append(row['text_tokenized'][i])
    return keep

In [52]:
def get_adj(row):
    keep = []
    for i in range(len(row['preds'])):
        if 'ADJ' in row['preds'][i]:
            keep.append(row['text_tokenized'][i])
    return keep

In [53]:
data_preds['adj_tags'] = data_preds.apply(get_adj, axis=1)
data_preds['product_tags'] = data_preds.apply(get_prod_names, axis=1)

In [54]:
data_preds['product_name'] = data_preds['product_tags'].apply(lambda x: ' '.join(x))
data_preds['adj'] = data_preds['adj_tags'].apply(lambda x: ' '.join(x))

In [55]:
data_preds.shape

(938, 10)

# Create product groups

Simulate 3 product groups, each seeded with 4 outliers: 2 sampled from each of the other two groups.

In [56]:
data_preds['product_group'] = data_preds['text'].apply(lambda x: 'lemon' if 'lemon' in x.lower() else 'ayam' if 'ayam' in x.lower() else 'cheese' if 'cheese' in x.lower() else None)

In [57]:
data_preds.product_group.value_counts()

product_group
ayam      29
cheese    27
lemon     15
Name: count, dtype: int64

In [58]:
lemon_group = data_preds[data_preds.product_group=='lemon'].reset_index()
ayam_group = data_preds[data_preds.product_group=='ayam'].reset_index()
cheese_group = data_preds[data_preds.product_group=='cheese'].reset_index()

In [59]:
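# Sample outliers from the original groups, so that an outlier already injected
# into one group cannot be re-sampled into another.
# Note: the sample(2) calls without random_state draw different rows on each run.
lemon_base, ayam_base, cheese_base = lemon_group, ayam_group, cheese_group
lemon_group = pd.concat([lemon_base, ayam_base.sample(2), cheese_base.sample(2, random_state=1234)], ignore_index=True)
ayam_group = pd.concat([ayam_base, lemon_base.sample(2), cheese_base.sample(2, random_state=1234)], ignore_index=True)
cheese_group = pd.concat([cheese_base, ayam_base.sample(2), lemon_base.sample(2, random_state=1234)], ignore_index=True)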

In [60]:
lemon_group.drop(columns=['index', 'text_index', 'text_tokenized', 'match', 'adj_tags', 'product_tags'], inplace=True)
ayam_group.drop(columns=['index', 'text_index', 'text_tokenized', 'match', 'adj_tags', 'product_tags'], inplace=True)
cheese_group.drop(columns=['index', 'text_index', 'text_tokenized', 'match', 'adj_tags', 'product_tags'], inplace=True)

## Convert to embedding

Two methods are used to convert words into embeddings (numbers):
1. TF-IDF
2. BERT model

### TF-IDF

TF-IDF (term frequency-inverse document frequency) is a measure from information retrieval (IR) and machine learning that quantifies how important a term (word, phrase, lemma, etc.) is to a document within a collection of documents (the corpus).
* TF: number of times the word appears in a document (raw count)
* IDF: inverse document frequency, which reflects how common (or uncommon) a word is across the corpus

TF-IDF can be computed at the word level or at the character level, and the two can serve as separate features. This example uses word-level TF-IDF, with each product name treated as a document and each group as the corpus.
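
To show what the numbers mean, here is a small from-scratch sketch of word-level TF-IDF. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is scikit-learn's documented default. The tokenisation is a plain lowercase whitespace split (an assumption of this sketch; `TfidfVectorizer` has its own token pattern), and the three product names are toy inputs.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard against empty documents
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def cosine(u, v):
    """Dot product; equals cosine similarity because the vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))

vocab, vecs = tfidf_vectors(["lemon tea", "lemon macchiato", "cheese tart"])
print(vocab)
print(round(cosine(vecs[0], vecs[1]), 4))  # share the word "lemon"
print(cosine(vecs[0], vecs[2]))            # no word in common
```

Because "lemon" occurs in two of the three names, it receives a lower weight than "tea", so the two lemon drinks are only moderately similar even though they share a word.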

#### Product name as feature

Cosine similarity is computed on the product name only. For each name in a group, the pairwise cosine similarity with every other name in the group is computed and averaged, excluding the name itself (whose similarity would be 1.0).

In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [62]:
groups = [lemon_group, ayam_group, cheese_group]

In [63]:
group_tfidf = []
features = []
for group in groups:
    vectorizer = TfidfVectorizer(analyzer='word')
    vectors = vectorizer.fit_transform(group.product_name.tolist())
    feature_names = vectorizer.get_feature_names_out()
    dense = vectors.todense()
    denselist = dense.tolist()
    group_tfidf.append(pd.DataFrame(denselist, columns=feature_names))
    features.append(feature_names)

In [64]:
for i in range(len(groups)):
    group_tfidf[i]['product_name'] = groups[i].product_name.tolist()

In [65]:
from sklearn.metrics.pairwise import cosine_similarity
import scipy

In [66]:
group_sparse_mat = []
for i in range(len(group_tfidf)):
    prod_sparse = scipy.sparse.csr_matrix(group_tfidf[i][features[i].tolist()].values)
    group_sparse_mat.append(prod_sparse)

In [67]:
def mean_cos_sim(row, list_final):
    prod_name = row.name
    cols = row.index.tolist()
    cols.remove(prod_name)
    mean_cos_sim = np.mean(row[cols])
    list_final.append(mean_cos_sim)

In [68]:
grp_cos_sim = []
for i in range(len(groups)):
    cosine_sim = pd.DataFrame(cosine_similarity(group_sparse_mat[i]))
    cosine_sim.index = groups[i].text.tolist()
    cosine_sim.columns = groups[i].text.tolist()
    mean_cos_sim_list = []
    cosine_sim.apply(lambda x: mean_cos_sim(x, mean_cos_sim_list), axis=1)
    grp_cos_sim.append(mean_cos_sim_list)

In [69]:
for i in range(len(groups)):
    groups[i]['prod_tfidf_cosine_sim'] = grp_cos_sim[i]
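
The "mean similarity to the rest of the group" step can also be written position-based rather than label-based, which keeps working when two rows in a group have the same text. This is a minimal sketch with a made-up 3x3 similarity matrix; `mean_similarity_excluding_self` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def mean_similarity_excluding_self(sim):
    """Row-wise mean of a square similarity matrix, ignoring the diagonal."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    if n < 2:
        return np.zeros(n)  # a single item has nothing to be compared with
    return (sim.sum(axis=1) - np.diag(sim)) / (n - 1)

toy_sim = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
print(mean_similarity_excluding_self(toy_sim))
```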

#### Adj as feature

In [70]:
group_tfidf = []
features = []
for group in groups:
    vectorizer = TfidfVectorizer(analyzer='word')
    vectors = vectorizer.fit_transform(group.adj.tolist())
    feature_names = vectorizer.get_feature_names_out()
    dense = vectors.todense()
    denselist = dense.tolist()
    group_tfidf.append(pd.DataFrame(denselist, columns=feature_names))
    features.append(feature_names)

In [71]:
for i in range(len(groups)):
    group_tfidf[i]['adj'] = groups[i].adj.tolist()

In [72]:
group_sparse_mat = []
for i in range(len(group_tfidf)):
    prod_sparse = scipy.sparse.csr_matrix(group_tfidf[i][features[i].tolist()].values)
    group_sparse_mat.append(prod_sparse)

In [73]:
def mean_cos_sim(row, list_final):
    prod_name = row.name
    cols = row.index.tolist()
    cols.remove(prod_name)
    mean_cos_sim = np.mean(row[cols])
    list_final.append(mean_cos_sim)

In [74]:
grp_cos_sim = []
for i in range(len(groups)):
    cosine_sim = pd.DataFrame(cosine_similarity(group_sparse_mat[i]))
    cosine_sim.index = groups[i].text.tolist()
    cosine_sim.columns = groups[i].text.tolist()
    mean_cos_sim_list = []
    cosine_sim.apply(lambda x: mean_cos_sim(x, mean_cos_sim_list), axis=1)
    grp_cos_sim.append(mean_cos_sim_list)

In [75]:
for i in range(len(groups)):
    groups[i]['adj_tfidf_cosine_sim'] = grp_cos_sim[i]

### Bert embeddings

Use a pre-trained BERT model to embed the product names.

In [76]:
from transformers import BertModel, BertTokenizerFast
import torch

# Load tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Load pre-trained model (weights)
emb_model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in evaluation mode, which disables dropout so the outputs are deterministic.
emb_model.eval()


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

#### Product names for Bert embeddings

In [77]:
for group in groups:
    texts = group.product_name.tolist()
    tokenized_text = list(map(tokenizer.tokenize, texts))
    indexed_tokens = list(map(tokenizer.convert_tokens_to_ids, tokenized_text))
    
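    # All-ones mask: each name is encoded on its own, so there is no padding to hide.
    # The second positional argument of BertModel.forward is attention_mask (not
    # token_type_ids), so the mask is passed by keyword to make that explicit.
    attention_masks = list(map(lambda x: [1] * len(x), indexed_tokens))
    
    tokens_tensor = [torch.tensor([x]) for x in indexed_tokens]
    mask_tensors = [torch.tensor([x]) for x in attention_masks]
    
    outputs = []
    for i in range(len(tokens_tensor)):
        with torch.no_grad():
            outputs.append(emb_model(tokens_tensor[i], attention_mask=mask_tensors[i]))
            
    # x[2] is the tuple of hidden states; take the second-to-last layer of the single batch item
    token_vecs = list(map(lambda x: x[2][-2][0], outputs))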
    sentence_embeddings = list(map(lambda x: torch.mean(x, dim=0), token_vecs))
    
    sparse_bert = scipy.sparse.csr_matrix(torch.stack(sentence_embeddings))
    cosine_sim = pd.DataFrame(cosine_similarity(sparse_bert))
    cosine_sim.columns = group.text.unique().tolist()
    cosine_sim.index = group.text.unique().tolist()
    
    mean_cos_sim_list = []
    cosine_sim.apply(lambda x: mean_cos_sim(x, mean_cos_sim_list), axis=1)
    
    group['prodname_bert_cosine_sim'] = mean_cos_sim_list

## Fuzzy Matching

### Levenshtein Distance

Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

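The higher the Levenshtein distance, the more **dissimilar** the words. The cell below therefore inverts each mean distance and rescales it by the group maximum, so that the stored feature behaves like a similarity in (0, 1].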

In [78]:
import Levenshtein as lev

In [79]:
for group in groups:
    product_names = group.product_name.unique().tolist()
    text_mat = {}
    for i in product_names:
        curr_row = []
        for j in product_names:
            curr_row.append(lev.distance(i, j))
        text_mat[i] = curr_row
    
    mean_lev_list = []
    for i, row in group.iterrows():
        prod_name = row.product_name
        mean_lev = np.mean(text_mat[prod_name])
        mean_lev_list.append(mean_lev)
        
    mean_lev_list = list(map(lambda x: (1/x), mean_lev_list))
    max_lev_dist = max(mean_lev_list)
    mean_lev_list = list(map(lambda x: x/max_lev_dist, mean_lev_list))
    
    group['prodname_lev_dist'] = mean_lev_list
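
For reference, the distance itself can be sketched in pure Python with the usual dynamic-programming recurrence. The conversion to a similarity here uses `1 / (1 + mean distance)`, which is my own variation: it avoids a division by zero when a group contains a single unique name, and it excludes the zero self-distance from the mean, unlike the cell above. The toy names are loosely based on the problem statement.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def lev_similarity_to_group(names):
    """Map each unique name to 1 / (1 + mean distance to the other names)."""
    unique = list(dict.fromkeys(names))
    result = {}
    for name in unique:
        dists = [levenshtein(name, other) for other in unique if other != name]
        mean_dist = sum(dists) / len(dists) if dists else 0.0
        result[name] = 1.0 / (1.0 + mean_dist)
    return result

toy_names = ["SEAGULL", "SEAQULL", "SEA GULL", "MANGKOK"]
lev_scores = lev_similarity_to_group(toy_names)
print(min(lev_scores, key=lev_scores.get))  # least similar to the rest of the group
```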

### Jaro Winkler

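Jaro–Winkler is a similarity measure that extends the Jaro similarity with a prefix scale p, which gives more favourable ratings to strings that match from the beginning, up to a set prefix length ℓ.

The higher the Jaro–Winkler similarity, the more **similar** the words. The cell below stores `1 - similarity`, i.e. a distance, so a higher `prodname_jaro_dist` means a name is more **dissimilar** to its group. Note that this column is later added to `overall_similarity` with a positive weight, so it pulls in the opposite direction to the other features; storing the similarity itself (dropping the `1 -`) would make it consistent.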

In [80]:
import jaro

In [81]:
for group in groups:
    product_names = group.product_name.unique().tolist()
    text_mat = {}
    for i in product_names:
        curr_row = []
        for j in product_names:
            curr_row.append(1 - jaro.jaro_winkler_metric(i, j))
        text_mat[i] = curr_row
    
    # Use distinct names here: assigning to mean_cos_sim would overwrite the helper function defined earlier
    mean_jaro_list = []
    for i, row in group.iterrows():
        prod_name = row.product_name
        mean_jaro = np.mean(text_mat[prod_name])
        mean_jaro_list.append(mean_jaro)
    
    group['prodname_jaro_dist'] = mean_jaro_list
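
To make the prefix bonus concrete, here is a from-scratch sketch of the textbook Jaro and Jaro–Winkler definitions (prefix scale p = 0.1, prefix length capped at 4). It is meant as an illustration of the formula and may differ in small details from the `jaro` package used above.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order
    out_of_order, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler_similarity(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)

print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))
```

The shared prefix "MAR" lifts the score above the plain Jaro value, which is the behaviour that favours names starting the same way.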

# Final anomaly detection

In [82]:
weights = {
    'prod_tfidf_cosine_sim': 0.35,
    'adj_tfidf_cosine_sim': 0.2,
    'prodname_bert_cosine_sim': 0.35, 
    'prodname_lev_dist': 0.05, 
    'prodname_jaro_dist': 0.05
}

In [83]:
for group in groups:
    # Weighted sum of all features listed in `weights`
    group['overall_similarity'] = sum(group[col] * w for col, w in weights.items())

In [84]:
lemon_group.sort_values(by='overall_similarity')

| | labels | text | preds | product_name | adj | product_group | prod_tfidf_cosine_sim | adj_tfidf_cosine_sim | prodname_bert_cosine_sim | prodname_lev_dist | prodname_jaro_dist | overall_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | [B-PROD, I-PROD] | Cheese Tart | [B-PROD, I-PROD] | Cheese Tart | | cheese | 0.0 | 0.0 | 0.545376 | 0.719512 | 0.535482 | 0.253631 |
| 18 | [B-PROD, B-PROD, B-PROD] | APPLE CREAMCHEESE PASTRY | [B-PROD, B-PROD, I-PROD] | APPLE CREAMCHEESE PASTRY | | cheese | 0.0 | 0.0 | 0.738729 | 0.393333 | 0.557525 | 0.306098 |
| 15 | [B-PROD, I-PROD, I-PROD] | Nasi Uduk Ayam | [B-PROD, I-PROD, I-PROD] | Nasi Uduk Ayam | | ayam | 0.033675 | 0.0 | 0.742928 | 0.631016 | 0.530751 | 0.329899 |
| 11 | [O, B-PROD, B-PROD] | RTD Lemongrass Aloe | [O, B-PROD, B-PROD] | Lemongrass Aloe | | lemon | 0.0 | 0.0 | 0.793153 | 0.698225 | 0.377544 | 0.331392 |
| 16 | [B-PROD, I-PROD, I-PROD] | Nasi Ayam Dewata | [B-PROD, I-PROD, I-PROD] | Nasi Ayam Dewata | | ayam | 0.033675 | 0.0 | 0.765154 | 0.59 | 0.484585 | 0.333319 |
| 2 | [B-PROD, I-PROD] | Tebu Lemon | [B-PROD, I-PROD] | Tebu Lemon | | lemon | 0.138603 | 0.0 | 0.703119 | 0.76129 | 0.539579 | 0.359646 |
| 14 | [B-PROD, I-PROD, O] | JERUK LEMON IMP | [B-PROD, B-PROD, I-PROD] | JERUK LEMON IMP | | lemon | 0.103827 | 0.0 | 0.770414 | 0.611399 | 0.558359 | 0.364472 |
| 12 | [B-ADJ, I-ADJ, B-PROD] | Home Made Lemonade | [B-ADJ, B-ADJ, B-PROD] | Lemonade | Home Made | lemon | 0.055556 | 0.0 | 0.814618 | 0.936508 | 0.355383 | 0.369155 |
| 10 | [O, O, B-ADJ, B-PROD] | S-Lemon Macchiato | [O, O, B-PROD, B-PROD] | Lemon Macchiato | | lemon | 0.138603 | 0.0 | 0.768464 | 0.710843 | 0.372065 | 0.371619 |
| 4 | [B-ADJ, B-PROD] | KOREAN LEMONADE | [B-ADJ, B-PROD] | LEMONADE | KOREAN | lemon | 0.055556 | 0.0 | 0.814618 | 0.808219 | 0.538144 | 0.371879 |


In [85]:
ayam_group.sort_values(by='overall_similarity')

| | labels | text | preds | product_name | adj | product_group | prod_tfidf_cosine_sim | adj_tfidf_cosine_sim | prodname_bert_cosine_sim | prodname_lev_dist | prodname_jaro_dist | overall_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | [B-PROD, I-PROD] | Cheese Tart | [B-PROD, I-PROD] | Cheese Tart | | cheese | 0.0 | 0.0 | 0.589401 | 0.809211 | 0.594619 | 0.276482 |
| 32 | [B-PROD, B-PROD, B-PROD] | APPLE CREAMCHEESE PASTRY | [B-PROD, B-PROD, I-PROD] | APPLE CREAMCHEESE PASTRY | | cheese | 0.0 | 0.0 | 0.670056 | 0.475822 | 0.533433 | 0.284982 |
| 30 | [B-PROD, I-PROD, B-ADJ, I-ADJ, I-ADJ, O] | Lemon Tea (L). | [B-PROD, I-PROD, B-ADJ, I-ADJ, I-ADJ, O] | Lemon Tea | ( L ) | lemon | 0.0 | 0.0 | 0.671457 | 0.901099 | 0.541837 | 0.307157 |
| 29 | [B-ADJ, I-ADJ, B-PROD] | Home Made Lemonade | [B-ADJ, B-ADJ, B-PROD] | Lemonade | Home Made | lemon | 0.0 | 0.0 | 0.681955 | 0.87234 | 0.710442 | 0.317823 |
| 19 | [B-PROD, I-PROD, I-PROD] | Ayam. Kremes | [B-PROD, I-PROD, I-PROD] | Ayam . Kremes | | ayam | 0.122231 | 0.0 | 0.622529 | 0.845361 | 0.449052 | 0.325387 |
| 25 | [B-PROD, I-PROD, O, B-PROD, O, O, O] | TAHU BAKSO Grg Ayam VC [biji | [O, I-PROD, I-PROD, B-PROD, O, O, B-PROD] | BAKSO Grg Ayam biji | | ayam | 0.073112 | 0.0 | 0.724526 | 0.621212 | 0.503741 | 0.335421 |
| 24 | [B-PROD, I-PROD, I-PROD] | Kuah Kaldu Ayam | [B-PROD, I-PROD, I-PROD] | Kuah Kaldu Ayam | | ayam | 0.107234 | 0.0 | 0.752826 | 0.778481 | 0.533754 | 0.366633 |
| 14 | [B-PROD, I-PROD, I-PROD] | Ayam Garang Asem | [B-PROD, I-PROD, I-PROD] | Ayam Garang Asem | | ayam | 0.106469 | 0.0 | 0.774288 | 0.759259 | 0.44378 | 0.368417 |
| 28 | [B-PROD, B-PROD] | DONAT AYAM | [B-PROD, B-PROD] | DONAT AYAM | | ayam | 0.122231 | 0.0 | 0.725247 | 0.931818 | 0.506782 | 0.368547 |
| 20 | [B-PROD, I-PROD] | CEKER AYAM | [B-PROD, I-PROD] | CEKER AYAM | | ayam | 0.122231 | 0.0 | 0.720023 | 0.928302 | 0.565151 | 0.369461 |


In [86]:
cheese_group.sort_values(by='overall_similarity')

| | labels | text | preds | product_name | adj | product_group | prod_tfidf_cosine_sim | adj_tfidf_cosine_sim | prodname_bert_cosine_sim | prodname_lev_dist | prodname_jaro_dist | overall_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | [B-BRAND, B-PROD] | Toblerone BanCheese | [B-BRAND, B-PROD] | BanCheese | | cheese | 0.0 | 0.0 | 0.599189 | 0.842623 | 0.513171 | 0.277506 |
| 12 | [O, O, B-PROD] | GERRY SM CHEESE110 | [O, O, B-PROD] | CHEESE110 | | cheese | 0.0 | 0.0 | 0.618513 | 0.927798 | 0.524804 | 0.289109 |
| 19 | [O, O, O, B-ADJ, B-PROD, I-PROD] | [MD] SOFT STEAMED CHEESEC | [O, O, I-ADJ, B-ADJ, B-PROD, I-PROD] | STEAMED CHEESEC | ] SOFT | cheese | 0.0 | 0.0 | 0.649963 | 0.764881 | 0.512111 | 0.291337 |
| 17 | [B-ADJ, B-PROD] | Cheese Croquette | [B-ADJ, B-PROD] | Croquette | Cheese | cheese | 0.0 | 0.0 | 0.642586 | 0.810726 | 0.595575 | 0.29522 |
| 28 | [B-PROD, I-PROD] | CEKER AYAM | [B-PROD, I-PROD] | CEKER AYAM | | ayam | 0.022196 | 0.0 | 0.635897 | 0.86532 | 0.486673 | 0.297932 |
| 27 | [B-PROD, I-PROD, B-ADJ] | Ayam Sambal Hija | [B-PROD, B-ADJ, B-ADJ] | Ayam | Sambal Hija | ayam | 0.022196 | 0.0 | 0.60587 | 0.767164 | 0.795945 | 0.297979 |
| 20 | [B-BRAND, B-PROD, O, O] | RICHEESE WFR KJ P | [B-PROD, B-PROD, O, O] | RICHEESE WFR | | cheese | 0.0 | 0.0 | 0.659319 | 0.877133 | 0.469955 | 0.298116 |
| 7 | [B-PROD, B-PROD, B-PROD] | APPLE CREAMCHEESE PASTRY | [B-PROD, B-PROD, I-PROD] | APPLE CREAMCHEESE PASTRY | | cheese | 0.020567 | 0.0 | 0.698382 | 0.542194 | 0.454073 | 0.301446 |
| 29 | [B-ADJ, B-ADJ, B-PROD, I-PROD] | MED ICED LEMON TEA | [B-ADJ, B-ADJ, B-PROD, I-PROD] | LEMON TEA | MED ICED | lemon | 0.01478 | 0.0 | 0.674493 | 0.883162 | 0.562459 | 0.313526 |
| 30 | [O, O, B-ADJ, B-PROD] | S-Lemon Macchiato | [O, O, B-PROD, B-PROD] | Lemon Macchiato | | lemon | 0.01478 | 0.0 | 0.700448 | 0.660668 | 0.608976 | 0.313812 |


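Among the three product groups, the Ayam group separates cleanly: all four injected outliers score below roughly 0.32, and every genuine member shown scores above it. In the Lemon group the four outliers are among the five lowest scores, but two of them (0.330 and 0.333) sit above 0.32 with one genuine member (RTD Lemongrass Aloe, 0.331) between them, so a slightly higher threshold catches all four at the cost of one false positive. In the Cheese group the method fails: the four outliers score between 0.298 and 0.314, while four genuine cheese items score lower still (0.278 to 0.295). In all three groups the ten lowest scores lie between 0.25 and 0.38, which leaves little margin for a fixed decision boundary.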
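
The threshold rule can be checked against the scores in the tables above. The sketch below copies the ten lowest `overall_similarity` values of the Ayam and Lemon groups and measures precision and recall at two thresholds; the helper names are mine. Since both thresholds lie below the tenth score in each table, the rows not shown cannot be flagged.

```python
def flag_outliers(scores, threshold):
    """Names whose overall similarity falls below the threshold."""
    return [name for name, score in scores.items() if score < threshold]

def precision_recall(flagged, true_outliers):
    flagged, true_outliers = set(flagged), set(true_outliers)
    hits = len(flagged & true_outliers)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_outliers) if true_outliers else 0.0
    return precision, recall

# Ten lowest scores per group, copied from the result tables
ayam_scores = {
    "Cheese Tart": 0.276482, "APPLE CREAMCHEESE PASTRY": 0.284982,
    "Lemon Tea (L).": 0.307157, "Home Made Lemonade": 0.317823,
    "Ayam. Kremes": 0.325387, "TAHU BAKSO Grg Ayam VC [biji": 0.335421,
    "Kuah Kaldu Ayam": 0.366633, "Ayam Garang Asem": 0.368417,
    "DONAT AYAM": 0.368547, "CEKER AYAM": 0.369461,
}
lemon_scores = {
    "Cheese Tart": 0.253631, "APPLE CREAMCHEESE PASTRY": 0.306098,
    "Nasi Uduk Ayam": 0.329899, "RTD Lemongrass Aloe": 0.331392,
    "Nasi Ayam Dewata": 0.333319, "Tebu Lemon": 0.359646,
    "JERUK LEMON IMP": 0.364472, "Home Made Lemonade": 0.369155,
    "S-Lemon Macchiato": 0.371619, "KOREAN LEMONADE": 0.371879,
}
ayam_true = ["Cheese Tart", "APPLE CREAMCHEESE PASTRY", "Lemon Tea (L).", "Home Made Lemonade"]
lemon_true = ["Cheese Tart", "APPLE CREAMCHEESE PASTRY", "Nasi Uduk Ayam", "Nasi Ayam Dewata"]

print(precision_recall(flag_outliers(ayam_scores, 0.32), ayam_true))
print(precision_recall(flag_outliers(lemon_scores, 0.32), lemon_true))
print(precision_recall(flag_outliers(lemon_scores, 0.335), lemon_true))
```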

# Future work

## Data

* The data was labelled by only one person, which can cause inconsistencies: for example, a misclick could label a product name as an adjective. Having more than one person label the same data would help, with the majority vote taken as the actual label.
* More text preprocessing could be done, for example removing extra non-alphanumeric characters.
* The volume of data affects model performance: with too small a corpus, the model overfits more easily.
* BIO tagging may not be the best scheme here; IO tagging could yield better results: https://www.sciencedirect.com/science/article/pii/S1110866520301596

## Model

* A wider hyperparameter search could be done, e.g. over learning rate, learning-rate schedule, number of epochs, batch size, and optimizer
* K-fold cross validation could be used since the dataset is small

## Embeddings + Similarity Score

* Instead of a fixed weighted score with a threshold, a clustering method might find the outliers better. Some examples are:
    * K-means clustering
    * Hierarchical: agglomerative vs divisive
* More feature engineering could be done
    * n_grams
    * MSE distance function
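
As a rough illustration of the n-gram idea, the sketch below scores the names from the problem statement with a character-trigram Jaccard similarity, which tolerates OCR-style typos such as SEAQULL vs SEAGULL. This is a simplified stand-in I chose for illustration (not character-level TF-IDF), and the helper names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased, whitespace-normalised text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Share of n-grams the two strings have in common."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0

def mean_similarity_to_group(names):
    """Map each name to its mean Jaccard similarity with the other names."""
    result = {}
    for i, name in enumerate(names):
        others = [jaccard(name, other) for j, other in enumerate(names) if j != i]
        result[name] = sum(others) / len(others) if others else 0.0
    return result

# Names from the problem statement
seagull_group = [
    "SEAGULL NATPH 25g WRNA / PCS",
    "SEA GULL WARNA RENTENG",
    "MANGKOK SAMBALL ALL VAR",
    "SEAQULL NAPT WARNA 25GR",
    "- SEA GULL NAPHT 25GR SG-519W 1PCSX 1.500,00:",
]
ngram_scores = mean_similarity_to_group(seagull_group)
print(min(ngram_scores, key=ngram_scores.get))  # the name least like the rest
```

On this small group the name with the lowest mean similarity is the intuitive outlier, without any NER step; whether that holds on noisier groups would need to be tested.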