# Marketplace Category Validation

    This notebook validates the integrity of the product category assignments. The input file is provided by the marketplace team, and the script uses the following columns from it:
    1. current_categories
    2. new_categories_code
    3. new_categories_label
    4. name-en_US-CDS
    5. name-th_TH-CDS
    6. description-en_US-CDS

    The text in these columns is split into tokens (individual words), lemmatized (inflected forms such as plurals are reduced to their dictionary form), and stemmed (suffixes are stripped so that related words share the same root).
***
    The script then counts how many tokens the new categories share with the current categories, the product names, and the product description. Many shared tokens suggest that the new category is correct; no shared tokens suggest that it is incorrect. The counts are combined into 2 scores.
    1. Score 1 = 
        word match between "current category" and "new category (code and label)"
        + word match between "product name (English and Thai)" and "new category (code and label)"
        + word match between "product description" and "new category (code and label)"
    
    2. Score 2 = 
        word match between "current category" and "new category (label)"
        + word match between "product name (English and Thai)" and "new category (label)"
        + word match between "product description" and "new category (label)"

    The only difference between the two is that Score 2 leaves the new category code out of the word matching. As a result, Score 2 is always less than or equal to Score 1.
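As a rough illustration of the two scores, the sketch below counts shared tokens with set intersections. The token sets and the helper name `shared` are made up for this example; only the pairing of sources and targets follows the Score 1 and Score 2 definitions above.

```python
def shared(a, b):
    """Number of tokens that appear in both collections."""
    return len(set(a) & set(b))

# Hypothetical, already-normalized tokens for one product row.
current_cat = {"beauti", "makeup", "brush"}
name = {"oval", "shadow", "brush"}
description = {"face", "brush", "color"}
new_code = {"beauti", "makeup", "tool", "brush"}
new_label = {"beauti", "makeup", "makeuptool", "facebrush"}

sources = [current_cat, name, description]

# Score 2 only compares against the new category label.
score2 = sum(shared(s, new_label) for s in sources)
# Score 1 adds the matches against the new category code on top.
score1 = score2 + sum(shared(s, new_code) for s in sources)

print(score1, score2)  # 7 2
```

Because Score 1 is Score 2 plus a non-negative number of code matches, it can never be smaller than Score 2.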
    
***
    The result of this script is a status for each product (row), which takes one of 3 values.
    1. confident = Score 1 > 0 and Score 2 > 0
    2. medium = Score 1 > 0 and Score 2 = 0
    3. suspect = Score 1 = 0 and Score 2 = 0
    
    The statuses "confident" and "suspect" are highly accurate; the status "medium" is not and needs manual verification.

In [23]:
import pandas as pd
import re
import os

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

In [10]:
INPUT_FILE = 'data/taxonomy_merge.xlsx'

In [11]:
## Hard-coded related words. If a stemmed category token matches a key (left), the words in its list (right) are added to the token set.
WORD_ADD = {
    'footwear': ['shoe'],
    'sleepwear': ['pyjama', 'nightdress', 'sleep'],
#     'pant': ['short'],
    'tumbler': ['bottle'],
    'contactlen': ['คอนแทคเลนส์สายตา'],
    'contact': ['คอนแทคเลนส์สายตา', 'contactlen'],
    'waterheat': ['เครื่องทำน้ำอุ่น'],
    'facemask': ['mask', 'หน้ากากผ้า', 'หน้ากากผ้า1pc'],
}
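To make the effect of `WORD_ADD` concrete, here is a minimal sketch of the expansion step on already-stemmed tokens. The function name `expand_tokens` and the reduced dictionary `WORD_ADD_DEMO` are illustrative only. The sketch follows chained entries (an added word that is itself a key), which the in-place loop in `tokennize_cat_path` also appears to do, and it guards against cycles, which is an extra assumption.

```python
# Reduced copy of the mapping, for illustration only.
WORD_ADD_DEMO = {
    "footwear": ["shoe"],
    "contact": ["คอนแทคเลนส์สายตา", "contactlen"],
    "contactlen": ["คอนแทคเลนส์สายตา"],
}

def expand_tokens(tokens, word_add):
    """Return the tokens plus every related word reachable through word_add."""
    result = set(tokens)
    pending = list(tokens)
    while pending:
        word = pending.pop()
        for extra in word_add.get(word, []):
            if extra not in result:  # guards against cycles and duplicates
                result.add(extra)
                pending.append(extra)
    return result

print(sorted(expand_tokens({"footwear", "men"}, WORD_ADD_DEMO)))
# ['footwear', 'men', 'shoe']
```

Tokens without an entry in the mapping pass through unchanged.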

In [12]:
data = pd.read_excel(INPUT_FILE, index_col=0)

In [13]:
data.head()

Unnamed: 0,sku,current_categories,new_categories_code,new_categories_label,Source.Name,brand_name-CDS,content_record_id-CDS,description-en_US-CDS,description-th_TH-CDS,dimension_depth-CDS,...,enable_on_channel-CDS,group_name-CDS,name-en_US-CDS,name-th_TH-CDS,name_common-en_US,name_common-th_TH,package_dimention-en_US-CDS,package_dimention-th_TH-CDS,parent,unit_dimension-CDS
0,CDS10000946,"CDS,EOR,EOR_CDS_8,EOR_CDS_8_801,EOR_CDS_8_801_...","electronic_gadgets,home_appliances","Electronic & Gadgets,Home Appliances",1_products_export_grid_context_en_US_CDS_2023-...,FORKITS,taxonomy,<p><strong>TheForkit’s 20-Speed Hand Blender f...,<p><strong>เครื่องปั่นมือถืออเนกประสงค์ <span ...,0.0,...,1,เครื่องปั่นมือถืออเนกประสงค์,เครื่องปั่นมือถืออเนกประสงค์,เครื่องปั่นมือถืออเนกประสงค์,เครื่องปั่นมือถืออเนกประสงค์,เครื่องปั่นมือถืออเนกประสงค์,14_5_x_21_5_x_7_5_cm_,14_5_x_21_5_x_7_5_cm_,,
1,CDS10032282,"CDS,EOR,EOR_CDS_9,EOR_CDS_9_904,EOR_CDS_9_904_...",apparel_accessories__clothing,Clothing,1_products_export_grid_context_en_US_CDS_2023-...,,taxonomy,<p> This cap sleeves T-shirt features a sweet ...,<p> เสื้อยืดแขนกุดเด็กหญิ<strong>ง</strong><st...,33.0,...,1,,เสื้อยืดแขนกุดพิมพ์ลาย BL 14,เสื้อยืดแขนกุดพิมพ์ลาย BL 14,เสื้อยืดแขนกุดพิมพ์ลาย BL 14,เสื้อยืดแขนกุดพิมพ์ลาย BL 14,14_5_x_21_5_x_7_5_cm_,14_5_x_21_5_x_7_5_cm_,,
2,CDS10033555,"CDS,EOR,EOR_CDS_1,EOR_CDS_1_104,EOR_CDS_1_104_...","beauty,beauty__makeup,makeup__makeup_tools,mak...","Beauty,Makeup,Makeup Tools,Face Brushes,Tools",1_products_export_grid_context_en_US_CDS_2023-...,REAL_TECHNIQUES,taxonomy,"<p><strong><span class=""caps"">REAL</span> <spa...","<p><strong><span class=""caps"">REAL</span> <spa...",0.0,...,1,แปรงรองพื้น 101 Triangle Foundation Brush,แปรงรองพื้น,แปรงรองพื้น,แปรงรองพื้น,แปรงรองพื้น,14_5_x_21_5_x_7_5_cm_,14_5_x_21_5_x_7_5_cm_,,
3,CDS10033579,"CDS,EOR,EOR_CDS_1,EOR_CDS_1_104,EOR_CDS_1_104_...","beauty,beauty__makeup,makeup__makeup_tools,mak...","Beauty,Makeup,Makeup Tools,Face Brushes,Tools",1_products_export_grid_context_en_US_CDS_2023-...,REAL_TECHNIQUES,taxonomy,"<p><strong><span class=""caps"">REAL</span> <spa...","<p><strong><span class=""caps"">REAL</span> <spa...",0.0,...,1,แปรงแต่งห้นา 300 Tapered Brush,300 Tapered Brush,แปรงแต่งห้นา 300 Tapered Brush,300 Tapered Brush,แปรงแต่งห้นา 300 Tapered Brush,14_5_x_21_5_x_7_5_cm_,14_5_x_21_5_x_7_5_cm_,,
4,CDS10033609,"CDS,EOR,EOR_CDS_1,EOR_CDS_1_104,EOR_CDS_1_104_...","beauty,beauty__makeup,makeup__makeup_tools,mak...","Beauty,Makeup,Makeup Tools,Face Brushes",1_products_export_grid_context_en_US_CDS_2023-...,REAL_TECHNIQUES,taxonomy,"<p><strong><span class=""caps"">REAL</span> <spa...","<p><strong><span class=""caps"">REAL</span> <spa...",0.0,...,1,แปรงแต่งหน้า 200 Oval Shadow Brush,200 Oval Shadow Brush,แปรงแต่งหน้า 200 Oval Shadow Brush,200 Oval Shadow Brush,แปรงแต่งหน้า 200 Oval Shadow Brush,14_5_x_21_5_x_7_5_cm_,14_5_x_21_5_x_7_5_cm_,,


In [14]:
# lemmatizer to lemmatize the words
lemmatizer = WordNetLemmatizer()

# stemmer to stem the words
ps = PorterStemmer()

In [15]:
def tokennize_cat_path(path):
    """ lemmatize and stem categories 
        path = category in string
        output = set of tokens """
    path = str(path)
    words = set(path.replace(',', '//').replace('_', '//').replace('&', '//').replace('  ', '').replace(' ', '').split('//'))
    lemmas = [ps.stem(lemmatizer.lemmatize(w)) for w in words if not w.isnumeric() and w != '' and w==w]
    for w in lemmas:
        if w in WORD_ADD:
            lemmas += WORD_ADD[w]
    return set(lemmas)

def tokennize_product_name(product_name):
    """ lemmatize and stem product name 
        product_name = product name in string
        output = set of tokens """
    product_name = str(product_name)
    words = set(product_name.split(' '))
    lemmas = [ps.stem(lemmatizer.lemmatize(w)) for w in words if not w.isnumeric() and w != '' and w==w]
    return set(lemmas)

def tokenize_description(sentence):
    """ lemmatize and stem product description 
        sentence = product description
        output = set of tokens """
    sentence = str(sentence)
    sentence = re.sub(r'<[^>]*>', ' ', sentence)
    sentence = re.sub(r'\s+', ' ', sentence)
    words = set(sentence.split(' '))
    lemmas = [ps.stem(lemmatizer.lemmatize(w)) for w in words if not w.isnumeric() and w != '' and w==w]
    return set(lemmas)
    
def matching_score(list1, list2):
    """ calculate the number of tokens shared by 2 token collections 
        list1, list2 = token sets output from the previous functions
        output = matching score (number of equal pairs) """
    score = 0
    for w1 in list1:
        for w2 in list2:
            if w1==w2:
                score += 1
    return score
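The HTML stripping and whitespace split in `tokenize_description` can be tried in isolation. The sketch below leaves out lemmatizing and stemming so that it runs without NLTK, which also means case is left unchanged. `split_description` and `split_description_clean` are hypothetical names, and the punctuation-trimming variant is one possible refinement rather than part of the script.

```python
import re
import string

def split_description(html):
    """Same tag removal and whitespace split as tokenize_description."""
    text = re.sub(r'<[^>]*>', ' ', str(html))
    text = re.sub(r'\s+', ' ', text)
    return {w for w in text.split(' ') if w != '' and not w.isnumeric()}

def split_description_clean(html):
    """Variant (assumption): also trim leading/trailing ASCII punctuation."""
    words = (w.strip(string.punctuation) for w in split_description(html))
    return {w for w in words if w != ''}

sample = '<p><strong>Hand blender</strong> with motor, 2 blades.</p>'
print(sorted(split_description(sample)))
# ['Hand', 'blades.', 'blender', 'motor,', 'with']
print(sorted(split_description_clean(sample)))
# ['Hand', 'blades', 'blender', 'motor', 'with']
```

With the plain split, a word followed by a comma or full stop keeps that character, so it cannot match the same word in a category name.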

## Lemmatize and stem the words into tokens (sets of words)

In [16]:
data['current_cat_token'] = data['current_categories'].apply(tokennize_cat_path)
data['new_cat_code_token'] = data['new_categories_code'].apply(tokennize_cat_path)
data['new_cat_label_token'] = data['new_categories_label'].apply(tokennize_cat_path)
data['name_en_token'] = data['name-en_US-CDS'].apply(tokennize_product_name)
data['name_th_token'] = data['name-th_TH-CDS'].apply(tokennize_product_name)
data['description_token'] = data['description-en_US-CDS'].apply(tokenize_description)

In [17]:
data[['name_en_token', 'name_th_token', 'current_cat_token', 'new_cat_code_token', 'new_cat_label_token', 'description_token']].head(5)

Unnamed: 0,name_en_token,name_th_token,current_cat_token,new_cat_code_token,new_cat_label_token,description_token
0,{เครื่องปั่นมือถืออเนกประสงค์},{เครื่องปั่นมือถืออเนกประสงค์},"{eyc, 00001a, cd, sasf, small, eor, cook, appl...","{gadget, home, electron, applianc}","{gadget, electron, homeappli}","{blend, item, hand, motor,, “efficient”, thefo..."
1,"{เสื้อยืดแขนกุดพิมพ์ลาย, bl}","{เสื้อยืดแขนกุดพิมพ์ลาย, bl}","{kid, t, eyc, shirt, cd, sasf, eor, tshirt, sw...","{cloth, accessori, apparel}",{cloth},"{cap, item, thi, babi, light, paul, wring,, sh..."
2,{แปรงรองพื้น},{แปรงรองพื้น},"{face, eyc, makeup, cd, sasf, brush, eor, appl...","{face, person, makeup, brush, care, beauti, tool}","{makeup, makeuptool, facebrush, beauti, tool}","{face, when, item, nose., color,, to, techniqu..."
3,"{brush, taper}","{brush, แปรงแต่งห้นา, taper}","{face, eyc, makeup, cd, sasf, brush, eor, appl...","{face, person, makeup, brush, care, beauti, tool}","{makeup, makeuptool, facebrush, beauti, tool}","{แห้งไว, item, color,, techniqu, ทำความสะอาดง่..."
4,"{oval, brush, shadow}","{oval, brush, shadow, แปรงแต่งหน้า}","{face, eyc, makeup, cd, sasf, brush, eor, appl...","{face, makeup, brush, beauti, tool}","{facebrush, beauti, makeup, makeuptool}","{when, face, item, oval, color,, to, techniqu,..."
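
The difference between the code tokens and the label tokens in the preview above comes from how `tokennize_cat_path` splits its input. The sketch below reproduces only the splitting step (no lemmatizing or stemming, so it runs without NLTK); `split_cat_path` is a hypothetical name.

```python
def split_cat_path(path):
    """Split a category string like tokennize_cat_path does, without stemming."""
    path = str(path)
    # Commas, underscores and ampersands become separators.
    for sep in (',', '_', '&'):
        path = path.replace(sep, '//')
    # Spaces are removed, not treated as separators.
    path = path.replace(' ', '')
    return {w for w in path.split('//') if w != '' and not w.isnumeric()}

print(sorted(split_cat_path('electronic_gadgets,home_appliances')))
# ['appliances', 'electronic', 'gadgets', 'home']
print(sorted(split_cat_path('Electronic & Gadgets,Home Appliances')))
# ['Electronic', 'Gadgets', 'HomeAppliances']
```

Because spaces are removed rather than used as separators, a multi-word label such as "Home Appliances" becomes one fused token. That is why row 0 shows `homeappli` for the label but `home` and `applianc` for the code, and it is one reason a row can match on the code without matching on the label.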


## Calculate the matching scores

In [18]:
# current cats vs new cats
data['score_current_new_code'] = data.apply(lambda row: matching_score(row['current_cat_token'], row['new_cat_code_token']), axis=1)
data['score_current_new_label'] = data.apply(lambda row: matching_score(row['current_cat_token'], row['new_cat_label_token']), axis=1)

# current product name (Eng) vs new cats
data['score_name_new_code'] = data.apply(lambda row: matching_score(row['name_en_token'], row['new_cat_code_token']), axis=1)
data['score_name_new_label'] = data.apply(lambda row: matching_score(row['name_en_token'], row['new_cat_label_token']), axis=1)

# current product name (TH) vs new cats
data['score_name_th_new_code'] = data.apply(lambda row: matching_score(row['name_th_token'], row['new_cat_code_token']), axis=1)
data['score_name_th_new_label'] = data.apply(lambda row: matching_score(row['name_th_token'], row['new_cat_label_token']), axis=1)

# current product description (Eng) vs new cats
data['score_description_new_code'] = data.apply(lambda row: matching_score(row['description_token'], row['new_cat_code_token']), axis=1)
data['score_description_new_label'] = data.apply(lambda row: matching_score(row['description_token'], row['new_cat_label_token']), axis=1)

## Calculate Score 1 and Score 2

In [19]:
SUM_COLS1 = [
    'score_current_new_code', 
    'score_current_new_label', 
    'score_name_new_code', 
    'score_name_new_label', 
    'score_description_new_code', 
    'score_description_new_label',
    'score_name_th_new_code', 
    'score_name_th_new_label'
]

SUM_COLS2 = [
    'score_current_new_label', 
    'score_name_new_label', 
    'score_description_new_label',
    'score_name_th_new_label',
    
]


In [20]:
data['total_score1'] = data[SUM_COLS1].sum(axis=1)
data['total_score2'] = data[SUM_COLS2].sum(axis=1)

## Define Status (Confident / Medium / Suspect)

In [21]:
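def _classify_cat(t1,t2):
    """ map total_score1 (t1) and total_score2 (t2) to a status """
    if t1==0:
        # no match at all, not even through the category code
        return 'suspect'
    elif t2==0:
        # matches come only from the category code, none from the label
        return 'medium'
    else:
        return 'confident'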
    
data['status'] = data.apply(lambda row: _classify_cat(row['total_score1'], row['total_score2']), axis=1)

## Save output

In [24]:
OUTPUT_PATH = r'output/' 
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [25]:
out_cols = [
    'sku',
    'current_categories',
    'new_categories_code',
    'new_categories_label',
    # 'Source.Name',
    'brand_name-CDS',
    # 'content_record_id-CDS',
    'description-en_US-CDS',
    # 'description-th_TH-CDS', 'dimension_depth-CDS', 'dimension_height-CDS',
    # 'dimension_width-CDS', 'enable_on_channel-CDS',
    'group_name-CDS',
    'name-en_US-CDS', 'name-th_TH-CDS', 'name_common-en_US',
    'name_common-th_TH', 'package_dimention-en_US-CDS',
    'package_dimention-th_TH-CDS', 'parent', 'unit_dimension-CDS', 'status',
]

In [26]:
data[out_cols].to_csv(OUTPUT_PATH + 'full_result.csv', encoding='utf-8-sig')

In [27]:
data[data['status']=='suspect'].to_csv(OUTPUT_PATH + 'suspect.csv', encoding='utf-8-sig')
data[data['status']=='medium'].to_csv(OUTPUT_PATH + 'medium.csv', encoding='utf-8-sig')
data[data['status']=='confident'].to_csv(OUTPUT_PATH + 'confident.csv', encoding='utf-8-sig')

## Analysis results

In [28]:
print('Total number of data', data.shape[0])

Total number of data 233842


In [29]:
print('The number of each status')
data['status'].value_counts()

The number of each status


confident    215409
suspect       13362
medium         5071
Name: status, dtype: int64

In [30]:
print('Share of each status (fraction of all rows)')
data['status'].value_counts() / data.shape[0]

Share of each status (fraction of all rows)


confident    0.921173
suspect      0.057141
medium       0.021686
Name: status, dtype: float64
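
The shares above can be cross-checked from the raw counts. The snippet below hard-codes the counts and the row total printed earlier and formats them as percentages; the variable names are illustrative.

```python
# Counts and total copied from the outputs above.
counts = {"confident": 215409, "suspect": 13362, "medium": 5071}
total = 233842

# Every row receives exactly one status, so the counts must add up.
assert sum(counts.values()) == total

shares = {status: n / total for status, n in counts.items()}
for status, share in shares.items():
    print(f"{status:<10} {share:.2%}")
# confident  92.12%
# suspect    5.71%
# medium     2.17%
```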