<center><h1>Feature Engineering of Lazada CIKM Text</h1></center>
<center><h2>A Text Mining Exercise</h2></center>
<center><h2>Part 2</h2></center>

Features $f_{17}$ to $f_{21}$ from the CIKM AnalytiCup 2017 submission are derived from the product title. Each one summarizes the pairwise Jaro-Winkler similarity between the words of a title.
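
Reading the definitions off the helper functions in the next section, the five features can be written compactly. The notation is introduced here for illustration and is not taken from the original submission: $w_1, \dots, w_n$ are the tokens of a title, $P = \{(i, j) : 1 \le i < j \le n\}$ is the set of word pairs with $|P| = n(n-1)/2$, and $s_{ij}$ is the Jaro-Winkler similarity of $w_i$ and $w_j$. The thresholds 0.85 and 0.10 are the default arguments used in the code.

$$
\begin{aligned}
f_{17} &= \frac{1}{|P|} \sum_{(i,j) \in P} s_{ij} \\
f_{18} &= \frac{\left|\{(i,j) \in P : s_{ij} \ge 0.85\}\right|}{|P|} \\
f_{19} &= \frac{\left|\{(i,j) \in P : s_{ij} \le 0.10\}\right|}{|P|} \\
f_{20} &= \frac{\left|\{(i,j) \in P : s_{ij} = 1\}\right|}{|P|} \\
f_{21} &= \sum_{(i,j) \in P} s_{ij}
\end{aligned}
$$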

In [1]:
import re
import pandas as pd
from nltk.metrics import distance
from nltk.tokenize import word_tokenize
from itertools import combinations
# Show full column contents; None replaces the deprecated -1 in pandas >= 1.0
pd.set_option('display.max_colwidth', None)

import lzd_utils
df = lzd_utils.read_lazada_csv()
df.head()

Unnamed: 0,country,sku_id,title,category_lvl_1,category_lvl_2,category_lvl_3,desc,price,xb
0,my,AD674FAASTLXANMY,Adana Gallery Suri Square Hijab – Light Pink,Fashion,Women,Muslim Wear,<ul><li>Material : Non sheer shimmer chiffon</li><li>Sizes : 52 x 52 inches OR 56 x 56 inches</li><li>Cut with curved ends</li></ul>,49.0,local
1,my,AE068HBAA3RPRDANMY,Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz,Health & Beauty,Bath & Body,Hand & Foot Care,"Formulated with oil-free hydrating botanicals/ Remarkably improves skin texture of abused hands/Restores soft, smooth & refined hands",128.0,international
2,my,AN680ELAA9VN57ANMY,Andoer 150cm Cellphone Smartphone Mini Dual-Headed Omni-Directional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,"TV, Audio / Video, Gaming & Wearables",Audio,Live Sound & Stage,"<ul> <li>150cm mini microphone compatible for iPhone, various smartphones, and also for iPad/ Apple computer/ Macbook.</li> <li>Dual-headed design, allows for two people using simultaneously.</li> <li>Features high sensitivity &amp; omni-directional sounds output, perfect for audio and video recording.</li> <li>3.5mm standard connector jack.</li> <li>Convenient clip-on design, can clip it on your collar.</li> <li>3.5mm standard connector jack. Convenient clip-on design, can clip it on your collar.</li> </ul>",25.07,international
3,my,AN957HBAAAHDF4ANMY,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520ml + Conditioner 250ml),Health & Beauty,Hair Care,Shampoos & Conditioners,<ul> <li>ANMYNA Complaint Silky Set (Shampoo 520ml + Conditioner 250ml)</li> <li>Deep nourish</li> <li>Repair damaged hair</li> <li>Protect the scalp and prevent hair loss</li> </ul>,118.0,local
4,my,AR511HBAXNWAANMY,Argital Argiltubo Green Clay For Face and Body 250ml,Health & Beauty,Men's Care,Body and Skin Care,<ul> <li>100% Authentic</li> <li>Rrefresh and brighten skin</li> <li>Anti-wrinkle and deep cleansing effects</li> </ul>,114.8,international


### Title - Similarity Scores

In [2]:
def remove_non_alphanumeric_chars(x):
    """Return x with every character other than ASCII letters, digits and whitespace removed."""
    return re.sub(r'[^\s0-9a-zA-Z]', '', x)

def get_all_similarities(t):
    """Given a title string, return the Jaro-Winkler similarity of every pair of its tokens."""
    # combinations(..., 2) yields each unordered pair of token positions exactly once
    all_pairs = combinations(word_tokenize(t), 2)
    return [distance.jaro_winkler_similarity(a, b) for a, b in all_pairs]

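# Note: these helpers are named "distance", but they aggregate Jaro-Winkler
# similarities (1.0 means identical words). The first four divide by the number
# of word pairs, so a title with fewer than two tokens raises ZeroDivisionError.
def feat_eng__mean_distance(t):
    """f17: mean similarity over all word pairs."""
    s = get_all_similarities(t)
    return sum(s)/len(s)

def feat_eng__pct_max_distance_count(t, threshold=0.85):
    """f18: share of word pairs with similarity >= threshold."""
    s = get_all_similarities(t)
    return len([i for i in s if i>=threshold])/len(s)

def feat_eng__pct_min_distance_count(t, threshold=0.10):
    """f19: share of word pairs with similarity <= threshold."""
    s = get_all_similarities(t)
    return len([i for i in s if i<=threshold])/len(s)

def feat_eng__pct_full_string_count(t):
    """f20: share of word pairs that are identical (similarity == 1.0)."""
    s = get_all_similarities(t)
    return len([i for i in s if i==1.00])/len(s)

def feat_eng__sum_distances(t):
    """f21: sum of similarities over all word pairs."""
    s = get_all_similarities(t)
    return sum(s)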
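
To make the similarity score less of a black box, here is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler formulas. It is not NLTK's code: the function names `jaro` and `jaro_winkler`, the scaling factor `p=0.1` and the prefix cap of 4 are assumptions based on the commonly cited definition, and NLTK's `jaro_winkler_similarity` may differ in edge cases.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order; mismatches are half-transpositions
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score for strings sharing a common prefix (up to max_prefix chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("martha", "marhta"), 4))  # classic textbook pair
print(jaro_winkler("tony", "tony"))                # identical words score 1.0
```

The prefix boost is why words that start the same way (for example `corset` and `corselet`) tend to score higher than words that merely share letters.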

In [3]:
t1 = remove_non_alphanumeric_chars("ANIME ZONE One Piece Anime Lovely Tony Tony Chopper Trendy Bifold Casual Leather Wallet").lower()
print(word_tokenize(t1))
print(list(combinations(word_tokenize(t1),2)))
print(["{:.4f}".format(i) for i in get_all_similarities(t1)])
print("{:4.3f}".format(feat_eng__mean_distance(t1)))
print("{:4.3f}".format(feat_eng__pct_max_distance_count(t1)))
print("{:4.3f}".format(feat_eng__pct_min_distance_count(t1)))
print("{:4.3f}".format(feat_eng__pct_full_string_count(t1)))
print("{:4.3f}".format(feat_eng__sum_distances(t1)))


['anime', 'zone', 'one', 'piece', 'anime', 'lovely', 'tony', 'tony', 'chopper', 'trendy', 'bifold', 'casual', 'leather', 'wallet']
[('anime', 'zone'), ('anime', 'one'), ('anime', 'piece'), ('anime', 'anime'), ('anime', 'lovely'), ('anime', 'tony'), ('anime', 'tony'), ('anime', 'chopper'), ('anime', 'trendy'), ('anime', 'bifold'), ('anime', 'casual'), ('anime', 'leather'), ('anime', 'wallet'), ('zone', 'one'), ('zone', 'piece'), ('zone', 'anime'), ('zone', 'lovely'), ('zone', 'tony'), ('zone', 'tony'), ('zone', 'chopper'), ('zone', 'trendy'), ('zone', 'bifold'), ('zone', 'casual'), ('zone', 'leather'), ('zone', 'wallet'), ('one', 'piece'), ('one', 'anime'), ('one', 'lovely'), ('one', 'tony'), ('one', 'tony'), ('one', 'chopper'), ('one', 'trendy'), ('one', 'bifold'), ('one', 'casual'), ('one', 'leather'), ('one', 'wallet'), ('piece', 'anime'), ('piece', 'lovely'), ('piece', 'tony'), ('piece', 'tony'), ('piece', 'chopper'), ('piece', 'trendy'), ('piece', 'bifold'), ('piece', 'casual'), ('

In [4]:
t2 = remove_non_alphanumeric_chars("Jusian AME2958 Women's Push Up Boned Corset Bustier Corselet Black").lower()
print(word_tokenize(t2))
print(list(combinations(word_tokenize(t2),2)))
print(["{:.4f}".format(i) for i in get_all_similarities(t2)])
print("{:4.3f}".format(feat_eng__mean_distance(t2)))
print("{:4.3f}".format(feat_eng__pct_max_distance_count(t2)))
print("{:4.3f}".format(feat_eng__pct_min_distance_count(t2)))
print("{:4.3f}".format(feat_eng__pct_full_string_count(t2)))
print("{:4.3f}".format(feat_eng__sum_distances(t2)))


['jusian', 'ame2958', 'womens', 'push', 'up', 'boned', 'corset', 'bustier', 'corselet', 'black']
[('jusian', 'ame2958'), ('jusian', 'womens'), ('jusian', 'push'), ('jusian', 'up'), ('jusian', 'boned'), ('jusian', 'corset'), ('jusian', 'bustier'), ('jusian', 'corselet'), ('jusian', 'black'), ('ame2958', 'womens'), ('ame2958', 'push'), ('ame2958', 'up'), ('ame2958', 'boned'), ('ame2958', 'corset'), ('ame2958', 'bustier'), ('ame2958', 'corselet'), ('ame2958', 'black'), ('womens', 'push'), ('womens', 'up'), ('womens', 'boned'), ('womens', 'corset'), ('womens', 'bustier'), ('womens', 'corselet'), ('womens', 'black'), ('push', 'up'), ('push', 'boned'), ('push', 'corset'), ('push', 'bustier'), ('push', 'corselet'), ('push', 'black'), ('up', 'boned'), ('up', 'corset'), ('up', 'bustier'), ('up', 'corselet'), ('up', 'black'), ('boned', 'corset'), ('boned', 'bustier'), ('boned', 'corselet'), ('boned', 'black'), ('corset', 'bustier'), ('corset', 'corselet'), ('corset', 'black'), ('bustier', 'corse

In [5]:
t3 = remove_non_alphanumeric_chars("Women clutch evening bags velvet hard holder purse bags 4 colors clutches shoulder chain evening bags for wedding bridal handbag = Size: Not Specified = Color: YM1053blue").lower()
print(word_tokenize(t3))
print(list(combinations(word_tokenize(t3),2)))
print(["{:.4f}".format(i) for i in get_all_similarities(t3)])
print("{:4.3f}".format(feat_eng__mean_distance(t3)))
print("{:4.3f}".format(feat_eng__pct_max_distance_count(t3)))
print("{:4.3f}".format(feat_eng__pct_min_distance_count(t3)))
print("{:4.3f}".format(feat_eng__pct_full_string_count(t3)))
print("{:4.3f}".format(feat_eng__sum_distances(t3)))


['women', 'clutch', 'evening', 'bags', 'velvet', 'hard', 'holder', 'purse', 'bags', '4', 'colors', 'clutches', 'shoulder', 'chain', 'evening', 'bags', 'for', 'wedding', 'bridal', 'handbag', 'size', 'not', 'specified', 'color', 'ym1053blue']
[('women', 'clutch'), ('women', 'evening'), ('women', 'bags'), ('women', 'velvet'), ('women', 'hard'), ('women', 'holder'), ('women', 'purse'), ('women', 'bags'), ('women', '4'), ('women', 'colors'), ('women', 'clutches'), ('women', 'shoulder'), ('women', 'chain'), ('women', 'evening'), ('women', 'bags'), ('women', 'for'), ('women', 'wedding'), ('women', 'bridal'), ('women', 'handbag'), ('women', 'size'), ('women', 'not'), ('women', 'specified'), ('women', 'color'), ('women', 'ym1053blue'), ('clutch', 'evening'), ('clutch', 'bags'), ('clutch', 'velvet'), ('clutch', 'hard'), ('clutch', 'holder'), ('clutch', 'purse'), ('clutch', 'bags'), ('clutch', '4'), ('clutch', 'colors'), ('clutch', 'clutches'), ('clutch', 'shoulder'), ('clutch', 'chain'), ('clutc

In [6]:
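titles = df['title'].head(50)
# Unlike the t1-t3 examples above, the titles are not lowercased here,
# so the similarity scores below are case-sensitive.
titles_AC = titles.apply(remove_non_alphanumeric_chars)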
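
The cleaning step deletes punctuation outright, which fuses the neighbouring characters into one token (`100ml/3.3oz` becomes `100ml33oz`). The sketch below reproduces that behaviour with the same pattern and contrasts it with a hypothetical variant, `replace_non_alphanumeric_chars`, which substitutes a space instead. The variant is not part of the original submission and would change every feature value; it is shown only to make the tokenization consequence visible.

```python
import re

def remove_non_alphanumeric_chars(x):
    """Same pattern as the notebook: drop everything except ASCII letters, digits, whitespace."""
    return re.sub(r'[^\s0-9a-zA-Z]', '', x)

def replace_non_alphanumeric_chars(x):
    """Hypothetical variant: turn each run of such characters into a space, then tidy whitespace."""
    return ' '.join(re.sub(r'[^\s0-9a-zA-Z]+', ' ', x).split())

title = "Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz"
print(remove_non_alphanumeric_chars(title))   # ends in "100ml33oz"
print(replace_non_alphanumeric_chars(title))  # ends in "100ml 3 3oz"
```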

In [7]:
f17, f18, f19, f20, f21 = [titles_AC.apply(feat_eng__mean_distance),
                           titles_AC.apply(feat_eng__pct_max_distance_count),
                           titles_AC.apply(feat_eng__pct_min_distance_count),
                           titles_AC.apply(feat_eng__pct_full_string_count),
                           titles_AC.apply(feat_eng__sum_distances),]
for (a, b, c, d, e) in zip(f17, f18, f19, f20, f21):
    print("{:.3f},{:.3f},{:.3f},{:.4f},{:7.2f}".format(a,b,c,d,e))

0.262,0.000,0.476,0.0000,   5.50
0.249,0.000,0.476,0.0000,   5.23
0.282,0.012,0.456,0.0000,  48.15
0.268,0.036,0.500,0.0000,   7.51
0.228,0.028,0.556,0.0000,   8.22
0.123,0.008,0.758,0.0000,  14.73
0.204,0.036,0.636,0.0182,  11.21
0.240,0.000,0.467,0.0000,   3.60
0.163,0.000,0.667,0.0000,   5.86
0.282,0.000,0.417,0.0000,  10.15
0.346,0.000,0.364,0.0000,  19.00
0.308,0.000,0.400,0.0000,   3.08
0.342,0.000,0.333,0.0000,  12.32
0.326,0.018,0.327,0.0000,  17.93
0.334,0.000,0.300,0.0000,   3.34
0.268,0.000,0.455,0.0000,  14.73
0.363,0.000,0.253,0.0000,  33.06
0.187,0.000,0.637,0.0000,  17.02
0.268,0.005,0.468,0.0000,  50.87
0.164,0.000,0.689,0.0000,   7.37
0.195,0.000,0.619,0.0000,   4.10
0.145,0.000,0.722,0.0000,   5.21
0.258,0.000,0.491,0.0000,  14.20
0.134,0.013,0.744,0.0128,  10.47
0.234,0.000,0.500,0.0000,   1.41
0.204,0.025,0.625,0.0250,  24.54
0.163,0.000,0.667,0.0000,   5.87
0.280,0.000,0.400,0.0000,   2.80
0.282,0.026,0.479,0.0158,  53.49
0.233,0.000,0.550,0.0000,  53.91
0.349,0.02
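
Each of the five `feat_eng__*` helpers recomputes all pairwise similarities, so every title is scored five times. A minimal sketch of a single-pass alternative follows. The name `title_similarity_features`, the `sim` parameter and the choice to return zeros for titles with fewer than two tokens are all assumptions made for illustration. In the notebook, `sim` would be `distance.jaro_winkler_similarity` and `tokens` the output of `word_tokenize`; the toy exact-match scorer below only keeps the example free of dependencies.

```python
from itertools import combinations

def title_similarity_features(tokens, sim, hi=0.85, lo=0.10):
    """Compute f17-f21 from one list of pairwise similarity scores."""
    scores = [sim(a, b) for a, b in combinations(tokens, 2)]
    n = len(scores)
    if n == 0:
        # Assumption: a title with fewer than two tokens gets all-zero features
        return {"f17": 0.0, "f18": 0.0, "f19": 0.0, "f20": 0.0, "f21": 0.0}
    total = sum(scores)
    return {
        "f17": total / n,                          # mean similarity
        "f18": sum(s >= hi for s in scores) / n,   # share of highly similar pairs
        "f19": sum(s <= lo for s in scores) / n,   # share of dissimilar pairs
        "f20": sum(s == 1.0 for s in scores) / n,  # share of identical pairs
        "f21": total,                              # sum of similarities
    }

# Toy scorer (hypothetical stand-in for Jaro-Winkler): 1.0 for identical words, else 0.0
exact_match = lambda a, b: 1.0 if a == b else 0.0

print(title_similarity_features(["tony", "tony", "chopper"], exact_match))
print(title_similarity_features(["wallet"], exact_match))
```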

In [8]:
titles_feats = pd.DataFrame({'titles' : titles, 
                             'titles_AC' : titles_AC, 
                             'f17' : f17, 'f18' : f18,
                             'f19' : f19, 'f20' : f20, 'f21' : f21,})

display(titles_feats[['titles','titles_AC','f17','f18','f19','f20','f21',]])

Unnamed: 0,titles,titles_AC,f17,f18,f19,f20,f21
0,Adana Gallery Suri Square Hijab – Light Pink,Adana Gallery Suri Square Hijab Light Pink,0.261735,0.0,0.47619,0.0,5.496429
1,Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz,Cuba Heartbreaker Eau De Parfum Spray 100ml33oz,0.249118,0.0,0.47619,0.0,5.231481
2,Andoer 150cm Cellphone Smartphone Mini Dual-Headed Omni-Directional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,Andoer 150cm Cellphone Smartphone Mini DualHeaded OmniDirectional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,0.281571,0.011696,0.45614,0.0,48.14867
3,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520ml + Conditioner 250ml),ANMYNA Complaint Silky Set Shampoo 520ml Conditioner 250ml,0.268068,0.035714,0.5,0.0,7.505916
4,Argital Argiltubo Green Clay For Face and Body 250ml,Argital Argiltubo Green Clay For Face and Body 250ml,0.228355,0.027778,0.555556,0.0,8.220767
5,Asus TP300LJ-DW004H Transformer Book Flip 4GB Intel Core i5 13 Inch + Free 2Year Ezi Care Warranty,Asus TP300LJDW004H Transformer Book Flip 4GB Intel Core i5 13 Inch Free 2Year Ezi Care Warranty,0.122789,0.008333,0.758333,0.0,14.734678
6,NG-40C Ring-Shaped 40W 3166lm 5400K Macro Photography Light Circle Ring Light,NG40C RingShaped 40W 3166lm 5400K Macro Photography Light Circle Ring Light,0.203815,0.036364,0.636364,0.018182,11.209798
7,Buytra Exfoliating Peel Foot Mask 1Pair,Buytra Exfoliating Peel Foot Mask 1Pair,0.240067,0.0,0.466667,0.0,3.60101
8,CLiPtec OCC121 Slim Flat USB 3.0 Extension Cable 1.5m,CLiPtec OCC121 Slim Flat USB 30 Extension Cable 15m,0.16274,0.0,0.666667,0.0,5.858624
9,McDonald's Coke Can Glass Limited Edition 12oz Purple Color,McDonalds Coke Can Glass Limited Edition 12oz Purple Color,0.282,0.0,0.416667,0.0,10.151984
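
Because $f_{17}$ is the mean and $f_{21}$ the sum over the same word pairs, $f_{21} = f_{17} \times n(n-1)/2$ for a title with $n$ tokens, so $f_{21}$ grows with title length as well as with similarity. The check below uses two rows copied from the table above. The helper name `pair_count` is introduced for illustration, and it assumes that whitespace splitting yields the same token count as `word_tokenize` for these cleaned titles.

```python
from math import comb, isclose

def pair_count(title):
    """Number of unordered word pairs in a whitespace-tokenized title."""
    return comb(len(title.split()), 2)

# (titles_AC, f17, f21) copied from rows 0 and 7 of the table
rows = [
    ("Adana Gallery Suri Square Hijab Light Pink", 0.261735, 5.496429),
    ("Buytra Exfoliating Peel Foot Mask 1Pair", 0.240067, 3.60101),
]

for title, f17, f21 in rows:
    n_pairs = pair_count(title)
    # Tolerance absorbs the rounding of the displayed values
    print(n_pairs, isclose(f17 * n_pairs, f21, abs_tol=1e-4))
```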


**References**

- [GitHub / minhcp](https://github.com/minhcp/CIKMCup17) for the dataset

- [CIKM AnalytiCup 2017 – Lazada Product Title Quality Challenge](https://arxiv.org/pdf/1804.01000.pdf) for the feature engineering techniques