<center><h1>Feature Engineering of Lazada CIKM Text</h1></center>
<center><h2>A Text Mining Exercise</h2></center>
<center><h2>Part 1</h2></center>

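In the CIKM submission, features $f_7$ to $f_{16}$ were derived from the product title. Features $f_7$ to $f_{10}$ are computed on titles stripped of non-alphanumeric characters, $f_{11}$ to $f_{14}$ on the raw titles, $f_{15}$ flags whether the title contains a digit, and $f_{16}$ is the fraction of non-alphanumeric characters.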

In [1]:
import re
import pandas as pd
pd.set_option('display.max_colwidth', None)  # show full cell contents; recent pandas rejects -1

import lzd_utils
df = lzd_utils.read_lazada_csv()
df.head()

Unnamed: 0,country,sku_id,title,category_lvl_1,category_lvl_2,category_lvl_3,desc,price,xb
0,my,AD674FAASTLXANMY,Adana Gallery Suri Square Hijab – Light Pink,Fashion,Women,Muslim Wear,<ul><li>Material : Non sheer shimmer chiffon</li><li>Sizes : 52 x 52 inches OR 56 x 56 inches</li><li>Cut with curved ends</li></ul>,49.0,local
1,my,AE068HBAA3RPRDANMY,Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz,Health & Beauty,Bath & Body,Hand & Foot Care,"Formulated with oil-free hydrating botanicals/ Remarkably improves skin texture of abused hands/Restores soft, smooth & refined hands",128.0,international
2,my,AN680ELAA9VN57ANMY,Andoer 150cm Cellphone Smartphone Mini Dual-Headed Omni-Directional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,"TV, Audio / Video, Gaming & Wearables",Audio,Live Sound & Stage,"<ul> <li>150cm mini microphone compatible for iPhone, various smartphones, and also for iPad/ Apple computer/ Macbook.</li> <li>Dual-headed design, allows for two people using simultaneously.</li> <li>Features high sensitivity &amp; omni-directional sounds output, perfect for audio and video recording.</li> <li>3.5mm standard connector jack.</li> <li>Convenient clip-on design, can clip it on your collar.</li> <li>3.5mm standard connector jack. Convenient clip-on design, can clip it on your collar.</li> </ul>",25.07,international
3,my,AN957HBAAAHDF4ANMY,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520ml + Conditioner 250ml),Health & Beauty,Hair Care,Shampoos & Conditioners,<ul> <li>ANMYNA Complaint Silky Set (Shampoo 520ml + Conditioner 250ml)</li> <li>Deep nourish</li> <li>Repair damaged hair</li> <li>Protect the scalp and prevent hair loss</li> </ul>,118.0,local
4,my,AR511HBAXNWAANMY,Argital Argiltubo Green Clay For Face and Body 250ml,Health & Beauty,Men's Care,Body and Skin Care,<ul> <li>100% Authentic</li> <li>Rrefresh and brighten skin</li> <li>Anti-wrinkle and deep cleansing effects</li> </ul>,114.8,international


### Title

In [2]:
def remove_non_alphanumeric_chars(x):
    """returns x keeping only ASCII letters, digits and whitespace"""
    return re.sub(r'[^\s0-9a-zA-Z]', '', x)

def feat_eng__num_words(x):
    """no. of terms"""
    return len(x.split())

def feat_eng__max_length(x):
    """max length of all terms"""
    return max([len(t) for t in x.split()])

def feat_eng__min_length(x):
    """min length of all terms"""
    return min([len(t) for t in x.split()])

def feat_eng__mean_length(x):
    """mean length of all terms"""
    terms = [len(t) for t in x.split()]
    return sum(terms)/len(terms)

def feat_eng__contains_digit(x):
    """1 if x contains a digit, 0 otherwise"""
    if re.search(r'\d', x) is not None:
        return 1
    return 0

def feat_eng__pct_non_alphanumeric(x):
    """fraction (0 to 1) of non-whitespace characters that are not alphanumeric"""
    ac = remove_non_alphanumeric_chars(x)
    remove_spaces = lambda s: re.sub(r'\s', '', s)
    # bc = before cleaning, ac = after cleaning
    bc_chars, ac_chars = remove_spaces(x), remove_spaces(ac)
    return 1 - (len(ac_chars) / len(bc_chars))
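
The length helpers above call `max` and `min` on an empty list, or divide by zero, when a title is empty or contains only whitespace, so they raise on such rows. A minimal sketch of guarded variants follows. The `safe_*` names and the choice of 0 as the fallback value are assumptions for illustration, not part of the original notebook.

```python
def term_lengths(x):
    """lengths of the whitespace-separated terms in x"""
    return [len(t) for t in x.split()]

def safe_max_length(x):
    """max term length, or 0 when x has no terms (assumed fallback)"""
    return max(term_lengths(x), default=0)

def safe_min_length(x):
    """min term length, or 0 when x has no terms (assumed fallback)"""
    return min(term_lengths(x), default=0)

def safe_mean_length(x):
    """mean term length, or 0.0 when x has no terms (assumed fallback)"""
    lengths = term_lengths(x)
    return sum(lengths) / len(lengths) if lengths else 0.0

# Whitespace-only and empty titles no longer raise
print(safe_max_length('   '), safe_min_length(''), safe_mean_length(''))
```

Whether 0 is a sensible fallback depends on the downstream model; a missing value such as `float('nan')` may be preferable if empty titles should be treated separately.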

In [3]:
# Testing each function
t = 'McDonald\'s Coke Can Glass Limited Edition 12oz Purple Color  '

print(t)
print(feat_eng__num_words(t))
print(feat_eng__max_length(t))
print(feat_eng__min_length(t))
print(feat_eng__mean_length(t))
print(remove_non_alphanumeric_chars(t))
print(feat_eng__pct_non_alphanumeric(t))
print(feat_eng__contains_digit(t))

McDonald's Coke Can Glass Limited Edition 12oz Purple Color  
9
10
3
5.666666666666667
McDonalds Coke Can Glass Limited Edition 12oz Purple Color  
0.019607843137254943
1
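
The printed values can be checked by hand: the test title has nine terms whose lengths sum to 51, and its only non-alphanumeric, non-space character is the apostrophe. A short sketch that recomputes these figures without the helper functions:

```python
t = "McDonald's Coke Can Glass Limited Edition 12oz Purple Color  "

# split() ignores the trailing spaces, so there are nine terms
lengths = [len(term) for term in t.split()]

# characters that are not whitespace
non_space = sum(lengths)

# non-whitespace characters that are not ASCII letters or digits
non_alnum = sum(1 for ch in t
                if not ch.isspace() and not (ch.isascii() and ch.isalnum()))

print(lengths)                # [10, 4, 3, 5, 7, 7, 4, 6, 5]
print(non_space, non_alnum)   # 51 1
print(non_alnum / non_space)  # roughly 0.0196
```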


In [4]:
# Work with the first 50 titles only
titles = df['title'].head(50)

In [5]:
titles_AC = titles.apply(remove_non_alphanumeric_chars)
for (t,u) in zip(titles,titles_AC):
    print(t)
    print(u)
    print()

Adana Gallery Suri Square Hijab – Light Pink
Adana Gallery Suri Square Hijab  Light Pink

Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz
Cuba Heartbreaker Eau De Parfum Spray 100ml33oz

Andoer 150cm Cellphone Smartphone Mini Dual-Headed Omni-Directional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones
Andoer 150cm Cellphone Smartphone Mini DualHeaded OmniDirectional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones

ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520ml + Conditioner 250ml)
ANMYNA Complaint Silky Set  Shampoo 520ml  Conditioner 250ml

Argital Argiltubo Green Clay For Face and Body 250ml
Argital Argiltubo Green Clay For Face and Body 250ml

Asus TP300LJ-DW004H Transformer Book Flip 4GB Intel Core i5 13 Inch + Free 2Year Ezi Care Warranty
Asus TP300LJDW004H Transformer Book Flip 4GB Intel Core i5 13 Inch  Free 2Year Ezi Care Warranty

NG-40C Ring-Shaped 40W 3166lm 5400K Macro Photography Light Circle Ring Light
NG40C RingShaped 40W 3166lm 5400K Macro Photography Light Circle Ring Light

In [6]:
f7, f8, f9, f10 = [titles_AC.apply(feat_eng__num_words),
                  titles_AC.apply(feat_eng__max_length),
                  titles_AC.apply(feat_eng__min_length),
                  titles_AC.apply(feat_eng__mean_length)]
for (a, b, c, d) in zip(f7, f8, f9, f10):
    print("{:3d},{:3d},{:2d},{:5.2f}".format(a,b,c,d))

  7,  7, 4, 5.14
  7, 12, 2, 5.86
 19, 15, 1, 6.21
  8, 11, 3, 6.38
  9,  9, 3, 4.89
 16, 13, 2, 5.00
 11, 11, 3, 5.91
  6, 11, 4, 5.67
  9,  9, 2, 4.78
  9,  9, 3, 5.56
 11, 10, 3, 6.64
  5, 13, 3, 7.60
  9, 10, 4, 5.89
 11, 10, 3, 7.27
  5,  9, 3, 6.60
 11,  8, 2, 4.82
 14, 10, 4, 6.07
 14, 10, 2, 4.93
 20,  9, 3, 5.30
 10,  7, 2, 4.10
  7, 10, 3, 5.71
  9, 15, 2, 5.33
 11, 10, 3, 6.09
 13,  8, 1, 4.46
  4,  7, 5, 6.25
 16, 11, 2, 5.00
  9,  9, 2, 5.00
  5,  9, 5, 6.80
 20, 10, 1, 4.65
 22,  8, 1, 4.73
 10,  8, 3, 6.00
  5, 11, 3, 7.80
 16, 10, 1, 5.25
 10,  8, 4, 6.20
 14, 10, 3, 5.57
 12,  9, 2, 4.58
 11, 11, 3, 6.00
 12,  9, 2, 5.75
 28,  8, 2, 4.21
 12,  7, 2, 4.42
 11,  8, 1, 4.64
 21,  7, 3, 5.19
 13, 10, 4, 5.69
  2,  5, 4, 4.50
 13,  6, 2, 4.54
 11,  7, 2, 5.09
 20, 10, 4, 6.00
  9, 10, 4, 6.00
  4, 10, 6, 8.00
  8,  9, 1, 4.88


In [7]:
f11, f12, f13, f14 = [titles.apply(feat_eng__num_words),
                      titles.apply(feat_eng__max_length),
                      titles.apply(feat_eng__min_length),
                      titles.apply(feat_eng__mean_length)]
for (a, b, c, d) in zip(f11, f12, f13, f14):
    print("{:3d},{:3d},{:2d},{:5.2f}".format(a,b,c,d))

  8,  7, 1, 4.62
  7, 12, 2, 6.14
 19, 16, 1, 6.32
 10, 11, 1, 6.00
  9,  9, 3, 4.89
 17, 14, 1, 4.82
 11, 11, 3, 6.09
  6, 11, 4, 5.67
  9,  9, 3, 5.00
  9, 10, 3, 5.67
 11, 12, 3, 7.18
  5, 15, 3, 8.00
  9, 10, 4, 5.89
 11, 11, 3, 7.64
  5,  9, 5, 7.00
 11,  8, 2, 4.91
 14, 10, 4, 6.21
 14, 12, 2, 5.07
 20,  9, 3, 5.40
 10,  7, 2, 4.30
  7, 10, 3, 5.71
  9, 17, 3, 5.89
 12, 10, 1, 5.67
 13,  8, 1, 4.69
  4,  7, 5, 6.25
 17, 11, 1, 4.88
  9,  9, 2, 5.00
  5, 10, 6, 7.40
 21, 10, 1, 4.57
 23,  8, 1, 4.61
 10,  8, 3, 6.20
  5, 11, 3, 7.80
 16, 10, 1, 5.44
 10,  8, 4, 6.30
 14, 12, 3, 5.71
 12,  9, 2, 4.83
 11, 11, 3, 6.00
 12,  9, 3, 6.00
 29,  8, 1, 4.24
 12,  7, 2, 4.42
 11,  8, 1, 5.00
 21,  7, 3, 5.19
 15, 10, 1, 5.07
  2,  5, 4, 4.50
 13,  8, 2, 4.69
 11,  7, 2, 5.36
 20, 10, 4, 6.05
  9, 10, 4, 6.00
  4, 10, 6, 8.00
  8,  9, 1, 4.88
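
Comparing the raw-title output with the cleaned-title output before it shows what the cleaning step changes: punctuation that stands alone as a term, such as the dash in the first title, disappears, which lowers the term count and raises the minimum length. A small sketch on that first title, re-declaring the cleaning pattern so the block runs on its own (the `summarise` helper is a name introduced here for illustration):

```python
import re

title = 'Adana Gallery Suri Square Hijab – Light Pink'
# same pattern as remove_non_alphanumeric_chars
cleaned = re.sub(r'[^\s0-9a-zA-Z]', '', title)

def summarise(x):
    """(no. of terms, max length, min length, mean length)"""
    lengths = [len(t) for t in x.split()]
    return len(lengths), max(lengths), min(lengths), sum(lengths) / len(lengths)

print(summarise(title))    # raw: the dash counts as a one-character term
print(summarise(cleaned))  # cleaned: the dash is gone
```

The two tuples correspond to the first rows of the two outputs: 8, 7, 1, 4.62 for the raw title and 7, 7, 4, 5.14 for the cleaned one.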


In [8]:
f15 = titles.apply(feat_eng__contains_digit)
print(f15.tolist())

[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1]


In [9]:
f16 = titles.apply(feat_eng__pct_non_alphanumeric)
print(["{:.3f}".format(i) for i in f16.tolist()])

['0.027', '0.047', '0.017', '0.150', '0.000', '0.024', '0.030', '0.000', '0.044', '0.020', '0.076', '0.050', '0.000', '0.048', '0.057', '0.019', '0.023', '0.028', '0.019', '0.047', '0.000', '0.094', '0.015', '0.049', '0.000', '0.036', '0.000', '0.081', '0.031', '0.019', '0.032', '0.000', '0.034', '0.016', '0.025', '0.052', '0.000', '0.042', '0.041', '0.000', '0.073', '0.000', '0.026', '0.000', '0.033', '0.051', '0.008', '0.000', '0.000', '0.000']
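
For larger samples, the same quantities can usually be computed with pandas' vectorised string methods instead of `apply`. The sketch below assumes that approach and uses a two-title stand-in Series rather than the real dataset; the variable names are illustrative.

```python
import pandas as pd

sample = pd.Series(["McDonald's Coke Can Glass",
                    'Cuba Heartbreaker 100ml/3.3oz'])

num_words = sample.str.split().str.len()            # terms per title
has_digit = sample.str.contains(r'\d').astype(int)  # 1 if any digit, else 0
non_alnum = sample.str.count(r'[^\s0-9a-zA-Z]')     # punctuation and non-ASCII characters
non_space = sample.str.count(r'\S')                 # all non-whitespace characters
pct_non_alnum = non_alnum / non_space               # counterpart of f16

print(pd.DataFrame({'num_words': num_words,
                    'has_digit': has_digit,
                    'pct_non_alnum': pct_non_alnum}))
```

Counting the non-alphanumeric characters directly gives the same ratio as `feat_eng__pct_non_alphanumeric` up to floating-point rounding, since that function computes one minus the share of characters that survive cleaning.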


In [10]:
titles_feats = pd.DataFrame({'titles' : titles, 
                             'titles_AC' : titles_AC, 
                             'f7' : f7, 'f8' : f8,
                             'f9' : f9, 'f10' : f10,
                             'f11' : f11, 'f12' : f12, 
                             'f13' : f13, 'f14' : f14, 'f15' : f15, 'f16' : f16})

display(titles_feats[['titles','titles_AC','f7','f8','f9','f10','f11','f12','f13','f14','f15','f16']])

Unnamed: 0,titles,titles_AC,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16
0,Adana Gallery Suri Square Hijab – Light Pink,Adana Gallery Suri Square Hijab Light Pink,7,7,4,5.142857,8,7,1,4.625,0,0.027027
1,Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz,Cuba Heartbreaker Eau De Parfum Spray 100ml33oz,7,12,2,5.857143,7,12,2,6.142857,1,0.046512
2,Andoer 150cm Cellphone Smartphone Mini Dual-Headed Omni-Directional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,Andoer 150cm Cellphone Smartphone Mini DualHeaded OmniDirectional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,19,15,1,6.210526,19,16,1,6.315789,1,0.016667
3,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520ml + Conditioner 250ml),ANMYNA Complaint Silky Set Shampoo 520ml Conditioner 250ml,8,11,3,6.375,10,11,1,6.0,1,0.15
4,Argital Argiltubo Green Clay For Face and Body 250ml,Argital Argiltubo Green Clay For Face and Body 250ml,9,9,3,4.888889,9,9,3,4.888889,1,0.0
5,Asus TP300LJ-DW004H Transformer Book Flip 4GB Intel Core i5 13 Inch + Free 2Year Ezi Care Warranty,Asus TP300LJDW004H Transformer Book Flip 4GB Intel Core i5 13 Inch Free 2Year Ezi Care Warranty,16,13,2,5.0,17,14,1,4.823529,1,0.02439
6,NG-40C Ring-Shaped 40W 3166lm 5400K Macro Photography Light Circle Ring Light,NG40C RingShaped 40W 3166lm 5400K Macro Photography Light Circle Ring Light,11,11,3,5.909091,11,11,3,6.090909,1,0.029851
7,Buytra Exfoliating Peel Foot Mask 1Pair,Buytra Exfoliating Peel Foot Mask 1Pair,6,11,4,5.666667,6,11,4,5.666667,1,0.0
8,CLiPtec OCC121 Slim Flat USB 3.0 Extension Cable 1.5m,CLiPtec OCC121 Slim Flat USB 30 Extension Cable 15m,9,9,2,4.777778,9,9,3,5.0,1,0.044444
9,McDonald's Coke Can Glass Limited Edition 12oz Purple Color,McDonalds Coke Can Glass Limited Edition 12oz Purple Color,9,9,3,5.555556,9,10,3,5.666667,1,0.019608
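
The feature table above is assembled from ten separately named Series. One alternative is to drive the construction from a mapping of column names to functions, which keeps the feature numbering in a single place. A minimal sketch, assuming simplified stand-ins for the helper functions and a two-title sample; the `FEATURES` mapping is hypothetical and not part of the original notebook.

```python
import re
import pandas as pd

def clean(x):
    """keep only ASCII letters, digits and whitespace"""
    return re.sub(r'[^\s0-9a-zA-Z]', '', x)

def num_words(x):
    return len(x.split())

def max_length(x):
    return max(len(t) for t in x.split())

# column name -> (function, whether it is applied to the cleaned title)
FEATURES = {
    'f7': (num_words, True),
    'f8': (max_length, True),
    'f11': (num_words, False),
    'f12': (max_length, False),
}

titles = pd.Series(['Adana Gallery Suri Square Hijab – Light Pink',
                    'NG-40C Ring-Shaped 40W 3166lm 5400K'])
titles_AC = titles.apply(clean)

# Build every feature column from the mapping in one pass
feats = pd.DataFrame({name: (titles_AC if use_clean else titles).apply(fn)
                      for name, (fn, use_clean) in FEATURES.items()})
print(feats)
```

Adding a feature then means adding one entry to the mapping rather than a new variable and a new DataFrame column.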


**References**

- [GitHub / minhcp](https://github.com/minhcp/CIKMCup17) for the dataset

- [CIKM AnalytiCup 2017 – Lazada Product Title Quality Challenge](https://arxiv.org/pdf/1804.01000.pdf) for the feature engineering techniques