<center><h1>Feature Engineering of Lazada CIKM Text</h1></center>
<center><h2>A Text Mining Exercise</h2></center>
<center><h2>Part 3</h2></center>

In [1]:
import re
import pandas as pd
from nltk import ngrams
from nltk.tokenize import word_tokenize
pd.set_option('display.max_colwidth', None)  # show full cell text; -1 is deprecated since pandas 1.0

import lzd_utils
df = lzd_utils.read_lazada_csv()
df.head()

Unnamed: 0,country,sku_id,title,category_lvl_1,category_lvl_2,category_lvl_3,desc,price,xb
0,my,AD674FAASTLXANMY,Adana Gallery Suri Square Hijab – Light Pink,Fashion,Women,Muslim Wear,<ul><li>Material : Non sheer shimmer chiffon</li><li>Sizes : 52 x 52 inches OR 56 x 56 inches</li><li>Cut with curved ends</li></ul>,49.0,local
1,my,AE068HBAA3RPRDANMY,Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz,Health & Beauty,Bath & Body,Hand & Foot Care,"Formulated with oil-free hydrating botanicals/ Remarkably improves skin texture of abused hands/Restores soft, smooth & refined hands",128.0,international
2,my,AN680ELAA9VN57ANMY,Andoer 150cm Cellphone Smartphone Mini Dual-Headed Omni-Directional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,"TV, Audio / Video, Gaming & Wearables",Audio,Live Sound & Stage,"<ul> <li>150cm mini microphone compatible for iPhone, various smartphones, and also for iPad/ Apple computer/ Macbook.</li> <li>Dual-headed design, allows for two people using simultaneously.</li> <li>Features high sensitivity &amp; omni-directional sounds output, perfect for audio and video recording.</li> <li>3.5mm standard connector jack.</li> <li>Convenient clip-on design, can clip it on your collar.</li> <li>3.5mm standard connector jack. Convenient clip-on design, can clip it on your collar.</li> </ul>",25.07,international
3,my,AN957HBAAAHDF4ANMY,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520ml + Conditioner 250ml),Health & Beauty,Hair Care,Shampoos & Conditioners,<ul> <li>ANMYNA Complaint Silky Set (Shampoo 520ml + Conditioner 250ml)</li> <li>Deep nourish</li> <li>Repair damaged hair</li> <li>Protect the scalp and prevent hair loss</li> </ul>,118.0,local
4,my,AR511HBAXNWAANMY,Argital Argiltubo Green Clay For Face and Body 250ml,Health & Beauty,Men's Care,Body and Skin Care,<ul> <li>100% Authentic</li> <li>Rrefresh and brighten skin</li> <li>Anti-wrinkle and deep cleansing effects</li> </ul>,114.8,international


### Title

In [2]:
def remove_non_alphanumeric_chars(x):
    """Return x with everything except ASCII letters, digits and whitespace removed."""
    return re.sub(r'[^\s0-9a-zA-Z]', '', x)

def uniq_over_total_words(x):
    """Return the ratio of unique word tokens to total word tokens in x."""
    tokens = word_tokenize(x)
    return len(set(tokens)) / len(tokens)

def uniq_over_total_ngrams(x, min_n=3, max_n=8):
    """Return the ratio of unique to total character n-grams in x.

    x is a string, so ngrams() iterates over its characters; the n-grams of
    every length from min_n to max_n (inclusive) are pooled before counting.
    """
    all_grams = []
    for n in range(min_n, max_n + 1):
        all_grams.extend(''.join(gram) for gram in ngrams(x, n))
    return len(set(all_grams)) / len(all_grams)
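
Because the title is passed to `ngrams` as a plain string, the second feature is built from character n-grams, not word n-grams. The sketch below reproduces that idea with string slicing only, so it runs without NLTK; `char_ngrams` and `uniq_over_total_char_ngrams` are hypothetical names used for illustration and are not part of the notebook.

```python
def char_ngrams(text, n):
    """Return every contiguous substring of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def uniq_over_total_char_ngrams(text, min_n=3, max_n=8):
    """Pool character n-grams for n in [min_n, max_n] and return unique / total."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(char_ngrams(text, n))
    return len(set(grams)) / len(grams)

# A repeated word produces repeated character n-grams:
print(char_ngrams("tony tony", 3))
# ['ton', 'ony', 'ny ', 'y t', ' to', 'ton', 'ony']  -> 5 unique out of 7
```

Repetition inside a title lowers this ratio even when the repeated fragment is only part of a word, which is why it can drop below 1.0 when the word-level ratio stays at 1.0.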

In [3]:
t1 = remove_non_alphanumeric_chars("ANIME ZONE One Piece Anime Lovely Tony Tony Chopper Trendy Bifold Casual Leather Wallet").lower()
print(uniq_over_total_words(t1))
print(uniq_over_total_ngrams(t1))

0.8571428571428571
0.9414141414141414
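
The first value can be checked by hand: the cleaned title has 14 words, of which "anime" and "tony" each appear twice, leaving 12 unique words. A minimal sketch of that check follows; it assumes `str.split` is an acceptable stand-in for `word_tokenize` on this lowercase, alphanumeric-only title.

```python
from collections import Counter

title = ("anime zone one piece anime lovely tony tony chopper "
         "trendy bifold casual leather wallet")

tokens = title.split()   # whitespace split, standing in for word_tokenize (assumption)
counts = Counter(tokens)

# Words that occur more than once are what pull the ratio below 1.0
repeated = {word: c for word, c in counts.items() if c > 1}
ratio = len(counts) / len(tokens)

print(len(tokens), len(counts))  # 14 12
print(repeated)                  # {'anime': 2, 'tony': 2}
print(ratio)                     # 12 / 14
```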


In [4]:
t2 = remove_non_alphanumeric_chars("Jusian AME2958 Women's Push Up Boned Corset Bustier Corselet Black").lower()
print(uniq_over_total_words(t2))
print(uniq_over_total_ngrams(t2))


1.0
0.9641873278236914


In [5]:
t3 = remove_non_alphanumeric_chars("Women clutch evening bags velvet hard holder purse bags 4 colors clutches shoulder chain evening bags for wedding bridal handbag = Size: Not Specified = Color: YM1053blue").lower()
print(uniq_over_total_words(t3))
print(uniq_over_total_ngrams(t3))

0.88
0.8895768833849329


In [6]:
titles = df['title'].head(50)
titles_AC = titles.apply(remove_non_alphanumeric_chars)
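
Unlike the single-title examples above, this cell does not lowercase the titles, so words that differ only in case count as distinct. The sketch below illustrates the effect on the word ratio; the title is made up for illustration, and `word_ratio` is a hypothetical helper that uses `str.split` in place of `word_tokenize`.

```python
def word_ratio(text):
    """Unique whitespace-separated tokens divided by total tokens."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens)

title = "Ring Light ring light 40W"  # made-up title, not from the dataset

print(word_ratio(title))          # 1.0 (all five tokens differ by case)
print(word_ratio(title.lower()))  # 0.6 (3 unique tokens out of 5)
```

Whether to lowercase first depends on whether case differences should count as repetition for the downstream model.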

In [7]:
feat_a = titles_AC.apply(uniq_over_total_words)   # unique / total words
feat_b = titles_AC.apply(uniq_over_total_ngrams)  # unique / total character n-grams
for a, b in zip(feat_a, feat_b):
    print("{:.3f},{:.3f}".format(a, b))

1.000,1.000
1.000,1.000
1.000,0.915
1.000,0.994
1.000,0.989
1.000,0.995
0.909,0.962
1.000,1.000
1.000,1.000
1.000,0.994
1.000,0.972
1.000,0.996
1.000,0.997
1.000,0.951
1.000,0.985
1.000,0.997
1.000,0.996
1.000,1.000
1.000,0.983
1.000,1.000
1.000,1.000
1.000,1.000
1.000,0.975
0.923,0.901
1.000,1.000
0.812,0.863
1.000,1.000
1.000,1.000
0.900,0.943
1.000,0.993
0.900,0.961
1.000,0.996
1.000,0.959
1.000,0.975
1.000,1.000
0.917,0.981
1.000,0.986
1.000,0.998
1.000,0.995
1.000,0.997
1.000,0.982
0.810,0.908
1.000,0.974
1.000,1.000
1.000,0.990
1.000,0.995
1.000,0.971
1.000,1.000
1.000,1.000
1.000,0.984


In [8]:
titles_feats = pd.DataFrame({'titles' : titles, 
                             'titles_AC' : titles_AC, 
                             'feat_a' : feat_a, 'feat_b' : feat_b,})

display(titles_feats[['titles','titles_AC','feat_a','feat_b',]])

Unnamed: 0,titles,titles_AC,feat_a,feat_b
0,Adana Gallery Suri Square Hijab – Light Pink,Adana Gallery Suri Square Hijab Light Pink,1.0,1.0
1,Cuba Heartbreaker Eau De Parfum Spray 100ml/3.3oz,Cuba Heartbreaker Eau De Parfum Spray 100ml33oz,1.0,1.0
2,Andoer 150cm Cellphone Smartphone Mini Dual-Headed Omni-Directional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,Andoer 150cm Cellphone Smartphone Mini DualHeaded OmniDirectional Mic Microphone with Collar Clip for iPad iPhone5 6s 6 Plus Smartphones,1.0,0.915082
3,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 520ml + Conditioner 250ml),ANMYNA Complaint Silky Set Shampoo 520ml Conditioner 250ml,1.0,0.993994
4,Argital Argiltubo Green Clay For Face and Body 250ml,Argital Argiltubo Green Clay For Face and Body 250ml,1.0,0.989474
5,Asus TP300LJ-DW004H Transformer Book Flip 4GB Intel Core i5 13 Inch + Free 2Year Ezi Care Warranty,Asus TP300LJDW004H Transformer Book Flip 4GB Intel Core i5 13 Inch Free 2Year Ezi Care Warranty,1.0,0.994536
6,NG-40C Ring-Shaped 40W 3166lm 5400K Macro Photography Light Circle Ring Light,NG40C RingShaped 40W 3166lm 5400K Macro Photography Light Circle Ring Light,0.909091,0.962175
7,Buytra Exfoliating Peel Foot Mask 1Pair,Buytra Exfoliating Peel Foot Mask 1Pair,1.0,1.0
8,CLiPtec OCC121 Slim Flat USB 3.0 Extension Cable 1.5m,CLiPtec OCC121 Slim Flat USB 30 Extension Cable 15m,1.0,1.0
9,McDonald's Coke Can Glass Limited Edition 12oz Purple Color,McDonalds Coke Can Glass Limited Edition 12oz Purple Color,1.0,0.993769


**References**

- [GitHub / minhcp](https://github.com/minhcp/CIKMCup17) for the dataset

- [CIKM AnalytiCup 2017](http://www.cikmconference.org/CIKM2017/download/analytiCup/session3/) for the feature engineering techniques