## Part II: Feature Engineering

### Feature Types
- Type 1: title to unigram word vector
- Type 2: title to bigram word vector
- Type 3: description (`short_desc`) to unigram word vector
- Type 4: description (`short_desc`) to bigram word vector

### Output
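- Unigram/bigram vocabularies: CSV files with one `word,count` row per term (the description-bigram file lists words only)
- TF-IDF feature matrices, stored as SciPy sparse matrices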

In [1]:
import pandas as pd

train_data = pd.read_csv('./training/data_train_no_html.csv', names=['country','uid','title','cat_lv1','cat_lv2','cat_lv3','short_desc','price','prod_type'], encoding='utf8')
train_data = train_data.drop(['country','uid','cat_lv1','cat_lv2','cat_lv3','price','prod_type'], axis=1)
train_data.head(5)

Unnamed: 0,title,short_desc
0,Adana Gallery Suri Square Hijab – Light Pink,Material : Non sheer shimmer chiffonSizes : 5...
1,Cuba Heartbreaker Eau De Parfum Spray 100ml/3...,Formulated with oil-free hydrating botanicals...
2,Andoer 150cm Cellphone Smartphone Mini Dual-H...,150cm mini microphone compatible for iPhone v...
3,ANMYNA Complaint Silky Set 柔顺洗发配套 (Shampoo 52...,ANMYNA Complaint Silky Set (Shampoo 520ml + C...
4,Argital Argiltubo Green Clay For Face and Bod...,100% Authentic Rrefresh and brighten skin Ant...


### Type 1: word vectors - unigram (title)

In [2]:
# title word vector - unigram
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

count_vec = CountVectorizer()
titles_cnt = count_vec.fit_transform(train_data["title"])
dist = np.sum(titles_cnt.toarray(), axis=0)
tfidf_transformer = TfidfTransformer()
titles_tfidf = tfidf_transformer.fit_transform(titles_cnt)
titles_vocab = count_vec.get_feature_names()

print(3,train_data["title"][3])
for v, count in zip(titles_vocab[33764:33775], dist[33764:33775]): print v, count

(3, u' ANMYNA Complaint Silky Set \u67d4\u987a\u6d17\u53d1\u914d\u5957 (Shampoo 520ml + Conditioner 250ml)')
柔顺洗发配套 1
治神經衰弱 1
甜杏仁油 1
石淋通片100 1
逍遥丸 1
钢丝梳 1
雪绒花靓肤水嫩洁面乳 1
해외배송 26
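
Calling `.toarray()` before summing builds a dense array with one cell per (document, term) pair. For the title matrix, whose shape is reported later in this notebook as 36283 × 33772, that is about 1.2 billion cells. A possible alternative, sketched here on made-up documents and assuming Python 3, is to sum the sparse matrix directly; the totals should match the dense version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for train_data["title"].
docs = ["red shoe red", "blue shoe", "red hat"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse document-term count matrix

# Dense route, as in the notebook: materializes every cell first.
dense_totals = np.sum(counts.toarray(), axis=0)

# Sparse route: sum the columns without densifying, then flatten
# the 1 x n_terms result into a plain 1-D array.
sparse_totals = np.asarray(counts.sum(axis=0)).ravel()

print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
print(sparse_totals)
print(np.array_equal(dense_totals, sparse_totals))
```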


In [3]:
# vocabulary list
print len(titles_vocab), "title unigrams"
import csv
with open('./feature/unigram_title_vocab.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    for word, c in zip(titles_vocab,dist):
        wr.writerow([word.encode('utf-8'),c])

33772 title unigrams
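
The cell above relies on Python 2 idioms (`'wb'` mode and a manual `encode`). Under Python 3, one way to write the same `word,count` rows is to open the file in text mode with an explicit encoding. The sketch below assumes Python 3, uses made-up counts in place of `zip(titles_vocab, dist)`, and writes to a temporary directory rather than `./feature/`.

```python
import csv
import os
import tempfile

# Made-up (word, count) pairs; a hypothetical stand-in for zip(titles_vocab, dist).
vocab_counts = [("hijab", 2), ("柔顺洗发配套", 1), ("해외배송", 26)]

path = os.path.join(tempfile.mkdtemp(), "unigram_title_vocab.csv")

# Text mode with an explicit encoding; newline="" is what the csv module expects.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(vocab_counts)

# Read the file back and restore the integer counts.
with open(path, newline="", encoding="utf-8") as f:
    loaded = [(word, int(count)) for word, count in csv.reader(f)]

print(loaded == vocab_counts)
```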


### Type 2: word vectors - bigram (title)

In [4]:
# title word vector - bigram
bigram_vec = CountVectorizer(ngram_range=(2, 2), min_df=1) 
titles_bigram_cnt = bigram_vec.fit_transform(train_data["title"])
dist = np.sum(titles_bigram_cnt.toarray(), axis=0)
titles_bigram_vocab = bigram_vec.get_feature_names()
titles_bigram_tfidf = tfidf_transformer.fit_transform(titles_bigram_cnt)

print(3,train_data["title"][3])
for v, count in zip(titles_bigram_vocab[191059:191070], dist[191059:191070]): print v, count

(3, u' ANMYNA Complaint Silky Set \u67d4\u987a\u6d17\u53d1\u914d\u5957 (Shampoo 520ml + Conditioner 250ml)')
柔顺洗发配套 shampoo 1
治神經衰弱 25 1
石淋通片100 3boxes 1
逍遥丸 治神經衰弱 1
钢丝梳 发网 1
雪绒花靓肤水嫩洁面乳 100g 1
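
The notebook builds its TF-IDF features in two steps: `CountVectorizer`, then `TfidfTransformer`. scikit-learn also provides `TfidfVectorizer`, which its documentation describes as equivalent to that pair. The sketch below, assuming Python 3 and made-up titles, checks the equivalence for bigrams with default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up titles, for illustration only.
docs = ["light pink hijab", "pink silk hijab", "light pink scarf"]

# Two-step route, as in the notebook: raw bigram counts, then TF-IDF weighting.
counts = CountVectorizer(ngram_range=(2, 2)).fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both in a single fit.
one_step = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route keeps the raw counts around, which this notebook needs for the vocabulary CSV files, so the single-step class is an option rather than a drop-in improvement.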


In [5]:
# vocabulary list
print len(titles_bigram_vocab), "title bigrams"
import csv
with open('./feature/bigram_title_vocab.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    for word, count in zip(titles_bigram_vocab,dist):
        wr.writerow([word.encode('utf-8'),count])

191065 title bigrams


### Type 3: word vectors - unigram (desc)

In [6]:
# desc word vector - unigram
count_vec = CountVectorizer()
desc_cnt = count_vec.fit_transform(train_data["short_desc"])
dist = np.sum(desc_cnt.toarray(), axis=0)
tfidf_transformer = TfidfTransformer()
desc_tfidf = tfidf_transformer.fit_transform(desc_cnt)
desc_vocab = count_vec.get_feature_names()

for v, count in zip(desc_vocab[45738:45745], dist[45738:45745]): print v, count

软化血管 1
防止血栓形成 1
阻斷黑色素形成 1
降低血液浓度及血液粘稠度 1
预防与治疗冠心病 1
预防与治疗脑血栓 1
高血糖 1


In [7]:
# vocabulary list
print len(desc_vocab), "desc unigrams"
import csv
with open('./feature/unigram_desc_vocab.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    for word, count in zip(desc_vocab,dist):
        wr.writerow([word.encode('utf-8'),count])

45751 desc unigrams


### Type 4: word vectors - bigram (desc)

In [8]:
# desc word vector - bigram
bigram_vec = CountVectorizer(ngram_range=(2, 2), min_df=1)
desc_bigram_cnt = bigram_vec.fit_transform(train_data["short_desc"])
#dist = np.sum(desc_bigram_cnt.toarray(), axis=0)
desc_bigram_tfidf = tfidf_transformer.fit_transform(desc_bigram_cnt)
bigram_desc_vocab = bigram_vec.get_feature_names()


In [9]:
# vocabulary list
print len(bigram_desc_vocab), "desc bigrams"
import csv
with open('./feature/bigram_desc_vocab.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    for word in bigram_desc_vocab:
        wr.writerow([word.encode('utf-8')])

383607 desc bigrams
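
With `min_df=1`, every bigram that occurs in at least one description becomes a column. The `min_df` parameter can be raised to drop bigrams that appear in too few documents. The sketch below assumes Python 3 and uses made-up descriptions; the threshold of 2 is an arbitrary illustration, and whether pruning helps a downstream model is not something this notebook measures.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up descriptions, for illustration only.
docs = ["soft cotton shirt", "soft cotton towel", "blue cotton shirt"]

# min_df=1 (as in the notebook): keep every bigram seen in any document.
all_bigrams = CountVectorizer(ngram_range=(2, 2), min_df=1).fit(docs)

# min_df=2: keep only bigrams that occur in at least two documents.
frequent_bigrams = CountVectorizer(ngram_range=(2, 2), min_df=2).fit(docs)

print(len(all_bigrams.vocabulary_), sorted(all_bigrams.vocabulary_))
print(len(frequent_bigrams.vocabulary_), sorted(frequent_bigrams.vocabulary_))
```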


## Output: load/save sparse matrix 

In [34]:
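def save_sparse_csr(filename, array):
    # Store the matrix as COO triplets (data, row, col) plus its shape.
    # np.savez appends ".npz" to the filename if it is missing.
    x_coo = array.tocoo()
    np.savez(filename, data=x_coo.data, row=x_coo.row, col=x_coo.col, shape=x_coo.shape)

def load_sparse_csr(filename):
    # Rebuild the matrix from the saved triplets. Despite the name, the result
    # is a COO matrix; call .tocsr() on it before indexing rows.
    from scipy import sparse
    loader = np.load(filename)
    return sparse.coo_matrix((loader['data'], (loader['row'], loader['col'])), shape=loader['shape'])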

In [35]:
save_sparse_csr("unigram_title", titles_tfidf)
save_sparse_csr("bigram_title", titles_bigram_tfidf)
save_sparse_csr("unigram_desc", desc_tfidf)
save_sparse_csr("bigram_desc", desc_bigram_tfidf)
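
SciPy 0.19 and later include `scipy.sparse.save_npz` and `scipy.sparse.load_npz`, which could stand in for the hand-written helpers. The sketch below assumes Python 3, uses a small made-up matrix instead of the TF-IDF features, and writes to a temporary directory.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small made-up matrix standing in for a TF-IDF feature matrix.
matrix = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0],
                                     [0.25, 0.0, 0.75]]))

path = os.path.join(tempfile.mkdtemp(), "toy_matrix.npz")

sparse.save_npz(path, matrix)    # writes a compressed .npz archive by default
restored = sparse.load_npz(path)  # comes back in the format it was saved in

print(restored.format)
print(np.array_equal(matrix.toarray(), restored.toarray()))
```

Because the stored format is preserved, a matrix saved as CSR can be row-indexed right after loading without an extra `.tocsr()` call.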

### Example: loading unigram_title

In [53]:
test = load_sparse_csr("unigram_title.npz")
print test.shape, titles_tfidf.shape
print titles_tfidf[0], "titles_tfidf[0]"
print test.tocsr()[0], "test.tocsr()[0]"

(36283, 33772) (36283, 33772)
  (0, 24307)	0.204266817035
  (0, 19735)	0.213036224452
  (0, 16806)	0.374790199687
  (0, 28670)	0.284224364364
  (0, 29366)	0.493701265662
  (0, 15291)	0.45183799311
  (0, 6064)	0.493701265662 titles_tfidf[0]
  (0, 6064)	0.493701265662
  (0, 15291)	0.45183799311
  (0, 16806)	0.374790199687
  (0, 19735)	0.213036224452
  (0, 24307)	0.204266817035
  (0, 28670)	0.284224364364
  (0, 29366)	0.493701265662 test.tocsr()[0]
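
The two printouts list the same seven (column, value) pairs, only in a different order, because the stored entry order of a sparse matrix is not part of its value. Rather than comparing printed rows by eye, the matrices can be compared directly. The sketch below assumes Python 3 and uses made-up values listed in two different orders.

```python
import numpy as np
from scipy import sparse

# The same three made-up entries of a 1 x 6 row, given in two different orders.
a = sparse.coo_matrix(([0.2, 0.4, 0.5], ([0, 0, 0], [5, 3, 1])), shape=(1, 6)).tocsr()
b = sparse.coo_matrix(([0.5, 0.4, 0.2], ([0, 0, 0], [1, 3, 5])), shape=(1, 6)).tocsr()

# Elementwise inequality yields a sparse boolean matrix; no stored entries
# means the two matrices hold identical values.
n_mismatched = (a != b).nnz

# Dense comparison as a cross-check; fine for a toy matrix, costly for large ones.
same_dense = np.array_equal(a.toarray(), b.toarray())

print(n_mismatched, same_dense)
```

For TF-IDF matrices that have gone through different computations, exact equality may be too strict, and a tolerance-based check on the difference could be more appropriate.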
