# This notebook shows how to develop an SVM classifier for automated item categorization using product names

# Item Categorization

Here, we prepare almost 20,000 item titles from an e-commerce platform, sampled from 19 categories. About 16,000 item titles form the training data, which contains both titles and labels; the remaining 4,000 or so form the testing data, which has no labels.
```
Input: Apple iPhone 8 Plus 64GB
Label: Mobile & Gadget

Input: Nike Running Shoes
Label: Men's Shoes
```

In [1]:
import pandas as pd
# all_data is stored in CSV format

DATA_PATH = "/Users/daveyap/Desktop/github/handson session/data/train.csv"
all_data = pd.read_csv(DATA_PATH, encoding='utf-8')   

# Get a glimpse of our data
##### item_title
```
Simple pre-processing has been applied to filter out emoji and other noisy characters, and all letters have been transformed to lower case.
```
##### label
```
Here, we only provide numbers as labels. 
```

In [2]:
pd.set_option('display.max_colwidth', 2000)  # show long titles without truncation
all_data[['clean_title', 'label']].head(5)

Unnamed: 0,clean_title,label
0,women skirt cute ladies skirt ball gown sweet mini half fitting skirt,1
1,tempered glass stickers sony xz xa glass stickers,8
2,hairclip hair grip pearl hairpins fashion creative gold zinc alloy headdress,6
3,making sugar craft cake mould cake decor baking tool sea coral silicone mold,11
4,korean style large capacity water glass plastic cup space cup outdoor sports ket,11


# Feature Extraction
- Text files are ordered sequences of words.
- To run machine learning algorithms, we need to convert the text into numerical feature vectors.
- We will use the count vector (bag-of-words) model in this example.
- Briefly, we segment each text into words (for English, by splitting on spaces), assign each unique word an integer id, and count the number of times each word occurs in each document. Each unique word in our dictionary corresponds to one feature dimension.
##### Toy Corpus
```
Item 1: skirt office skirt
Item 2: velle box pleated button down midi skirt.
```
##### Core Vocabulary
```
[skirt, office, velle, box, pleated, button, down, midi]
```
##### Vector Transformation
```
Item 1: [2, 1, 0, 0, 0, 0, 0, 0]
Item 2: [1, 0, 1, 1, 1, 1, 1, 1]
```
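
To make the counting step concrete, here is a minimal pure-Python sketch that reproduces the toy vectors above. The helper names (`tokenize`, `build_vocabulary`, `to_count_vector`) are hypothetical and only for illustration. The tokenizer is a simplification and does not match scikit-learn's default tokenization exactly, and ids are assigned in first-seen order so that the columns line up with the core vocabulary listed above.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): lower-case, keep runs of letters/digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    # Assign each unique word an integer id in first-seen order.
    vocab = {}
    for doc in documents:
        for word in tokenize(doc):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def to_count_vector(text, vocab):
    # Count how often each vocabulary word occurs in the text.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:
            vector[vocab[word]] = count
    return vector


toy_corpus = [
    "skirt office skirt",
    "velle box pleated button down midi skirt.",
]
toy_vocab = build_vocabulary(toy_corpus)
toy_vectors = [to_count_vector(doc, toy_vocab) for doc in toy_corpus]

print(list(toy_vocab))
for vector in toy_vectors:
    print(vector)
```

Words that are not in the vocabulary are simply ignored, which is also what happens to unseen words when a fitted vectorizer transforms new text.
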
Here, we use a high-level API in scikit-learn to learn count vectors from the input text.

In [3]:
all_titles = all_data['clean_title'].tolist()
all_labels = all_data['label'].tolist()

In [4]:
print ("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's bag-of-words tool.
# max_features=10000 keeps only the 10000 most frequent words in the vocabulary.
vectorizer = CountVectorizer(max_features=10000)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform() should be a list of
# strings.
all_data_features = vectorizer.fit_transform(all_titles)

# Convert the sparse result to a dense NumPy array so rows can be indexed directly below.
all_data_features = all_data_features.toarray()

Creating the bag of words...



In [5]:
vocab = vectorizer.vocabulary_  # maps each word to its column index
print(len(vocab))

10000


In [6]:
sample_id = 1
vec_sample = all_data_features[sample_id, :]
print(all_titles[sample_id])
print(vec_sample)

tempered glass stickers sony xz xa glass stickers
[0 0 0 ... 0 0 0]


In [7]:
sample_word = 'tempered'
sample_count = all_data_features[sample_id][vocab[sample_word]]
print ("index of %s is %d" %(sample_word, vocab[sample_word]))
print ("count of %s is %d" %(sample_word, sample_count))

index of tempered is 8340
count of tempered is 1
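
Since `vocab` maps words to column indices, inverting it lets us read a count vector back as words. The following is a minimal sketch under stated assumptions: `nonzero_words` is a hypothetical helper, and `toy_vocab` and `toy_vector` are small made-up stand-ins, not the real 10000-dimensional vocabulary.

```python
def nonzero_words(vector, vocab):
    # Invert the word -> index mapping, then keep only words that actually occur.
    index_to_word = {index: word for word, index in vocab.items()}
    return {index_to_word[i]: count for i, count in enumerate(vector) if count > 0}


# Hypothetical five-word vocabulary and count vector, for illustration only.
toy_vocab = {"glass": 0, "sony": 1, "stickers": 2, "tempered": 3, "skirt": 4}
toy_vector = [2, 1, 2, 1, 0]

word_counts = nonzero_words(toy_vector, toy_vocab)
print(word_counts)
```

With the real data, the same idea would apply to `all_data_features[sample_id]` and `vocab`.
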


# Model Training
Here, we use a linear SVM to classify the items into categories.

The roughly 16,000 labelled item titles are split into a training set and a validation set (called the test set in the code below).

In [8]:
train_x = all_data_features[:12000]
train_y = all_labels[:12000]
test_x = all_data_features[12000:]
test_y = all_labels[12000:]
print ("# of training data is %d" %len(train_x))
print ("# of testing data is %d" %len(test_x))

# of training data is 12000
# of testing data is 3987
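
The cell above takes the first 12,000 rows for training, which works well if the rows in the CSV file are already in random order. The notebook does not state this, so it is worth checking. If they are not, a shuffled split is safer. Below is a minimal pure-Python sketch: `shuffled_split` is a hypothetical helper and the toy data is made up for illustration.

```python
import random


def shuffled_split(features, labels, train_size, seed=0):
    # Shuffle row indices with a fixed seed so the split is reproducible.
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    train = ([features[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([features[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Made-up toy data: 10 rows with 2 features each, and alternating labels.
toy_features = [[i, i + 1] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

(toy_train_x, toy_train_y), (toy_test_x, toy_test_y) = shuffled_split(
    toy_features, toy_labels, train_size=7, seed=42
)
print(len(toy_train_x), len(toy_test_x))
```

scikit-learn's `train_test_split` in `sklearn.model_selection` provides the same functionality, including an option to stratify by label.
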


### Fit the SVM on the training data
<img src='image_notebook/svm_illus.png'>

In [9]:
from sklearn import svm
# Initialize the "LinearSVC" object, which is scikit-learn's
# linear SVM model.
lin_clf = svm.LinearSVC()
# fit( ) will do the model training, i.e., learn the model parameters
lin_clf.fit(train_x, train_y) 

LinearSVC()
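
For multi-class problems, `LinearSVC` by default trains one linear scorer per category (one-vs-rest) and predicts the category with the highest score. The sketch below illustrates only that prediction step in pure Python. `ovr_predict` is a hypothetical helper, and the weights, biases, and class labels are made-up numbers, not values learned from this dataset.

```python
def ovr_predict(x, weights, biases, classes):
    # One linear score per class: w . x + b.
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    # Predict the class whose score is highest.
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best], scores


# Made-up example: a 3-word count vector and two classes (labels 1 and 8).
toy_x = [2, 1, 0]
toy_weights = [[1.0, -0.5, 0.0], [-1.0, 0.5, 1.0]]
toy_biases = [0.0, 0.1]
toy_classes = [1, 8]

toy_label, toy_scores = ovr_predict(toy_x, toy_weights, toy_biases, toy_classes)
print(toy_label, toy_scores)
```

In the fitted model, the learned weights and biases are exposed as `lin_clf.coef_` and `lin_clf.intercept_`.
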

In [10]:
from sklearn.metrics import accuracy_score
# predict() does the model prediction, i.e., predicts y from the input x
predict_y = lin_clf.predict(test_x)
# accuracy_score expects the true labels first, then the predictions
print('testing acc is %f' % accuracy_score(test_y, predict_y))

testing acc is 0.801605
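
Accuracy is simply the fraction of predictions that match the true labels. A per-label breakdown can show which categories are driving the errors. The following is a minimal sketch with hypothetical helper names and made-up labels. It is not the notebook's actual predictions.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    # Fraction of positions where the prediction equals the true label.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def per_label_accuracy(true_labels, predicted_labels):
    # For each true label, the fraction of its items that were predicted correctly.
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up labels for illustration only.
toy_true = [1, 1, 8, 8, 8, 11, 11, 11, 11, 6]
toy_pred = [1, 8, 8, 8, 8, 11, 11, 1, 11, 6]

print(accuracy(toy_true, toy_pred))
print(per_label_accuracy(toy_true, toy_pred))
```

Applied to `test_y` and `predict_y`, the first function should agree with `accuracy_score`, while the second would indicate whether a few categories account for most of the mistakes.
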


Let's brainstorm. The accuracy is about 80%, which means roughly 1 in 5 products would be assigned to the wrong category. It seems that product names alone may carry too little information for reliable classification.

If we would like to expand the data to increase the accuracy, what type of data could we acquire? In my view, product descriptions look like a promising source of data to use and test for automated item categorization.