# Product Categorization

## Part 7: Multi-Class Text Classification of products using Doc2Vec 

The aim of this project is multi-class text classification of make-up products based on their descriptions and categories. In this approach I used Doc2Vec vectors (DBOW and DM) together with two machine learning algorithms: Logistic Regression and a Linear Support Vector Machine. I first trained a Doc2Vec model to obtain the document vectors and then used those vectors as input to the classification models.

Doc2Vec is a method for representing entire documents, rather than individual words, as vectors. A document can be a single sentence, a paragraph, or an entire book. The Doc2Vec architecture has two algorithms: Distributed Bag of Words (DBOW) and Distributed Memory (DM).

### Dataset

The dataset contains the details about makeup products such as brand, category, description, name, etc. It comes from http://makeup-api.herokuapp.com/.

Attributes:

* Description (text: product description)
* Product type (text)

## Import libraries and data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import re
from time import time
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, f1_score

In [2]:
df = pd.read_csv('data/products_description.csv', header=0, index_col=0)
df.head()

Unnamed: 0,product_type,description
0,lip_liner,Lippie Pencil A long-wearing and high-intensit...
1,lipstick,Blotted Lip Sheer matte lipstick that creates ...
2,lipstick,"Lippie Stix Formula contains Vitamin E, Mango,..."
3,foundation,"Developed for the Selfie Age, our buildable fu..."
4,lipstick,All of our products are free from lead and hea...


First observations:

In [3]:
df.shape

(906, 2)

Checking missing values:

In [8]:
df.isnull().sum()

product_type    0
description     0
dtype: int64

In [7]:
print(df['description'].apply(lambda x: len(x.split(' '))).sum())

94257


There are 94,257 space-separated words in the data and no missing values.
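
The count above relies on `split(' ')`, which treats every single space as a separator. The sketch below, on a made-up sample string (an assumption, not taken from the dataset), shows how that can differ from `split()` when a description contains repeated or trailing spaces, so the raw total may be slightly inflated.

```python
# Hypothetical sample text with a double space and a trailing space.
sample = "long-wearing  lip pencil "

# split(' ') cuts at every single space, so extra spaces produce empty strings.
tokens_single_space = sample.split(' ')

# split() with no argument cuts at runs of whitespace and drops empty strings.
tokens_whitespace = sample.split()

print(tokens_single_space)
print(len(tokens_single_space), len(tokens_whitespace))
```

After the cleaning step later in the notebook the text is re-joined with single spaces, so both variants should agree on the cleaned descriptions.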

## Data Preparation 

In this part I prepared the text data by changing the data type and grouping the products into a smaller number of categories.

In [9]:
# changing data type
df['description'] = df['description'].astype(str)

In [10]:
# grouping data to smaller number of categories
df.loc[df['product_type'].isin(['lipstick','lip_liner']),'product_type'] = 'lipstic'
df.loc[df['product_type'].isin(['blush','bronzer']),'product_type'] = 'contour'
df.loc[df['product_type'].isin(['eyeliner','eyeshadow','mascara','eyebrow']),'product_type'] = 'eye_makeup'

In [11]:
df.product_type.value_counts()

eye_makeup     367
lipstic        176
foundation     159
contour        144
nail_polish     60
Name: product_type, dtype: int64
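
The same grouping can be expressed as a single lookup table. This is only a sketch in plain Python on a hypothetical list of labels; `GROUPS` and `group_label` are names introduced here for illustration. With pandas, a dictionary like this could probably be passed to `df['product_type'].replace(...)` instead of the three `df.loc` assignments.

```python
from collections import Counter

# Mapping from original product types to the grouped labels used above.
# Types that are not listed (e.g. 'foundation', 'nail_polish') keep their name.
GROUPS = {
    'lipstick': 'lipstic', 'lip_liner': 'lipstic',
    'blush': 'contour', 'bronzer': 'contour',
    'eyeliner': 'eye_makeup', 'eyeshadow': 'eye_makeup',
    'mascara': 'eye_makeup', 'eyebrow': 'eye_makeup',
}

def group_label(product_type):
    """Return the grouped label, or the original type if it has no group."""
    return GROUPS.get(product_type, product_type)

# Hypothetical raw labels, one per product.
raw = ['lipstick', 'lip_liner', 'blush', 'mascara', 'foundation', 'nail_polish']
grouped = [group_label(t) for t in raw]
counts = Counter(grouped)
print(counts)
```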

#### Text preprocessing

In this part I removed non-alphabetic characters and stop words, and lemmatized each description.

In [12]:
def cleanText(words):
    """The function to clean text"""
    words = re.sub("[^a-zA-Z]"," ",words)
    text = words.lower().split()
    return " ".join(text)

df['description'] = df['description'].apply(cleanText)

In [13]:
stop = stopwords.words('english')
lem = WordNetLemmatizer()


def remove_stopwords(text):
    """Remove English stop words from the text."""
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

def word_lem(text):
    """Lemmatize every word in the text."""
    lem_text = [lem.lemmatize(word) for word in text.split()]
    return " ".join(lem_text)

In [14]:
df['description'] = df['description'].apply(remove_stopwords)
df['description'] = df['description'].apply(word_lem)
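
The cleaning and stop-word steps can also be combined into one function. The sketch below is a simplified stand-in, not the notebook's exact pipeline: it uses a tiny hand-written stop list (an assumption, the notebook uses NLTK's English list) and leaves out lemmatization so that it runs without downloading NLTK data. Keeping the stop words in a `set` makes each membership check faster than with a list.

```python
import re

# Assumption: a tiny illustrative stop list instead of NLTK's English list.
STOP_WORDS = {"a", "and", "the", "that", "is", "for", "of", "to", "with"}

def preprocess(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase the text and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = preprocess("Lippie Pencil: A long-wearing and high-intensity lip pencil!")
print(cleaned)
```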

In [15]:
df.head()

Unnamed: 0,product_type,description
0,lipstic,lippie pencil long wearing high intensity lip ...
1,lipstic,blotted lip sheer matte lipstick creates perfe...
2,lipstic,lippie stix formula contains vitamin e mango a...
3,foundation,developed selfie age buildable full coverage n...
4,lipstic,product free lead heavy metal parabens phthala...


In [16]:
df['description'] = df['description'].astype(str)

In [17]:
print(df['description'].apply(lambda x: len(x.split(' '))).sum())

60892


After text cleaning and stop-word removal, 60,892 words remain.

In [34]:
#save clean data
df.to_csv(r'C:\Python Scripts\API_products\products_clean.csv', encoding='utf-8')

## Model creation

In this step I split the data into training and test sets and then wrapped each description in a **TaggedDocument** object, which is the input format that Doc2Vec expects.

In [18]:
train, test = train_test_split(df, test_size=0.3, random_state=42)
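
The split above is purely random, so small classes such as `nail_polish` may end up with slightly different shares in the training and test sets. A stratified split keeps the class proportions equal in both parts. The function below is a minimal plain-Python sketch of that idea on hypothetical labels (`stratified_split` is a name made up here); scikit-learn offers the same behaviour through the `stratify` argument of `train_test_split`, which this notebook does not use.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with every class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 20 items of one class and 10 of another.
labels = ['eye_makeup'] * 20 + ['nail_polish'] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```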

In [19]:
train_tag = train.apply(lambda x: TaggedDocument(words=word_tokenize(x['description']), tags=[x.product_type]), axis=1)

test_tag = test.apply(lambda x: TaggedDocument(words=word_tokenize(x['description']), tags=[x.product_type]), axis=1)

In [20]:
train_tag[20]

TaggedDocument(words=['pressed', 'foundation', 'marienatie', 'providing', 'silky', 'flawless', 'finish', 'provides', 'great', 'coverage', 'protects', 'skin', 'spf', 'titanium', 'dioxide', 'act', 'absorbent', 'oil', 'jojoba', 'oil', 'help', 'cleanse', 'moisturize', 'skin'], tags=['foundation'])

In [21]:
test_tag[10]

TaggedDocument(words=['let', 'eye', 'naturally', 'pop', 'b', 'smudged', 'subtle', 'eye', 'color', 'add', 'tint', 'color', 'base', 'lash', 'organic', 'cream', 'eye', 'color', 'b', 'smudged', 'eliminates', 'inevitable', 'uneven', 'line', 'traditional', 'eyeliner', 'require', 'expert', 'blending', 'technique', 'messy', 'powder', 'based', 'shadow', 'simply', 'smudge', 'along', 'lash', 'line', 'color', 'stay', 'place', 'long', 'lasting', 'look'], tags=['eye_makeup'])

### Getting the feature vector from doc2vec model

In this part I initialized the gensim Doc2Vec model. The Doc2Vec architecture is similar to word2vec and has two algorithms, each corresponding to one of the word2vec algorithms. Distributed Bag of Words (DBOW) is similar to the Skip-gram (SG) model in word2vec, except that a paragraph ID vector is used as the input. Distributed Memory (DM) is similar to the Continuous Bag of Words (CBOW) model in word2vec, with the paragraph vector added to the context words.
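
To make the difference between the two algorithms more concrete, the sketch below lists the (input, target) pairs that each one would roughly learn from for a single tagged document. It is a simplified illustration under my own assumptions, not gensim's implementation: real training also involves negative sampling or hierarchical softmax and a randomly shrunk window. The function names are hypothetical.

```python
def dbow_examples(doc_tag, words):
    """DBOW: the document tag alone is used to predict each word of the document."""
    return [(doc_tag, target) for target in words]

def dm_examples(doc_tag, words, window=2):
    """DM: the document tag plus the surrounding words predict the target word."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        examples.append(((doc_tag, *left, *right), target))
    return examples

words = ['sheer', 'matte', 'lipstick', 'creates']
dbow = dbow_examples('lipstic', words)
dm = dm_examples('lipstic', words, window=2)

print(dbow[0])  # input is only the tag
print(dm[2])    # input is the tag together with the context words
```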

### Distributed Bag of Words (DBOW)

- dm = 0, selects the distributed bag of words (DBOW) algorithm;
- vector_size = 100, the dimensionality of the feature vectors;
- window = 2, the maximum distance between the current and the predicted word within a sentence;
- sample = 0, the threshold for configuring which higher-frequency words are randomly downsampled (0 turns downsampling off);
- min_count = 2, ignores all words with a total frequency lower than this.
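
As an illustration of the `min_count` parameter, the sketch below builds a vocabulary from a few hypothetical tokenized documents and keeps only the words that occur at least `min_count` times in total. `build_vocab` here is a stand-alone helper written for this example, not the gensim method of the same name.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=2):
    """Keep only the words whose total frequency is at least min_count."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical tokenized descriptions.
docs = [
    ['matte', 'lipstick', 'long', 'lasting'],
    ['long', 'lasting', 'eyeliner'],
    ['matte', 'finish'],
]
vocab = build_vocab(docs, min_count=2)
print(sorted(vocab))
```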


In [22]:
doc_model = Doc2Vec(dm=0, vector_size=100, min_count=2, window=2, sample = 0)
               
doc_model.build_vocab(train_tag)

  "C extension not loaded, training will be slow. "


In [23]:
doc_model.corpus_total_words

43497

Training for 30 epochs: 

In [24]:
%time doc_model.train(train_tag, total_examples=doc_model.corpus_count, epochs=30) 

Wall time: 10min 32s


In [25]:
doc_model.wv.most_similar('blush')


[('beneficial', 0.3377334177494049),
 ('entire', 0.30722713470458984),
 ('smoother', 0.2944685220718384),
 ('fix', 0.2937236726284027),
 ('intensify', 0.2861747443675995),
 ('b', 0.2843270003795624),
 ('page', 0.2822366952896118),
 ('withstand', 0.279226154088974),
 ('terracotta', 0.2777043879032135),
 ('isopropylparaben', 0.2721264958381653)]
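
The scores in this list are cosine similarities between vectors. The sketch below shows that ranking on hypothetical 2-dimensional vectors (all names and values are made up for illustration). As far as I understand the gensim documentation, pure DBOW with the default `dbow_words=0` trains only the document vectors, which would explain why the word neighbours above are weak and unrelated to blush.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=2):
    """Rank every other key by cosine similarity to the query vector."""
    scores = [(key, cosine_similarity(vectors[query], vec))
              for key, vec in vectors.items() if key != query]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:topn]

# Hypothetical 2-d word vectors.
vectors = {
    'blush': [1.0, 0.0],
    'bronzer': [0.9, 0.1],
    'mascara': [0.0, 1.0],
    'polish': [-1.0, 0.0],
}
ranked = most_similar('blush', vectors)
print(ranked)
```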

In [32]:
#save model
doc_model.save('model.doc2vec')

##### Building the Final Vector Feature for the Classifier:

In [26]:
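def vector_for_learning(model, input_docs):
    """Infer a vector for every tagged document and return (targets, feature_vectors)."""
    # Note: gensim 4.x renamed the `steps` argument of infer_vector to `epochs`.
    targets, feature_vectors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in input_docs])
    return targets, feature_vectors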

In [27]:
y_train, X_train = vector_for_learning(doc_model, train_tag)
y_test, X_test = vector_for_learning(doc_model, test_tag)
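
The `zip(*...)` call in `vector_for_learning` turns a list of (label, vector) pairs into two parallel tuples. A small sketch with hypothetical values shows the shape of the result:

```python
# Hypothetical (label, vector) pairs, one per document.
pairs = [
    ('lipstic', [0.1, 0.2]),
    ('eye_makeup', [0.3, 0.4]),
    ('contour', [0.5, 0.6]),
]

# zip(*pairs) regroups the pairs: all labels first, then all vectors.
targets, feature_vectors = zip(*pairs)

print(targets)
print(feature_vectors)
```

Both results are tuples of the same length, with the i-th label matching the i-th vector.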

### Training the Classifier model

I chose a Logistic Regression classifier and a Linear Support Vector Machine.

***Logistic Regression with DBOW***

In [28]:
log_reg = LogisticRegression(n_jobs=1, C=5)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)



##### Testing the Model

In [29]:
print('Testing accuracy %s' % accuracy_score(y_pred, y_test))
print('Testing F1 score: {}'.format(f1_score(y_test, y_pred, average='weighted')))

Testing accuracy 0.9154411764705882
Testing F1 score: 0.9149097862036379


In [30]:
ytest = np.array(y_test)
print(classification_report(ytest, y_pred))

              precision    recall  f1-score   support

     contour       0.84      0.90      0.87        40
  eye_makeup       0.90      0.98      0.94       115
  foundation       0.98      0.87      0.92        53
     lipstic       0.93      0.82      0.87        51
 nail_polish       1.00      0.92      0.96        13

    accuracy                           0.92       272
   macro avg       0.93      0.90      0.91       272
weighted avg       0.92      0.92      0.91       272
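
The weighted F1 score reported above is the per-class F1 averaged with each class's support as its weight. The sketch below recomputes it by hand on hypothetical labels, to show where the number comes from; it is an illustration of the definition, not a replacement for `f1_score(..., average='weighted')`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by the class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score

# Hypothetical labels: class 'a' has support 3, class 'b' has support 1.
y_true_toy = ['a', 'a', 'a', 'b']
y_pred_toy = ['a', 'a', 'b', 'b']
toy_score = weighted_f1(y_true_toy, y_pred_toy)
print(round(toy_score, 4))
```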



***Linear Support Vector Machine with DBOW***

In [31]:
svm = LinearSVC()
svm.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [32]:
preds = svm.predict(X_test)
print('Testing accuracy %s' % accuracy_score(preds, y_test))
print('Testing F1 score: {}'.format(f1_score(y_test, preds, average='weighted')))

Testing accuracy 0.9154411764705882
Testing F1 score: 0.9154729253555255


In [33]:
print(classification_report(ytest, preds))

              precision    recall  f1-score   support

     contour       0.82      0.90      0.86        40
  eye_makeup       0.91      0.97      0.94       115
  foundation       1.00      0.87      0.93        53
     lipstic       0.91      0.84      0.88        51
 nail_polish       1.00      0.92      0.96        13

    accuracy                           0.92       272
   macro avg       0.93      0.90      0.91       272
weighted avg       0.92      0.92      0.92       272



### Distributed Memory (DM)

I created a Distributed Memory (DM) model with a vector size of 100 and trained it on the training corpus for 30 epochs.

Distributed Memory (DM) acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. While the word vectors represent the concept of a word, the document vector is intended to represent the concept of a document.

In [34]:
dm_model = Doc2Vec(dm=1, vector_size=100, min_count=2, window=2, sample = 0, negative=5, alpha=0.025, min_alpha=0.001)
dm_model.build_vocab(train_tag)

  "C extension not loaded, training will be slow. "


In [35]:
dm_model.corpus_total_words

43497

In [36]:
%time dm_model.train(train_tag, total_examples=dm_model.corpus_count, epochs=30) 

Wall time: 16min 19s


In [21]:
dm_model.save('dm_model.doc2vec')

Extract training vectors:

In [37]:
y_train_dm, X_train_dm = vector_for_learning(dm_model, train_tag)
y_test_dm, X_test_dm = vector_for_learning(dm_model, test_tag)

Training and testing Classifiers:

***Logistic Regression with DM***

In [38]:
log_reg = LogisticRegression(n_jobs=1, C=5)
log_reg.fit(X_train_dm, y_train_dm)
pred = log_reg.predict(X_test_dm)

print('Testing accuracy %s' % accuracy_score(y_test_dm, pred))
print('Testing F1 score: {}'.format(f1_score(y_test_dm, pred, average='weighted')))



Testing accuracy 0.9007352941176471
Testing F1 score: 0.9005217056526574


In [39]:
ytest = np.array(y_test_dm)
print(classification_report(ytest, pred))

              precision    recall  f1-score   support

     contour       0.78      0.90      0.84        40
  eye_makeup       0.93      0.97      0.95       115
  foundation       0.87      0.87      0.87        53
     lipstic       0.95      0.78      0.86        51
 nail_polish       1.00      0.92      0.96        13

    accuracy                           0.90       272
   macro avg       0.91      0.89      0.89       272
weighted avg       0.90      0.90      0.90       272



***Linear Support Vector Machine with DM***

In [40]:
svm = LinearSVC()
svm.fit(X_train_dm, y_train_dm)
pred_y = svm.predict(X_test_dm)



In [41]:
print('Testing accuracy %s' % accuracy_score(pred_y, y_test_dm))
print('Testing F1 score: {}'.format(f1_score(y_test_dm, pred_y, average='weighted')))

Testing accuracy 0.9007352941176471
Testing F1 score: 0.9004546938043879


In [42]:
print(classification_report(ytest, pred_y))

              precision    recall  f1-score   support

     contour       0.80      0.88      0.83        40
  eye_makeup       0.95      0.97      0.96       115
  foundation       0.84      0.87      0.85        53
     lipstic       0.93      0.78      0.85        51
 nail_polish       1.00      0.92      0.96        13

    accuracy                           0.90       272
   macro avg       0.90      0.88      0.89       272
weighted avg       0.90      0.90      0.90       272



The results show that the DBOW model achieves better accuracy than the DM model: about 92% versus 90% for both classifiers.

## Summary

For this project I used the training set to train Doc2Vec models (DBOW and DM) on the product descriptions and used the resulting document vectors as features for Logistic Regression and Linear Support Vector Machine classifiers, which had achieved the best results in the previous experiment (Bag of Words combined with TF-IDF). I achieved about 92% accuracy for both SVM and Logistic Regression with the DBOW method.