## Multi-Class Text Classification for E-commerce products using Doc2Vec 

This project performs multi-class classification of e-commerce products into categories, using Doc2Vec vectors built from their text descriptions.

Doc2Vec is a method for representing entire documents, rather than individual words, as vectors. A document can be a single sentence, a paragraph, or an entire book. Doc2Vec has two algorithms: Distributed Bag of Words (DBOW) and Distributed Memory (DM).

In this project, we use Doc2Vec to obtain document vectors and use them as input to the classification models.

### Importing packages and loading data

In [85]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import re
from time import time
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, f1_score

In [86]:
df = pd.read_csv('ecommerceDataset_1.csv')
df.head()

Unnamed: 0,Label,Text
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,Household,Incredible Gifts India Wooden Happy Birthday U...


In [87]:
df.shape

(50425, 2)

In [88]:
print(df['Text'].astype("str").apply(lambda x: len(x.split(' '))).sum())

5978795


We have 5978795 words in the data, counted by splitting each text on single spaces.
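
One detail about this count: `split(' ')` splits on every single space, so runs of spaces produce empty strings that are counted as words. The following is a small sketch of the difference, using a made-up sample string (not a row from the dataset):

```python
# Hypothetical sample text with a double space and a trailing space.
sample = "SAF  Floral Framed Painting "

# split(' ') splits on each single space, so empty strings are produced.
tokens_single_space = sample.split(' ')
# split() with no argument splits on runs of whitespace and drops empties.
tokens_whitespace = sample.split()

print(tokens_single_space)
print(len(tokens_single_space), len(tokens_whitespace))
```

After cleaning, the texts are joined with single spaces, so the two counts should agree much more closely there than on the raw text.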

### Text data cleaning

In [89]:
#checking missing values
df.isnull().sum()

Label    0
Text     4
dtype: int64

In [90]:
#changing data type (note: the 4 missing values become the literal string 'nan')
df['Text'] = df['Text'].astype(str)

In [91]:
df.Label.value_counts()

Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: Label, dtype: int64

In the next step we remove non-alphabetic characters and stopwords, and lemmatize each line of text:

In [92]:
def cleanText(words):
    """The function to clean text"""
    words = re.sub("[^a-zA-Z]"," ",words)
    text = words.lower().split()
    return " ".join(text)

df['Text'] = df['Text'].apply(cleanText)
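
To make the effect of this cleaning step concrete, here is a minimal sketch that applies the same regular expression to a made-up product title (modelled on the rows above, not copied from the dataset). The name `clean_text_demo` is hypothetical and only used for this illustration.

```python
import re

def clean_text_demo(words):
    """Same steps as cleanText: keep letters only, lowercase, collapse whitespace."""
    words = re.sub("[^a-zA-Z]", " ", words)
    return " ".join(words.lower().split())

# Hypothetical sample title, for illustration only.
sample = "SAF 'Floral' Framed Painting (Wood, 30 inch x 10 inch)"
print(clean_text_demo(sample))
```

Because every non-letter is replaced, digits are dropped as well, so size information such as "30 inch" is reduced to "inch".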

In [93]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [94]:
stop = set(stopwords.words('english'))  # set for fast membership tests
lem = WordNetLemmatizer()


def remove_stopwords(text):
    """Lowercase the text and remove English stopwords."""
    text = [word.lower() for word in text.split() if word.lower() not in stop]
    return " ".join(text)

def word_lem(text):
    """Lemmatize each word in the text."""
    lem_text = [lem.lemmatize(word) for word in text.split()]
    return " ".join(lem_text)

In [95]:
df['Text'] = df['Text'].apply(remove_stopwords)
df['Text'] = df['Text'].apply(word_lem)

In [96]:
df.head()

Unnamed: 0,Label,Text
0,Household,paper plane design framed wall hanging motivat...
1,Household,saf floral framed painting wood inch x inch sp...
2,Household,saf uv textured modern art print framed painti...
3,Household,saf flower print framed painting synthetic inc...
4,Household,incredible gift india wooden happy birthday un...


In [97]:
df['Text'] = df['Text'].astype(str)

In [98]:
print(df['Text'].apply(lambda x: len(x.split(' '))).sum())

3768558


After text cleaning and removing stop words, we have only 3768558 words.

In [99]:
#save clean data
df.to_csv('products_clean.csv', encoding='utf-8')

### Data preparation

Now we split the data into training and test sets. Then we wrap each row in a **TaggedDocument** object, the input format Doc2Vec expects. Each document is tagged with its class label.

In [100]:
train, test = train_test_split(df, test_size=0.3, random_state=42)

In [58]:
train_tag = train.apply(lambda x: TaggedDocument(words=word_tokenize(x['Text']), tags=[x.Label]), axis=1)

test_tag = test.apply(lambda x: TaggedDocument(words=word_tokenize(x['Text']), tags=[x.Label]), axis=1)
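
Since every document is tagged with its class label, all documents of a class share one tag, and Doc2Vec learns one vector per distinct tag during training. The per-document vectors used later come from inference. The sketch below contrasts this with a common alternative, a unique ID per document with the label kept as a second tag. It uses a stand-in `namedtuple` with the same two fields as gensim's `TaggedDocument` so that it runs without gensim. The rows and the `doc_{i}` naming are assumptions for illustration.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument, so the
# sketch runs without gensim installed (an assumption for illustration).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical cleaned rows: (label, text)
rows = [
    ("Household", "saf floral framed painting"),
    ("Books", "gardening urban india update"),
    ("Household", "wooden happy birthday frame"),
]

# Tagging by label only, as in the notebook: documents of a class share a tag.
by_label = [TaggedDoc(text.split(), [label]) for label, text in rows]

# Alternative: a unique id per document, with the label kept as a second tag.
by_id = [TaggedDoc(text.split(), [f"doc_{i}", label])
         for i, (label, text) in enumerate(rows)]

unique_label_tags = {tag for doc in by_label for tag in doc.tags}
unique_id_tags = {tag for doc in by_id for tag in doc.tags}
print(len(unique_label_tags), len(unique_id_tags))
```

With the alternative tagging, the label would have to be read from `tags[1]` rather than `tags[0]` when building the classifier targets.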

In [59]:
train_tag[20]

TaggedDocument(words=['saf', 'uv', 'textured', 'modern', 'art', 'print', 'framed', 'painting', 'synthetic', 'cm', 'x', 'cm', 'x', 'cm', 'set', 'color', 'multicolor', 'size', 'cm', 'x', 'cm', 'x', 'cm', 'overview', 'beautiful', 'painting', 'involves', 'action', 'skill', 'using', 'paint', 'right', 'manner', 'hence', 'end', 'product', 'picture', 'speak', 'thousand', 'word', 'say', 'art', 'trend', 'quite', 'time', 'give', 'different', 'viewer', 'different', 'meaning', 'style', 'design', 'saf', 'wood', 'matte', 'abstract', 'painting', 'frame', 'quite', 'abstract', 'mysteriously', 'beautiful', 'painting', 'nice', 'frame', 'gift', 'family', 'friend', 'painting', 'various', 'form', 'certain', 'figure', 'seen', 'image', 'add', 'good', 'set', 'light', 'place', 'painting', 'decor', 'give', 'different', 'feel', 'look', 'place', 'quality', 'durability', 'painting', 'matte', 'finish', 'includes', 'good', 'quality', 'frame', 'last', 'long', 'period', 'however', 'include', 'glass', 'along', 'frame', '

In [61]:
test_tag #[10]

35848    ([kandy, men, regular, fit, blazer, blue, prod...
13005    ([healthsense, chef, mate, k, digital, kitchen...
22719    ([concept, physic, session, set, volume], [Boo...
18453    ([lista, stainless, steel, multi, functional, ...
20867         ([gardening, urban, india, update], [Books])
                               ...                        
36241    ([yellow, chime, flower, band, golden, ring, w...
48181    ([oem, inch, rear, view, tft, lcd, color, car,...
16408    ([hazel, dustpan, supdi, large, cleaning, prod...
31811    ([noise, noihwp, pixel, multifuntional, polyes...
19120    ([taparia, plastic, tool, box, organizer, ptb,...
Length: 15128, dtype: object

### Getting the feature vector from doc2vec model

In this step we initialize the gensim Doc2Vec model. Doc2Vec is closely related to word2vec, and its two algorithms correspond to the two word2vec algorithms. Distributed Bag of Words (DBOW) is analogous to the skip-gram (SG) model in word2vec, except that a paragraph ID vector is used to predict the words of the document. Distributed Memory (DM) is analogous to the continuous bag of words (CBOW) model in word2vec, with the paragraph vector added to the context words.
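
The difference between the two algorithms can be sketched by listing the prediction tasks each one sets up for a single document. This is a simplified illustration in plain Python, not gensim's implementation: real training also involves details such as randomly reduced windows and negative sampling, which are omitted here. The function names and the sample words are assumptions.

```python
def dbow_examples(tag, words):
    """DBOW: the document tag alone is used to predict each word of the document."""
    return [(tag, target) for target in words]

def dm_examples(tag, words, window=2):
    """DM: the document tag plus surrounding context words predict the centre word."""
    examples = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        examples.append(((tag, left + right), target))
    return examples

# Hypothetical short document, for illustration only.
words = ["paper", "plane", "design", "framed", "wall"]
print(dbow_examples("Household", words)[:2])
print(dm_examples("Household", words)[2])
```

In DBOW the input is only the tag, so word order and context play no role. In DM the tag is combined with up to `window` words on each side of the target.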

#### Distributed Bag of Words (DBOW)

First, we instantiate a Doc2Vec model using Distributed Bag of Words (DBOW). We set the following parameters:

- dm=0: distributed bag of words (DBOW) is used;
- vector_size=100: each document vector has 100 dimensions;
- window=2: the maximum distance between the current and the predicted word within a document;
- sample=0: the threshold for configuring which higher-frequency words are randomly downsampled; 0 disables downsampling;
- min_count=2: ignores all words with a total frequency lower than this.
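
As an illustration of what `min_count` does, the sketch below filters a tiny made-up corpus by total word frequency in plain Python. It mimics the idea of the vocabulary filter and is not gensim's own code. The documents are hypothetical.

```python
from collections import Counter

# Hypothetical tokenised documents.
docs = [
    ["framed", "painting", "wood"],
    ["framed", "painting", "synthetic"],
    ["digital", "kitchen", "scale"],
]

min_count = 2
counts = Counter(word for doc in docs for word in doc)

# Keep only words whose total corpus frequency is at least min_count.
vocab = {word for word, n in counts.items() if n >= min_count}
filtered = [[w for w in doc if w in vocab] for doc in docs]

print(sorted(vocab))
print(filtered)
```

Words that appear only once in the whole corpus are dropped, which can leave some short documents with no words at all.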


In [62]:
doc_model = Doc2Vec(dm=0, vector_size=100, min_count=2, window=2, sample = 0)
               
doc_model.build_vocab(train_tag)

In [63]:
doc_model.corpus_total_words

2650410

Training for 30 epochs: 

In [64]:
%time doc_model.train(train_tag, total_examples=doc_model.corpus_count, epochs=30) 

CPU times: user 4min 34s, sys: 9.5 s, total: 4min 44s
Wall time: 2min 48s


In [65]:
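# Word vectors live on the .wv attribute (calling most_similar on the model itself is deprecated).
# With dm=0 and the default dbow_words=0, word vectors are not trained, so these neighbours are not meaningful.
doc_model.wv.most_similar('blush')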


[('hugging', 0.4502127170562744),
 ('beany', 0.387437105178833),
 ('quenched', 0.3801702857017517),
 ('pasadena', 0.368612676858902),
 ('bolted', 0.36644747853279114),
 ('ublox', 0.353843629360199),
 ('informational', 0.3534359633922577),
 ('hammering', 0.3485444188117981),
 ('title', 0.3460332751274109),
 ('thesis', 0.3458634614944458)]

In [66]:
#save model
doc_model.save('model.doc2vec')

Building the final feature vectors for the classifier:

In [67]:
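def vector_for_learning(model, input_docs):
    """Return (labels, inferred vectors) for an iterable of TaggedDocument objects."""
    # `epochs` replaces the older `steps` argument, which gensim 4 no longer accepts
    targets, feature_vectors = zip(*[(doc.tags[0], model.infer_vector(doc.words, epochs=20)) for doc in input_docs])
    return targets, feature_vectors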

In [68]:
y_train, X_train = vector_for_learning(doc_model, train_tag)
y_test, X_test = vector_for_learning(doc_model, test_tag)

### Training the Classifier

We train two classifiers: Logistic Regression and a Linear Support Vector Machine.

***Logistic Regression with DBOW***

In [69]:
log_reg = LogisticRegression(n_jobs=1, C=5)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


#### Testing the Model

In [70]:
print('Testing accuracy %s' % accuracy_score(y_test, y_pred))
print('Testing F1 score: {}'.format(f1_score(y_test, y_pred, average='weighted')))

Testing accuracy 0.9566367001586462
Testing F1 score: 0.9566194462932198
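
To clarify what the two reported numbers measure, here is a minimal plain-Python sketch of accuracy and support-weighted F1 on a tiny made-up example. It is meant to mirror the definitions behind `accuracy_score` and `f1_score(..., average='weighted')`, not to replace them. The labels below are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Hypothetical labels, for illustration only.
y_true = ["Books", "Books", "Household", "Household", "Household", "Electronics"]
y_pred = ["Books", "Household", "Household", "Household", "Household", "Electronics"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Because the weights are the class supports, the largest class (Household in this dataset) has the most influence on the weighted score.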


In [71]:
ytest = np.array(y_test)
print(classification_report(ytest, y_pred))

                        precision    recall  f1-score   support

                 Books       0.95      0.95      0.95      3534
Clothing & Accessories       0.98      0.97      0.97      2612
           Electronics       0.95      0.94      0.94      3125
             Household       0.96      0.96      0.96      5857

              accuracy                           0.96     15128
             macro avg       0.96      0.96      0.96     15128
          weighted avg       0.96      0.96      0.96     15128



***Linear Support Vector Machine with DBOW***

In [72]:
svm = LinearSVC()
svm.fit(X_train, y_train)



LinearSVC()

In [73]:
preds = svm.predict(X_test)
print('Testing accuracy %s' % accuracy_score(y_test, preds))
print('Testing F1 score: {}'.format(f1_score(y_test, preds, average='weighted')))

Testing accuracy 0.9559095716552088
Testing F1 score: 0.9558713253654214


In [74]:
print(classification_report(ytest, preds))

                        precision    recall  f1-score   support

                 Books       0.95      0.95      0.95      3534
Clothing & Accessories       0.98      0.97      0.98      2612
           Electronics       0.95      0.93      0.94      3125
             Household       0.95      0.96      0.96      5857

              accuracy                           0.96     15128
             macro avg       0.96      0.95      0.96     15128
          weighted avg       0.96      0.96      0.96     15128



#### Distributed Memory (DM)

Next, we instantiate a Distributed Memory (DM) model with 100-dimensional vectors and train it for 30 epochs over the training corpus.

In Distributed Memory (DM), the paragraph vector acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. While the word vectors represent the concept of a word, the document vector is intended to represent the concept of a document.

In [75]:
dm_model = Doc2Vec(dm=1, vector_size=100, min_count=2, window=2, sample = 0, negative=5, alpha=0.025, min_alpha=0.001)
dm_model.build_vocab(train_tag)

In [76]:
dm_model.corpus_total_words

2650410

In [77]:
%time dm_model.train(train_tag, total_examples=dm_model.corpus_count, epochs=30) 

CPU times: user 6min 19s, sys: 22.4 s, total: 6min 42s
Wall time: 4min 28s


In [78]:
dm_model.save('dm_model.doc2vec')

Extract training and test vectors:

In [79]:
y_train_dm, X_train_dm = vector_for_learning(dm_model, train_tag)
y_test_dm, X_test_dm = vector_for_learning(dm_model, test_tag)

Training and testing Classifiers:

***Logistic Regression with DM***

In [80]:
log_reg = LogisticRegression(n_jobs=1, C=5)
log_reg.fit(X_train_dm, y_train_dm)
pred = log_reg.predict(X_test_dm)

print('Testing accuracy %s' % accuracy_score(y_test_dm, pred))
print('Testing F1 score: {}'.format(f1_score(y_test_dm, pred, average='weighted')))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Testing accuracy 0.8695795875198308
Testing F1 score: 0.8696737601789366


In [81]:
ytest = np.array(y_test_dm)
print(classification_report(ytest, pred))

                        precision    recall  f1-score   support

                 Books       0.84      0.89      0.86      3534
Clothing & Accessories       0.84      0.86      0.85      2612
           Electronics       0.87      0.84      0.85      3125
             Household       0.90      0.88      0.89      5857

              accuracy                           0.87     15128
             macro avg       0.86      0.87      0.86     15128
          weighted avg       0.87      0.87      0.87     15128



***Linear Support Vector Machine with DM***

In [82]:
svm = LinearSVC()
svm.fit(X_train_dm, y_train_dm)
pred_y = svm.predict(X_test_dm)



In [83]:
print('Testing accuracy %s' % accuracy_score(y_test_dm, pred_y))
print('Testing F1 score: {}'.format(f1_score(y_test_dm, pred_y, average='weighted')))

Testing accuracy 0.8720914859862506
Testing F1 score: 0.8721579839542756


In [84]:
print(classification_report(ytest, pred_y))

                        precision    recall  f1-score   support

                 Books       0.82      0.90      0.86      3534
Clothing & Accessories       0.86      0.84      0.85      2612
           Electronics       0.89      0.83      0.86      3125
             Household       0.90      0.89      0.90      5857

              accuracy                           0.87     15128
             macro avg       0.87      0.87      0.87     15128
          weighted avg       0.87      0.87      0.87     15128



The results show that the DBOW model clearly outperforms the DM model here: about 0.96 test accuracy with DBOW versus about 0.87 with DM, for both classifiers.

### Conclusion

For this project, we trained Doc2Vec models (DBOW and DM) on the training set of e-commerce product descriptions and used the inferred document vectors as features for Logistic Regression and Linear Support Vector Machine classifiers. With DBOW, both classifiers reached about 96% test accuracy (0.957 for Logistic Regression and 0.956 for the SVM).