**Background and Context:**

Product categorization also referred to as product classification, is a field of study within natural language processing (NLP). It is also one of the biggest challenges for e-commerce companies. With the advancement of AI technology, researchers have been applying machine learning to product categorization problems.

Product categorization is the placement and organization of products into their respective categories. In that sense, it sounds simple: choose the correct department for a product. However, this process is complicated by the sheer volume of products on many e-commerce platforms. Furthermore, many products could belong to multiple categories. There are many reasons why product categorization is important for e-commerce and marketing. Through the accurate classification of your products, you can increase conversion rates, strengthen your search engine, and improve your site’s Google ranking.

A well-built product taxonomy allows customers to find what they are looking for quickly and easily. Making your site easy to navigate is one of the most important elements of your UX and will lead to higher conversion rates. Correctly categorizing products allows your search engine to fetch products quicker. As a result, you create a quicker and more accurate search engine. Once you have a strong product taxonomy in place, this will allow you to create the relevant landing pages for your products. In turn, Google and other search engines will be able to index your site and your products more easily. In the end, this allows your products to rank higher on search engines, increasing the chance that customers find your site.

To help merchants choose the correct category, e-commerce companies have automated product categorization tools available. After simply inputting the title or a few words about the product, the system can automatically choose the correct category for you.

**Objective:**

To implement the technique of product categorization

In [1]:
# install and import necessary libraries.

#!pip install contractions

!pip install wordcloud



In [2]:
import random


import re, string, unicodedata                          # Import Regex, string and unicodedata - Used for Text PreProcessing.
from bs4 import BeautifulSoup                           # Import BeautifulSoup.

import numpy as np                                      # Import numpy.
import pandas as pd                                     # Import pandas.
import nltk                                             # Import Natural Language Tool-Kit.

nltk.download('stopwords')                              # Download Stopwords.
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords                       # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer.
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.
import matplotlib.pyplot as plt                         # Import plt for visualization


import pandas as pd 
import matplotlib.pyplot as plt                                         # Used for plotting
import seaborn as sns                                                   # Used for plotting
from collections import Counter                                           # count the key-value pairs in an object

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator           # Used for plotting the wordcloud of a corpus
import nltk                                                               # Used for different task of NLP
from nltk.corpus import stopwords                                         # Used for removal of stop words
import warnings                                                           
warnings.filterwarnings("ignore")
from nltk.stem.porter import PorterStemmer                                #Used for Stemming of words in the corpus



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Remove limits on maximum rows and columns as well as column width
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [4]:
data=pd.read_csv('ecommerce_dataset.csv')                             #Importing the data

In [5]:
data.head() # Read the top 5 rows of the data

Unnamed: 0,Label,Text,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43,Unnamed: 44,Unnamed: 45,Unnamed: 46,Unnamed: 47,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62,Unnamed: 63,Unnamed: 64,Unnamed: 65,Unnamed: 66,Unnamed: 67,Unnamed: 68,Unnamed: 69,Unnamed: 70,Unnamed: 71,Unnamed: 72,Unnamed: 73,Unnamed: 74,Unnamed: 75,Unnamed: 76,Unnamed: 77,Unnamed: 78,Unnamed: 79,Unnamed: 80,Unnamed: 81,Unnamed: 82,Unnamed: 83,Unnamed: 84,Unnamed: 85,Unnamed: 86,Unnamed: 87,Unnamed: 88,Unnamed: 89,Unnamed: 90,Unnamed: 91,Unnamed: 92,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98,Unnamed: 99,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109,Unnamed: 110,Unnamed: 111,Unnamed: 112,Unnamed: 113,Unnamed: 114,Unnamed: 115,Unnamed: 116,Unnamed: 117,Unnamed: 118,Unnamed: 119,Unnamed: 120,Unnamed: 121,Unnamed: 122,Unnamed: 123,Unnamed: 124,Unnamed: 125,Unnamed: 126,Unnamed: 127,Unnamed: 128,Unnamed: 129,Unnamed: 130,Unnamed: 131,Unnamed: 132,Unnamed: 133,Unnamed: 134,Unnamed: 135,Unnamed: 136,Unnamed: 137,Unnamed: 138,Unnamed: 139,Unnamed: 140,Unnamed: 141,Unnamed: 142,Unnamed: 143,Unnamed: 144,Unnamed: 145,Unnamed: 146,Unnamed: 147,Unnamed: 148,Unnamed: 149,Unnamed: 150,Unnamed: 151,Unnamed: 152,Unnamed: 153,Unnamed: 154,Unnamed: 155,Unnamed: 156,Unnamed: 157,Unnamed: 158,Unnamed: 159,Unnamed: 160,Unnamed: 161,Unnamed: 162,Unnamed: 163,Unnamed: 164,Unnamed: 165,Unnamed: 166,Unnamed: 167,Unnamed: 168,Unnamed: 169,Unnamed: 170,Unnamed: 171,Unnamed: 172,Unnamed: 173,Unnamed: 174,Unnamed: 175,Unnamed: 176,Unnamed: 177,Unnamed: 178,Unnamed: 179,Unnamed: 180,Unnamed: 181,Unnamed: 182,Unnamed: 183,Unnamed: 184,Unnamed: 185,Unnamed: 186,Unnamed: 187,Unnamed: 188,Unnamed: 189,Unnamed: 190,Unnamed: 191,Unnamed: 192,Unnamed: 193,Unnamed: 194,Unnamed: 195,Unnamed: 196,Unnamed: 197,Unnamed: 198,Unnamed: 199,Unnamed: 200,Unnamed: 201,Unnamed: 202,Unnamed: 203
0,Household,"Styleys Wrought Iron Coat Rack Hanger Creative Fashion Bedroom for Hanging Clothes Shelves, Wrought Iron Racks Standing Coat Rack (Black) Color Name:Black Styleys Coat Stand is great for homes and rooms with limited space, as having one standing rack takes up less space compared to drawers and cupboards. Easy for guests to keep their items, especially bags and scarves, when visiting, as they can always keep an eye on it and easily grab it when they're leaving. Makes a smart décor piece for your home or room as occupied stands can show off your stylish handbags, accessories, and hangman achievement medals. Dimensions: 45cm x 31cm x 175cm Weight: 2.4kg Material: steel Colour: white, black, or pink No. of hook: 7 + 3 (straight pegs) Suitable to hang coats, clothes, scarves, handbags, hats, and accessories",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Household,"Cuisinart CCO-50BKN Deluxe Electric Can Opener, Black Size:None | Color Name:Black Style, convenience, and power come together in the Cuisinart electric can open. With chrome accents and elegant contours, it fits nicely with other modern countertop appliances. The easy single-touc",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Household,Anchor Penta 6 Amp 1 -Way Switch (White) - Pack of 20 Anchor Penta 6 Amp 1 -Way Switch (White)- Pack of 20 comes with Spark Shield - Concealed Terminals - Silver Cadmium Contacts - IP 20 Protection - Captive Screw.,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,Clothing & Accessories,"Proline Men's Track Jacket Proline Woven, 100% Polyester High neck Wind Cheater with colour Blocked Detail",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,Household,"Chef's Garage 2 Slot Edge Grip Kitchen Knife Sharpener, Helps to Sharpen The Dull Knives (Black) Chef's Garage Mini Knife sharpener helps to sharpen your dull knives. This tiny knife sharpener has 2 stage sharpening system. First stage is for damaged and dull knives, it will sharpen the knife on the coarse slot. The coarse slot is made of carbide. Second stage is fine slot, once you have honed the knife on coarse slot it will helps to give the finishing touch. The fine slot is made of ceramic for fine sharpening. It’s give a quick touch up on already sharper knives or for finishing off knives that have already passed through the coarse slot.Also it comes with one of the unique edge grip feature to sharpen on the edge of the table or counter top. Key Features: Very easy to use. Non-slip base for added stability and control Carbide and ceramic blades on these sharpening slots are long lasting. Strong and hard with flexibility of an edge grip feature for bigger knives Small in size 9.50 x 5.0 x 4.50 cms. Weights less - 70 grams Instructions:1. Insert the blade into the slot at a 90-degree angle to the mini sharpener.2. Place the edge in coarse slot (Black in color)3. Pull the knife straight back towards you 2 to 3 times while applying a light pressure.4. Place the blade in fine slot (White in color)5. Pull the knife straight back towards you 5 to 6 times while applying a heavy pressure.6. If blade is still dull repeat these steps until blade is sharp.",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [6]:
def process_main_data(real_data,Label):   # Function to clean the text to remove duplicates and label to have four labels only.
    list_label=['Household','Books','Clothing & Accessories', 'Electronics']
    text_label=[]
    for i,j in enumerate(real_data["Label"]):
      if j not in list_label:
        j='not_found'
        text_label.append(i)
    real_data_drop=real_data.drop(index=real_data.iloc[text_label].index)
    real_data_drop=real_data_drop[~real_data_drop.duplicated()]
    real_data_drop=real_data_drop.loc[:, ["Label","Text"]]
    return real_data_drop

In [7]:
real_data_drop=process_main_data(data,'Label')

In [8]:
real_data_new=real_data_drop.copy()

## Strip html tags

In [9]:
def strip_html(words):
    soup = str(BeautifulSoup(words, "html.parser").get_text()) 
      
    return soup

In [10]:
def strip_html_2(words):
    new_words = []
    for word in words:
      soup = BeautifulSoup(word, "html.parser").get_text()  
      new_words.append(str(soup))
    return new_words

#### Pre processing: Remove https


In [11]:
def remove_https(words):                                     # Function to remove https
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = re.sub(r"http\S+", " ", str(word))
        if new_word != '':
            new_words.append(new_word)    # Append processed words to new list.
    return new_words

In [12]:
def remove_https_2(words):
    new_words = re.sub(r"http\S+", " ", str(words))
    return new_words

#### Pre processing: De-contraction of words


In [13]:
!pip install contractions



In [14]:
import contractions

In [15]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(str(text))



#### Pre Processing : Removal of number


In [16]:
def remove_digits(words):
  new_words = []                        # Create empty list to store pre-processed words.
  for word in words:
      new_word = re.sub(r'\d+', ' ', str(word))
      if new_word != ' ':
          new_words.append(new_word)    # Append processed words to new list.
  return new_words

In [17]:
def remove_digits_2(words):
    new_words = re.sub(r'\d+', ' ', str(words))
    return new_words

In [18]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]   

True

In [19]:
 import nltk
 nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Pre Processing - Tokenization

In [20]:
def word_tokenizing(words):
  new_words=nltk.word_tokenize(str(words))
  return new_words


### Preprocessing-Lowercase

In [21]:
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []     
    for word in words:
        new_word = word.lower()           # Converting to lowercase
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

### Preprocessing: Removal of Punctuation

In [22]:
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = re.sub(r'[^\w\s]', ' ', word)
        if new_word != ' ':
            new_words.append(new_word)    # Append processed words to new list.
    return new_words

In [23]:
def remove_punctuation_2(words):
    """Remove punctuation from list of tokenized words"""
    new_words = re.sub(r'[^\w\s]', ' ', str(words))
    return new_words

### Pre-Processing - Removal of stopwords


In [24]:
import nltk
nltk.download('stopwords')      #Downloading nltk corpus

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [25]:
stopwords = stopwords.words('english')
"""
customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

#stopwords = list(set(stopwords) - set(customlist)) 
# Custom stop-word's list will not matter in this analysis as we are not interested in positive or negative sentiments.

 
"""
stopwords = list(set(stopwords)) 
stopwords_2=stopwords

In [26]:
def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        if word not in stopwords:
            new_words.append(word)        # Append processed words to new list.
    return new_words

### Pre-Processing: Lemmatization


In [27]:
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.

In [28]:
import nltk
nltk.download('wordnet')



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [29]:
from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.
lemmatizer = WordNetLemmatizer()


In [30]:
def lemmatize_list(words):
    new_words = []
    for word in words:
      new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words

### Pre-Processing: Remove Non-ASCII


In [31]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)        # Append processed words to new list.
    return new_words

### Pre-Processing: Remove apostrophe or other non-lowercase alphabets after other preprocessing 

In [32]:
def remove_apostrophe(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word = re.sub(r'[^a-z]+', ' ', word)
        new_words.append(new_word)    # Append processed words to new list.
    return new_words

In [33]:
! pip install -U textblob
! python -m textblob.download_corpora

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\amina\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!

In [34]:
from textblob import TextBlob 
from textblob import Word

### Pre-Processing: Auto-correct wrongly spelt words


In [35]:
def autocorrect_words(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []                        # Create empty list to store pre-processed words.
    for word in words:
        new_word=str(TextBlob(word).correct())
        new_words.append(new_word)    # Append processed words to new list.
    return new_words

## Pipeline for productionizing the model


In [36]:
# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder


# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To be used for creating pipelines and personalizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

from sklearn.feature_extraction.text import CountVectorizer


# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To supress warnings
import warnings

warnings.filterwarnings("ignore")

# This will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

<IPython.core.display.Javascript object>

In [37]:
from sklearn import preprocessing


<IPython.core.display.Javascript object>

In [38]:
def process_main_data(
    real_data, Label
):  # Function to clean the text to remove duplicates and label to have four labels only.
    list_label = ["Household", "Books", "Clothing & Accessories", "Electronics"]
    text_label = []
    for i, j in enumerate(real_data["Label"]):
        if j not in list_label:
            j = "not_found"
            text_label.append(i)
    real_data_drop = real_data.drop(index=real_data.iloc[text_label].index)
    real_data_drop = real_data_drop[~real_data_drop.duplicated()]
    real_data_drop = real_data_drop.loc[:, ["Label", "Text"]]
    return real_data_drop

<IPython.core.display.Javascript object>

In [39]:
creating_data = process_main_data(data, "Label")

<IPython.core.display.Javascript object>

In [40]:
creating_data["Label"].replace(
    {"Household": 0, "Books": 1, "Clothing & Accessories": 2, "Electronics": 3},
    inplace=True,
)

<IPython.core.display.Javascript object>

In [41]:
creating_data["Label"].value_counts()

0    10563
1     6256
2     5675
3     5308
Name: Label, dtype: int64

<IPython.core.display.Javascript object>

In [42]:
def normalize_3(words):  # Normalize the functions/data
    words = strip_html(str(words))
    words = remove_https_2(str(words))
    words = replace_contractions(str(words))
    words = remove_digits_2(str(words))

    words = word_tokenizing(str(words))
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)

    words = lemmatize_list(words)

    words = remove_non_ascii(words)
    words = remove_apostrophe(words)
    # words=autocorrect_words(words)  # Not needed in this project - to maintain original intent of data collected

    return " ".join(words)

<IPython.core.display.Javascript object>

In [43]:
creating_data_ = creating_data.copy()


<IPython.core.display.Javascript object>

In [44]:
creating_data_["Text"] = creating_data_.apply(
    lambda row: normalize_3(str(row["Text"])), axis=1
)  # fairly cleaned data input

creating_data_.sample(n=25)

Unnamed: 0,Label,Text
9378,0,laminea pcs cabinet handle drawe decorative pull knob hardware ss handle inch length inch rust corrosion proof manufacture superior quality material modern tool guidance professionals solid steel exquisitely finish laminea supreme leader well establish organization today cater requirements world best brand architectural hardware hardware specifications accurate first time review plan walk job ask question ensure get exactly need laminea make every efforts team clients provide wide ranging hardware solutions residences businesses
46870,1,bed procrustes philosophical practical aphorisms
44135,1,ted talk official ted guide public speak tip trick give unforgettable speeches presentations review single recipe great speech course essential ingredients ted team set concision verve wit also ingredients inspire contemporary guide venerable arts oratory excellent easily best public speak guide read nobody world better understand art science public speak chris anderson absolutely perfect person write book gift many this insightful book ever write public speaking it also brilliant profound look communicate ever plan utter sound must read give hope word actually change world the ted talk reinvent art rhetoric st century know ideas worth spread indeed spread far wide clarity panache behind revolution lie chris anderson vision powerful ideas improve world develop coherent philosophy set guidelines compel communication the ted talk may well define essay genre time pamphlet th century newspaper op ed twentieth ted talk guidebook new language write man make global force for anyone get story tell audience want engage book must read book description official ted guide public speak new york time bestseller see product description
25011,3,hp sprocket portable photo printer black colour black instant snapshots anywhere create x cm x inch stickable snapshots virtually anywhere the rechargeable battery print sheet paper per charge recharge minutes
12341,0,swarg label plastic handheld garment facial electric iron steam portable handy vapour steamer pink handheld garment facial steamer product characteristics reduce wrinkle remove dust sterilize disinfect high temperature steam humidifying care face physiotherapy soft steam secure harm clothe also convenient carry since ligh weight built in visible water level tank own double safety protection device cut power short water remove dust reduce wrinkle sterilization disinfection humidification facial care beauty facial steamer deep skin moisturize deep skin cleanse soften skin cuticle smooth whiten skin
1235,3,lussoliv stereo stylus needle vinyl lp usb turntable turnplate black suitable model include ec b ec b ec l ec p etc turntable player great substitution old stylus turntable turnplate vinyl lp turntable player record player vinyl easy install excellent sound quality strong track ability perfect substitution backup cartridge stylus tip stereo spherical tip size x x mm
6406,0,scientific devices cable float level switch meter cable length size pack cable float level switch meter length float size mm x mm x mm adjustable stopper mm diameter x height spdt no c no micro switch amp vac suitable deg c temperature bar pressure
16995,2,fashion care women s royal crepe skirt kcbc multicolour shop wide range ethnic bottom wear fashion care amazon pair gorgeous kurta complete look
14982,0,vinod cookware breman sauce pot litres make yummy curry simmer away delicious soup boil vegetables sauce pot come convenience temper glass lid cook away serve
17804,1,write practice book set hindi inikao early learn set book hindi write practice early learn stage cover basic character startup word hindi bundle hindi aksharmala


<IPython.core.display.Javascript object>

In [45]:
labels = creating_data_["Label"]
from sklearn.utils import class_weight  # To balance an unbalanced dataset

labelList = labels.unique()
print(labelList)
class_weights = class_weight.compute_class_weight(
    class_weight="balanced", classes=np.array(labelList), y=labels.values.reshape(-1)
)
class_weights = dict(zip(np.array(range(len(labelList))), class_weights))
# print calculated class weights
class_weights


[0 2 3 1]


{0: 0.6580043548234403,
 1: 1.2247577092511013,
 2: 1.309438583270535,
 3: 1.1110134271099745}

<IPython.core.display.Javascript object>

In [46]:
type(creating_data_)

pandas.core.frame.DataFrame

<IPython.core.display.Javascript object>

In [47]:
def myProcessingSteps(df):
    df = pd.DataFrame(df)
    df["Text"] = df.apply(lambda row: normalize_3(str(row["Text"])), axis=1)
    bow_vec = CountVectorizer(max_features=2000)
    bow_vec.fit_transform(
        creating_data_["Text"]
    )  # fit  and transform only the original training set
    data_features = bow_vec.transform(
        df["Text"]
    )  # Only transform the new testing set don't fit
    data_features = data_features.toarray()
    return data_features

<IPython.core.display.Javascript object>

In [48]:
myProcessingSteps(data)


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

<IPython.core.display.Javascript object>

In [49]:
# The function created for processing the data should be passed as an arugument in the FunctionTransformer
processing = FunctionTransformer(myProcessingSteps)

<IPython.core.display.Javascript object>

In [50]:
pipe_1 = Pipeline(
    steps=[        

        ("data_processing", processing),
        ("RF", RandomForestClassifier(n_estimators=19, class_weight=class_weights)),
    ]
)



<IPython.core.display.Javascript object>

In [51]:
pipe_2 = Pipeline(
    steps=[
        ("data_processing", processing),
        (
            "XGB",
           XGBClassifier(
                random_state=1,
                n_estimators=100,
                subsample=0.8,
                learning_rate=0.2,
                max_depth=4,
            ),
        ),
    ]
)


<IPython.core.display.Javascript object>

In [52]:
pipe_4 = Pipeline(
    steps=[
        ("data_processing", processing),
        (
            "GBC",
            GradientBoostingClassifier(
                random_state=1,
                n_estimators=100,
                subsample=0.8,
                learning_rate=0.2,
                max_depth=4,
            ),
        ),
    ]
)


<IPython.core.display.Javascript object>

In [53]:
data = pd.read_csv("ecommerce_dataset.csv")
data = process_main_data(data, "Label")
# data.head()

<IPython.core.display.Javascript object>

In [54]:
y = data.iloc[:, :-1]
X = data.iloc[:, -1]
# y.head()

<IPython.core.display.Javascript object>

In [55]:
X = pd.DataFrame(X, columns=["Text"])
y = pd.DataFrame(y, columns=["Label"])

y["Label"].replace(
    {"Household": 0, "Books": 1, "Clothing & Accessories": 2, "Electronics": 3},
    inplace=True,
)
# X.head()

<IPython.core.display.Javascript object>

In [56]:
# Split data into training and testing set for (count)vectorized data..

from sklearn.model_selection import train_test_split

X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

<IPython.core.display.Javascript object>

In [57]:
pipe_1.fit(X_train_4, y_train_4)  # Random Forest Model

Pipeline(steps=[('data_processing',
                 FunctionTransformer(func=<function myProcessingSteps at 0x000001FA3D50AC10>)),
                ('RF',
                 RandomForestClassifier(class_weight={0: 0.6580043548234403,
                                                      1: 1.2247577092511013,
                                                      2: 1.309438583270535,
                                                      3: 1.1110134271099745},
                                        n_estimators=19))])

<IPython.core.display.Javascript object>

In [58]:
pipe_2.fit(X_train_4, y_train_4)  # XGBoost Model



Pipeline(steps=[('data_processing',
                 FunctionTransformer(func=<function myProcessingSteps at 0x000001FA3D50AC10>)),
                ('XGB',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               gamma=0, gpu_id=-1, importance_type=None,
                               interaction_constraints='', learning_rate=0.2,
                               max_delta_step=0, max_depth=4,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=8, num_parallel_tree=1,
                               objective='multi:softprob', predictor='auto',
                               random_state=1, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=None, subsample=

<IPython.core.display.Javascript object>

In [61]:
pipe_1.predict(X_test_4)

array([0, 0, 0, ..., 3, 1, 0], dtype=int64)

<IPython.core.display.Javascript object>

In [62]:
pipe_2.predict(X_test_4)

array([0, 0, 0, ..., 3, 1, 0], dtype=int64)

<IPython.core.display.Javascript object>

In [65]:
pipe_1.score(X_test_4, y_test_4)

0.9202733485193622

<IPython.core.display.Javascript object>

In [66]:
pipe_2.score(X_test_4, y_test_4)

0.9212324661311594

<IPython.core.display.Javascript object>

In [68]:
pipe_1.predict(data)

array([0, 0, 0, ..., 3, 0, 2], dtype=int64)

<IPython.core.display.Javascript object>

In [69]:
pipe_2.predict(data)

array([0, 0, 0, ..., 3, 0, 2], dtype=int64)

<IPython.core.display.Javascript object>

In [71]:
pipe_4.fit(X_train_4, y_train_4)  # Gradient Boosting Model

Pipeline(steps=[('data_processing',
                 FunctionTransformer(func=<function myProcessingSteps at 0x000001FA3D50AC10>)),
                ('GBC',
                 GradientBoostingClassifier(learning_rate=0.2, max_depth=4,
                                            random_state=1, subsample=0.8))])

<IPython.core.display.Javascript object>

In [72]:
pipe_4.predict(X_test_4)

array([0, 0, 0, ..., 3, 1, 0], dtype=int64)

<IPython.core.display.Javascript object>

In [73]:
pipe_4.score(X_test_4, y_test_4)

0.9233904807577029

<IPython.core.display.Javascript object>

In [74]:
pipe_4.predict(data)

array([0, 0, 0, ..., 3, 0, 2], dtype=int64)

<IPython.core.display.Javascript object>

In [None]:
# The best model so far is the GradientBoostingClassifier. So, that will be the model for production in this case

In [75]:
import pickle

<IPython.core.display.Javascript object>

In [76]:
pickle.dump(pipe_4, open("nlp_diff_gbc.pkl", "wb"))

<IPython.core.display.Javascript object>

In [77]:
model_nlp_gbc_pk = pickle.load(open("nlp_diff_gbc.pkl", "rb"))

<IPython.core.display.Javascript object>

In [78]:
model_nlp_gbc_pk.predict(data)

array([0, 0, 0, ..., 3, 0, 2], dtype=int64)

<IPython.core.display.Javascript object>