# Ecommerce Text Classification Using FastText

![Ecommerce Text Classification Using FastText](cover__photo.png)

## Importing Libraries

In [40]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import fasttext
import spacy

import warnings
warnings.filterwarnings("ignore")

## Loading Dataset

In [41]:
df = pd.read_csv("ecommerce_dataset.csv", names=["category", "description"], header=None)
df.head()

| | category | description |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | Household | Incredible Gifts India Wooden Happy Birthday U... |


## Data Preprocessing

In [42]:
print("Shape of the dataframe:", df.shape)

Shape of the dataframe: (50425, 2)


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50425 entries, 0 to 50424
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   category     50425 non-null  object
 1   description  50424 non-null  object
dtypes: object(2)
memory usage: 788.0+ KB


In [44]:
null_values = df.isna().sum()
print("Null values:\n\n", null_values)

Null values:

 category       0
description    1
dtype: int64


In [45]:
df.dropna(inplace=True)

In [46]:
df.isna().sum()

category       0
description    0
dtype: int64

In [47]:
df['description'].duplicated().sum()

22622

In [48]:
duplicates = df[df.duplicated()]
duplicates.head()

| | category | description |
|---|---|---|
| 7 | Household | Pitaara Box Romantic Venice Canvas Painting 6m... |
| 11 | Household | Paper Plane Design Starry Night Vangoh Wall Ar... |
| 12 | Household | Pitaara Box Romantic Venice Canvas Painting 6m... |
| 16 | Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 20 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |


In [49]:
df = df.drop_duplicates()

In [50]:
print("Shape of the dataframe:", df.shape)
print("Duplicates:", df.duplicated().sum())

Shape of the dataframe: (27802, 2)
Duplicates: 0


In [51]:
df['category'].value_counts()

category
Household                 10564
Books                      6256
Clothing & Accessories     5674
Electronics                5308
Name: count, dtype: int64

In [52]:
df['category'].unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [53]:
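# Assign the result back instead of calling replace(inplace=True) on a column accessor,
# which may operate on a temporary copy in recent pandas versions.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_and_Accessories")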

In [54]:
df.category.unique()

array(['Household', 'Books', 'Clothing_and_Accessories', 'Electronics'],
      dtype=object)
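
The rename above matters because fastText reads each training line as whitespace-separated tokens and treats a token beginning with `__label__` as a label, so a category containing spaces would be cut off at the first space. A minimal sketch of a more general normalizer follows. `to_fasttext_label` is a hypothetical helper, not part of fastText or of this notebook, and its replacement rules are assumptions chosen to reproduce the rename used here.

```python
import re

def to_fasttext_label(category, prefix="__label__"):
    """Turn a free-form category name into a single whitespace-free label token."""
    # Spell out "&" so the label stays readable, then collapse whitespace runs into "_".
    name = category.replace("&", "and")
    name = re.sub(r"\s+", "_", name.strip())
    return prefix + name

for category in ["Household", "Books", "Clothing & Accessories", "Electronics"]:
    print(to_fasttext_label(category))
```

The helper also adds the `__label__` prefix, so it would combine the rename and the prefixing step into one call if applied to the `category` column.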

In [55]:
print(df["description"][1])
print("-------------------")
print(len(df["description"][1]))

SAF 'Floral' Framed Painting (Wood, 30 inch x 10 inch, Special Effect UV Print Textured, SAO297) Painting made up in synthetic frame with UV textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch (A perfect gift for your special ones).
-------------------
346


In [57]:
# Load English model
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    doc = nlp(text)
    # Keep only non-stop, non-punctuation, non-space tokens and lemmatize
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and not token.is_space]
    return " ".join(tokens)

In [None]:
df["description"] = df["description"].apply(preprocess_text)
df.head()

| | category | description |
|---|---|---|
| 0 | Household | paper plane design framed wall hanging motivat... |
| 1 | Household | saf floral framed painting wood 30 inch x 10 i... |
| 2 | Household | saf uv textured modern art print framed painti... |
| 3 | Household | saf flower print framed painting synthetic 13.... |
| 4 | Household | incredible gifts india wooden happy birthday u... |


In [60]:
print(df["description"][1])
print("-------------------")
print(len(df["description"][1]))

saf floral framed painting wood 30 inch x 10 inch special effect uv print textured sao297 painting synthetic frame uv texture print give multi effect attract special series painting make wall beautiful give royal touch perfect gift special one
-------------------
243


In [61]:
df["category"] = "__label__" + df["category"].astype(str)
df.head()

| | category | description |
|---|---|---|
| 0 | __label__Household | paper plane design framed wall hanging motivat... |
| 1 | __label__Household | saf floral framed painting wood 30 inch x 10 i... |
| 2 | __label__Household | saf uv textured modern art print framed painti... |
| 3 | __label__Household | saf flower print framed painting synthetic 13.... |
| 4 | __label__Household | incredible gifts india wooden happy birthday u... |


In [62]:
df["category_description"] = df["category"] + " " + df["description"]
df.head()

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | paper plane design framed wall hanging motivat... | __label__Household paper plane design framed w... |
| 1 | __label__Household | saf floral framed painting wood 30 inch x 10 i... | __label__Household saf floral framed painting ... |
| 2 | __label__Household | saf uv textured modern art print framed painti... | __label__Household saf uv textured modern art ... |
| 3 | __label__Household | saf flower print framed painting synthetic 13.... | __label__Household saf flower print framed pai... |
| 4 | __label__Household | incredible gifts india wooden happy birthday u... | __label__Household incredible gifts india wood... |


## Train Test Split

In [63]:
train, test = train_test_split(df.category_description, test_size=0.2, random_state=42)

In [64]:
train.shape, test.shape

((22241,), (5561,))

In [65]:
train.to_csv("train.txt", index=False, header=False)
test.to_csv("test.txt", index=False, header=False)
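
One caveat with `to_csv` here: pandas applies CSV quoting by default (`csv.QUOTE_MINIMAL`), so a row containing a comma or a double quote is wrapped in quotes and no longer starts with `__label__`. The sketch below uses only the standard `csv` module with the same quoting rule to show the effect, plus a hypothetical `write_fasttext_file` helper that writes one example per line without any CSV layer. The sample rows are invented for illustration.

```python
import csv
import io
import os
import tempfile

# Invented rows in the "__label__<category> <text>" layout used above.
rows = [
    "__label__Electronics usb cable 1,5 m",
    '__label__Books say "hello" world',
    "__label__Household wooden photo frame",
]

# QUOTE_MINIMAL quotes any field containing the delimiter or the quote character.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
for row in rows:
    writer.writerow([row])
csv_lines = buffer.getvalue().splitlines()

def write_fasttext_file(lines, path):
    """Write one example per line, collapsing internal whitespace, with no quoting."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(" ".join(str(line).split()) + "\n")

path = os.path.join(tempfile.mkdtemp(), "train_sample.txt")
write_fasttext_file(rows, path)
with open(path, encoding="utf-8") as handle:
    plain_lines = handle.read().splitlines()

# Which lines still begin with the label prefix after each writing method?
print([line.startswith("__label__") for line in csv_lines])
print([line.startswith("__label__") for line in plain_lines])
```

With a pandas Series, `write_fasttext_file(train, "train.txt")` would be a possible alternative to `to_csv`. Writing the files this way could change the number of evaluated examples and the reported scores.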

In [66]:
model = fasttext.train_supervised(input="train.txt")
model.test("test.txt")

(5219, 0.9557386472504311, 0.9557386472504311)

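`model.test` returns a tuple `(N, precision, recall)`. The first value (5219) is the number of test examples fastText evaluated. It is lower than the 5561 rows in the test split, which suggests that some lines in `test.txt` were not read with a valid label (for example, rows that `to_csv` wrapped in quotes). The second and third values are precision and recall at k=1; they coincide here because each example has one label and the model returns one prediction. Both come out at about 95.6%, which is a strong result for default hyperparameters.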
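
To make the two scores concrete: precision at k divides the number of correct predictions by the number of predictions made, while recall at k divides it by the number of true labels. The sketch below is a plain-Python illustration of those definitions, not fastText's own code. `precision_recall_at_k` and the toy data are hypothetical.

```python
def precision_recall_at_k(true_labels, predicted_labels):
    """Compute micro-averaged precision and recall.

    true_labels: list of sets of gold labels, one set per example.
    predicted_labels: list of lists holding the top-k predictions per example.
    """
    correct = sum(len(gold & set(pred)) for gold, pred in zip(true_labels, predicted_labels))
    n_predicted = sum(len(pred) for pred in predicted_labels)
    n_gold = sum(len(gold) for gold in true_labels)
    return correct / n_predicted, correct / n_gold

gold = [{"Books"}, {"Household"}, {"Electronics"}, {"Household"}]

# One prediction per example (k=1): 3 of 4 are correct.
top1 = [["Books"], ["Household"], ["Household"], ["Household"]]
precision, recall = precision_recall_at_k(gold, top1)
print(precision, recall)  # 0.75 0.75

# Two predictions per example (k=2): every gold label is found, but half the guesses are wrong.
top2 = [["Books", "Electronics"], ["Household", "Books"],
        ["Household", "Electronics"], ["Household", "Books"]]
print(precision_recall_at_k(gold, top2))  # (0.5, 1.0)
```

The k=2 case shows how the two numbers diverge once the model is allowed more guesses than there are true labels.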

## Prediction

In [67]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__Electronics',), array([0.99931395]))

In [68]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__Clothing_and_Accessories',), array([0.99969697]))

In [69]:
model.predict("think and grow rich deluxe edition")

(('__label__Books',), array([1.00000274]))
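
`model.predict` returns a tuple of label strings together with an array of scores. As the last output shows, a score can land marginally above 1 (1.00000274), which looks like a numerical artifact rather than a meaningful probability. A small helper can strip the `__label__` prefix and clamp the score into [0, 1]. `tidy_prediction` is a hypothetical name, and the inputs below are copied from the outputs above as plain tuples so the sketch runs without a trained model.

```python
def tidy_prediction(prediction, prefix="__label__"):
    """Convert a (labels, scores) pair into a list of (category, probability) tuples."""
    labels, scores = prediction
    tidy = []
    for label, score in zip(labels, scores):
        category = label[len(prefix):] if label.startswith(prefix) else label
        probability = min(max(float(score), 0.0), 1.0)  # clamp into [0, 1]
        tidy.append((category, probability))
    return tidy

raw = (("__label__Books",), [1.00000274])
print(tidy_prediction(raw))  # [('Books', 1.0)]
```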

In [70]:
model.get_nearest_neighbors("painting")

[(0.9964516162872314, 'diffusers'),
 (0.9964451789855957, 'products-'),
 (0.9964441657066345, '.easy'),
 (0.9964431524276733, 'name:8x6'),
 (0.9964431524276733, 'tarps'),
 (0.9964431524276733, 'grommets'),
 (0.9964431524276733, 'weatherspread'),
 (0.9964399337768555, 'orthospre'),
 (0.9964399337768555, 'catnap'),
 (0.9964399337768555, 'bonnel')]

In [71]:
model.get_nearest_neighbors("sony")

[(0.9985525012016296, 'scrutinize'),
 (0.9985525012016296, 'bt50b'),
 (0.9985525012016296, 'clipping'),
 (0.9985525012016296, 'connectively'),
 (0.998546302318573, 'imager'),
 (0.998546302318573, '2d/1d'),
 (0.998546302318573, 'ds2208'),
 (0.9985334277153015, 'profiles'),
 (0.9985076785087585, '4.2.0'),
 (0.9985076785087585, 'x70fx')]

In [72]:
model.get_nearest_neighbors("banglore")

[(0.0, 'product'),
 (0.0, 'design'),
 (0.0, 'set'),
 (0.0, 'x'),
 (0.0, 'use'),
 (0.0, '1'),
 (0.0, 'high'),
 (0.0, '128gb(max'),
 (0.0, 'iflashdrive'),
 (0.0, 'sd+tf')]
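
The all-zero scores for "banglore" are what cosine similarity produces when the query vector is all zeros. That would happen, for instance, if the word never appeared in the training data and the model has no character n-grams to fall back on (an assumption about this model's settings, not something the notebook confirms). The sketch below illustrates the arithmetic in plain Python. `cosine_similarity` is a hypothetical helper with a zero-norm guard, and the vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

known_word = [0.2, -0.1, 0.4]
unseen_word = [0.0, 0.0, 0.0]  # stand-in for an out-of-vocabulary query

print(cosine_similarity(unseen_word, known_word))  # 0.0
print(cosine_similarity(known_word, known_word))   # close to 1.0
```

If the installed fastText binding exposes the vocabulary (for example through `model.words`), checking membership before asking for neighbors would catch this case.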