# NLP Tutorial: Text Classification Using FastText
Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification
We have a dataset of e-commerce item descriptions spanning 4 categories:

1. Household
2. Electronics
3. Clothing and Accessories
4. Books

The task is to classify a product into one of these 4 categories based on its description.

In [2]:
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv", names=["category", "description"], header=None)
print(df.shape)
df.head(3)

(50425, 2)


Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...


In [3]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [4]:
df.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [5]:
# Assign the result back instead of calling replace(..., inplace=True) on the column;
# pandas warns that the chained inplace form stops working in pandas 3.0.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")


In [6]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

When you train a fastText model, it expects every label to carry a label prefix (`__label__` by default). We first add that prefix to the `category` column, then create a third column in the dataframe that holds the label followed by the product description.

In [7]:
df['category'] = '__label__' + df['category'].astype(str)
df.head(5)

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...


In [8]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head(3)

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...
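
As a minimal sketch of the line format built above, the helper `to_fasttext_line` joins a prefixed label and a description into one training line. The helper is hypothetical (it is not part of fastText or of this notebook). It assumes the default `__label__` prefix and replaces whitespace inside a label with underscores, since a label has to be a single token.

```python
import re

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def to_fasttext_line(label: str, text: str) -> str:
    """Build one supervised-training line: '<prefix><label> <text>'.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    # A label must be a single whitespace-free token, so collapse inner spaces.
    safe_label = re.sub(r"\s+", "_", label.strip())
    return f"{LABEL_PREFIX}{safe_label} {text.strip()}"


line = to_fasttext_line("Clothing_Accessories", "Ishin Women's Dress Material")
print(line)
```

This is the same reason "Clothing & Accessories" was renamed to "Clothing_Accessories" earlier: with spaces in the label, only the first word would be read as the label.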


### Pre-processing

1. Replace punctuation (except apostrophes) with spaces
2. Collapse extra spaces
3. Convert the entire sentence to lower case

In [9]:
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"viki's bookcase bookshelf 3 shelf shelve white hi"

In [10]:
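def preprocess(text):
    """Strip punctuation (keeping apostrophes), collapse spaces, and lowercase."""
    # Replace every character that is not a word character, whitespace, or apostrophe
    text = re.sub(r'[^\w\s\']',' ', text)
    # Collapse runs of spaces into a single space
    text = re.sub(' +', ' ', text)
    return text.strip().lower()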
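
One caveat: `' +'` collapses only ordinary spaces. Python's `\s` also counts non-breaking spaces (`\xa0`), tabs and newlines as whitespace, so the first substitution leaves them in place and they survive this function. That is consistent with tokens such as `'20000hz\xa0'` in the nearest-neighbour output further down. A possible variant, named `preprocess_ws` here purely for illustration, collapses every whitespace run instead. Using it would change the training files, so results would differ from the outputs shown in this notebook.

```python
import re


def preprocess_ws(text: str) -> str:
    """Variant of preprocess that also normalizes non-breaking spaces, tabs and newlines."""
    # Replace everything except word characters, whitespace and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse any run of Unicode whitespace (including '\xa0') into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


print(preprocess_ws("20000Hz\xa0 | 5V\xa0\n(USB)"))
```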

In [11]:
df['category_description'] = df['category_description'].map(preprocess)
df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__household incredible gifts india wood...


### Train Test Split

In [12]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [13]:
train

Unnamed: 0,category,description,category_description
24779,__label__Books,"Touching the Void (New Windmills KS4) Review ""...",__label__books touching the void new windmills...
30917,__label__Books,Handbook of Mathematics About the Author An ed...,__label__books handbook of mathematics about t...
23586,__label__Books,Action Shoes Men's Hawaii Thong Sandals,__label__books action shoes men's hawaii thong...
8880,__label__Household,"Philips Hue 9.5W E27 Bulb (White Ambiance), Co...",__label__household philips hue 9 5w e27 bulb w...
5170,__label__Household,Decals Design 'Vine Flower' Wall Sticker (PVC ...,__label__household decals design 'vine flower'...
...,...,...,...
47526,__label__Electronics,Fluval U4 Underwater Filter Style Name:34 to 6...,__label__electronics fluval u4 underwater filt...
15934,__label__Household,"Voltas 2 Ton 3 Star Split AC (Alloy, 243 CZO1,...",__label__household voltas 2 ton 3 star split a...
36037,__label__Clothing_Accessories,Ishin Women's Dress Material (Ddrvmns-2386_Nav...,__label__clothing_accessories ishin women's dr...
11648,__label__Household,Generic Stainless Steel Salad Tongs Serving Sp...,__label__household generic stainless steel sal...


In [14]:
train.shape, test.shape

((40339, 3), (10085, 3))
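
Because `train_test_split` is called without `random_state`, each run produces a different split, so the rows and scores shown here will not reproduce exactly. `train_test_split` accepts `random_state` for a fixed shuffle and `stratify` to preserve class proportions. The sketch below illustrates the idea in plain Python with a hypothetical `stratified_split` helper and invented rows. It illustrates the concept only and is not what scikit-learn does internally.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_size=0.2, seed=42):
    """Split (label, text) pairs so each label keeps roughly the same test share.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed -> same split on every run
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))

    train, test = [], []
    for label in sorted(by_label):  # sorted for a deterministic iteration order
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Invented toy rows, not taken from the dataset.
rows = [("Books", f"book {i}") for i in range(10)] + [
    ("Electronics", f"gadget {i}") for i in range(5)
]
train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))
```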

In [15]:
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)

### Train the model and evaluate performance

In [None]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-3.0.1-py3-none-any.whl.metadata (10.0 kB)
Using cached pybind11-3.0.1-py3-none-any.whl (293 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp312-cp312-linux_x86_64.whl size=4498200 sha256=34310404f75383264da9d611a4a3b95d1724f4d869571a17c7f451a7245eaf4e
  Stored in directory: /root/.cache/pip/wheels/20/27/95/a7baf1b435f1cbde017cabd

In [25]:
!pip install numpy==1.26.4

Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jaxlib 0.7.2 requires numpy>=2.0, but you have n

In [16]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(10085, 0.9687654933068914, 0.9687654933068914)
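
`model.test` returns a tuple of the number of examples, precision at k and recall at k, with k defaulting to 1. With exactly one true label per example and one predicted label, the two metrics coincide, which is why both values above are identical. A minimal sketch of that computation follows. The function name and the labels are invented for illustration.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Return (n, precision@1, recall@1) for single-label examples."""
    n = len(true_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / n  # correct predictions / number of predictions made
    recall = correct / n     # correct predictions / number of true labels
    return n, precision, recall


# Invented toy data, not taken from the dataset.
y_true = ["books", "household", "electronics", "books"]
y_pred = ["books", "household", "household", "books"]
print(precision_recall_at_1(y_true, y_pred))
```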

In [17]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

(('__label__electronics',), array([0.99526215]))

In [18]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__clothing_accessories',), array([1.00001001]))

In [19]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([1.00000989]))
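
`model.predict` returns a tuple of labels and an array of probabilities, and the reported probability can land slightly above 1 (as in `1.00001001` above). The hypothetical helper `parse_prediction` below is one way to turn that raw output into a readable category and a probability clamped to [0, 1]. It assumes the default `__label__` prefix.

```python
LABEL_PREFIX = "__label__"


def parse_prediction(labels, probabilities):
    """Return (category, probability) for the top prediction.

    Hypothetical helper; expects the (labels, probabilities) pair that
    model.predict returns for a single string.
    """
    category = labels[0]
    if category.startswith(LABEL_PREFIX):
        category = category[len(LABEL_PREFIX):]
    # Clamp to [0, 1] in case the reported value drifts slightly past 1.
    probability = min(max(float(probabilities[0]), 0.0), 1.0)
    return category, probability


print(parse_prediction(("__label__clothing_accessories",), [1.00001001]))
```

Since the model was trained on pre-processed text, raw descriptions should go through the same `preprocess` function before being passed to `model.predict`.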

In [20]:
model.get_nearest_neighbors("painting")

[(0.9986386895179749, 'lingers'),
 (0.9986371397972107, 'thier'),
 (0.9986371397972107, '462'),
 (0.9986364841461182, 'percolator'),
 (0.9986354112625122, 'perlite'),
 (0.9986319541931152, 'adulterated'),
 (0.9986296892166138, 'designthe'),
 (0.9986203908920288, 'cutoget'),
 (0.9986203908920288, 'showsnotice'),
 (0.9986203908920288, 'x\xa0\xa0exhaust')]

In [23]:
model.get_nearest_neighbors("sony")

[(0.9994041919708252, '18mw'),
 (0.9994041919708252, '20000hz\xa0'),
 (0.9994041919708252, '5v\xa0'),
 (0.9994041919708252, '80db\xa0'),
 (0.9994041919708252, 'hbs730'),
 (0.9994041919708252, '95\xa0'),
 (0.9994041919708252, '35ma\xa0'),
 (0.9994041919708252, '480ghz\xa0'),
 (0.9994041919708252, '2x20mw\xa0'),
 (0.9994041919708252, '10m\xa0')]

In [22]:
model.get_nearest_neighbors("banglore")

[(0.0, 'to'),
 (0.0, 'and'),
 (0.0, 'a'),
 (0.0, 'with'),
 (0.0, 'for'),
 (0.0, 'is'),
 (0.0, '</s>'),
 (0.0, 'orzeck'),
 (0.0, 'panky'),
 (0.0, 'becca')]
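
`get_nearest_neighbors` ranks vocabulary words by cosine similarity to the query word's vector. A score of 0.0 for every neighbour, as with "banglore", is what one would expect if the query maps to an all-zero vector, for instance a word that never appeared in the training data when no subword n-grams are in use. That explanation is an assumption about this run, not something the output confirms. The sketch below shows a cosine similarity with a zero-norm guard, using made-up vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # guard: a zero vector has no direction
    return dot / (norm_u * norm_v)


known_word = [0.3, -0.1, 0.8]   # made-up vector for an in-vocabulary word
unknown_word = [0.0, 0.0, 0.0]  # made-up all-zero vector for an unseen word

print(cosine_similarity(known_word, known_word))
print(cosine_similarity(known_word, unknown_word))
```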