# Text Classification Using fastText

We have a dataset of e-commerce item descriptions, each belonging to one of 4 categories:

1- Household<br>
2- Electronics<br>
3- Clothing and Accessories<br>
4- Books<br>
The task is to classify a product into one of these 4 categories based on its description.

In [32]:
df = pd.read_csv("/content/Ecommerce_data.csv", names=["description", "category"], header=None, skiprows=1)
df.head()

Unnamed: 0,description,category
0,Urban Ladder Eisner Low Back Study-Office Comp...,Household
1,"Contrast living Wooden Decorative Box,Painted ...",Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,Clothing & Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,Clothing & Accessories


In [33]:
df.shape

(24000, 2)

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24000 entries, 0 to 23999
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   description  24000 non-null  object
 1   category     24000 non-null  object
dtypes: object(2)
memory usage: 375.1+ KB


In [35]:
df.dtypes

Unnamed: 0,0
description,object
category,object


In [36]:
df.isna().sum()  # There are no NaN values in this dataset

Unnamed: 0,0
description,0
category,0


In [37]:
df.category.unique()

array(['Household', 'Electronics', 'Clothing & Accessories', 'Books'],
      dtype=object)

In [38]:
# Replace " & " with "_" so the label is a single token without whitespace.
# Assigning the result back avoids the pandas chained-assignment warning raised by inplace=True.
df["category"] = df["category"].replace("Clothing & Accessories", "Clothing_Accessories")


When you train a supervised fastText model, it expects each training line to start with its label, marked by the `__label__` prefix. We first add that prefix to the category column, then create a third column in the dataframe that holds the label followed by the product description.
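As a rough illustration of the line format described above, here is a small standalone sketch. The helper name `to_fasttext_line` is hypothetical (it is not part of fastText or of this notebook), and collapsing every run of non-word characters in the label into `_` is just one possible way to keep the label a single token.

```python
import re

def to_fasttext_line(label, text, prefix="__label__"):
    """Build one training line: prefixed label, a space, then the text.

    Hypothetical helper for illustration only.
    """
    # Collapse each run of non-word characters into "_" so the label
    # contains no whitespace, e.g. "Clothing & Accessories" -> "Clothing_Accessories".
    safe_label = re.sub(r"\W+", "_", label.strip())
    return f"{prefix}{safe_label} {text}"

line = to_fasttext_line("Clothing & Accessories", "baby socks")
print(line)
```

fastText splits each line on whitespace, so a label that still contained spaces would be read as a shorter label followed by ordinary words.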



In [39]:
df['category'] = "__label__" + df['category'].astype(str)
df.head()

Unnamed: 0,description,category
0,Urban Ladder Eisner Low Back Study-Office Comp...,__label__Household
1,"Contrast living Wooden Decorative Box,Painted ...",__label__Household
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,__label__Electronics
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,__label__Clothing_Accessories
4,Indira Designer Women's Art Mysore Silk Saree ...,__label__Clothing_Accessories


In [40]:
df['category_desciption'] = df['category']+' '+df['description']
df.head()

Unnamed: 0,description,category,category_desciption
0,Urban Ladder Eisner Low Back Study-Office Comp...,__label__Household,__label__Household Urban Ladder Eisner Low Bac...
1,"Contrast living Wooden Decorative Box,Painted ...",__label__Household,__label__Household Contrast living Wooden Deco...
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,__label__Electronics,__label__Electronics IO Crest SY-PCI40010 PCI ...
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,__label__Clothing_Accessories,__label__Clothing_Accessories ISAKAA Baby Sock...
4,Indira Designer Women's Art Mysore Silk Saree ...,__label__Clothing_Accessories,__label__Clothing_Accessories Indira Designer ...


### Pre-processing

1- Remove punctuation (apostrophes are kept)<br>
2- Remove extra spaces<br>
3- Convert the entire sentence to lower case

In [30]:
import re

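def preprocess(text):
  # Replace everything except word characters, whitespace and apostrophes with a space
  text = re.sub(r'[^\w\s\']', ' ', text)
  # Collapse runs of spaces into a single space
  text = re.sub(' +', ' ', text)
  return text.strip().lower()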
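To see what the function does on a concrete input, here is a small check. The function body is copied from the cell above so the snippet runs on its own; the sample strings are made up for illustration.

```python
import re

def preprocess(text):
    # Same logic as the notebook cell above
    text = re.sub(r'[^\w\s\']', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip().lower()

# Punctuation becomes spaces, repeated spaces collapse, case is lowered.
sample = preprocess("Men's Cotton T-Shirt,  80% Cotton!")
print(sample)

# The underscores of the label prefix survive because "_" is a word character.
labelled = preprocess("__label__Books Think & Grow Rich")
print(labelled)

# Caveat: only spaces are collapsed, so tabs and newlines pass through unchanged.
multiline = preprocess("line one\nline two")
print(repr(multiline))
```

If descriptions may contain tabs or line breaks, replacing `' +'` with `r'\s+'` would collapse those as well; whether that is needed depends on the data.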

In [42]:
df['category_desciption'] = df['category_desciption'].map(preprocess)
df.head()

Unnamed: 0,description,category,category_desciption
0,Urban Ladder Eisner Low Back Study-Office Comp...,__label__Household,__label__household urban ladder eisner low bac...
1,"Contrast living Wooden Decorative Box,Painted ...",__label__Household,__label__household contrast living wooden deco...
2,IO Crest SY-PCI40010 PCI RAID Host Controller ...,__label__Electronics,__label__electronics io crest sy pci40010 pci ...
3,ISAKAA Baby Socks from Just Born to 8 Years- P...,__label__Clothing_Accessories,__label__clothing_accessories isakaa baby sock...
4,Indira Designer Women's Art Mysore Silk Saree ...,__label__Clothing_Accessories,__label__clothing_accessories indira designer ...


### Train Test Split

In [43]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [45]:
train.shape, test.shape

((19200, 3), (4800, 3))

In [48]:
# Save Test and Train data
train.to_csv("ecommerce.train", columns=["category_desciption"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_desciption"], index=False, header=False)
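Before training, it can be worth confirming that every saved line really starts with the label prefix. The sketch below uses a hypothetical helper, `label_counts`, and runs on a tiny made-up file rather than the real `ecommerce.train`. One case it would catch: if a description still contained a line break, `to_csv` would likely wrap the field in quotes, and the line would then begin with `"` instead of `__label__`.

```python
import os
import tempfile
from collections import Counter

def label_counts(path, prefix="__label__"):
    """Count labels in a fastText-style file; fail on lines without the prefix.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            # The label is the first space-separated token on the line.
            first_token = line.split(" ", 1)[0].strip()
            if not first_token.startswith(prefix):
                raise ValueError(f"line {line_no} does not start with {prefix!r}")
            counts[first_token] += 1
    return counts

# Demo on a small made-up file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.train")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("__label__books think and grow rich\n")
        fh.write("__label__household wooden decorative box\n")
        fh.write("__label__books deluxe edition\n")
    counts = label_counts(demo_path)

print(counts)
```

The same call on the real training file would also show how balanced the four categories are in the split.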

### Train the model and evaluate performance

In [None]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext


In [59]:
import fasttext

model = fasttext.train_supervised(input="/content/ecommerce.train")
model.test("/content/ecommerce.test")

(4800, 0.9708333333333333, 0.9708333333333333)

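`model.test` returns a tuple. The first value (4800) is the number of test examples. The second and third values are precision and recall at k=1. Because every example has exactly one label and the model predicts exactly one label per example, the two values are equal and both amount to accuracy: about 97% of the test descriptions are classified correctly, which is pretty good.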
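The relationship between these numbers can be sketched in plain Python. This is not fastText's own code, only a minimal re-implementation of precision and recall at k on made-up predictions; the function name `precision_recall_at_k` is an assumption for illustration.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of sets of true labels; predicted: list of ranked label lists."""
    hits = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)    # denominator for precision
        n_gold += len(true_labels)   # denominator for recall
    return hits / n_predicted, hits / n_gold

# Four made-up examples, one true label each, one prediction each (k=1)
gold = [{"books"}, {"household"}, {"electronics"}, {"books"}]
predicted = [["books"], ["household"], ["books"], ["books"]]
precision, recall = precision_recall_at_k(gold, predicted, k=1)
print(precision, recall)

# With one label per example and k=1, both denominators equal the number
# of examples, so the reported score implies this many correct predictions:
implied_correct = round(0.9708333333333333 * 4800)
print(implied_correct)
```

With k greater than 1, or with multi-label data, the two denominators differ and precision and recall no longer coincide.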


### Now let's predict the category for a few product descriptions



In [None]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

In [54]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__clothing_accessories',), array([1.00001001]))

In [55]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([0.99999404]))
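The raw predictions above carry the `__label__` prefix, and the reported probability can land slightly above 1 (as in the `1.00001001` output earlier), presumably a numerical artifact. A small post-processing sketch, using a hypothetical helper `format_prediction` that works on any pair of label and probability sequences shaped like the tuples shown above:

```python
def format_prediction(labels, probs, prefix="__label__"):
    """Strip the label prefix and cap probabilities at 1.0.

    Hypothetical helper for illustration only.
    """
    results = []
    for label, prob in zip(labels, probs):
        name = label[len(prefix):] if label.startswith(prefix) else label
        results.append((name, min(float(prob), 1.0)))
    return results

# Values copied from the outputs above, with a plain list standing in for the array
books = format_prediction(("__label__books",), [0.99999404])
clothing = format_prediction(("__label__clothing_accessories",), [1.00001001])
print(books)
print(clothing)
```

In the notebook this could be called as `format_prediction(*model.predict(text))`, since `predict` returns the labels and probabilities as a pair.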

In [56]:
model.get_nearest_neighbors("painting")

[(0.9961575865745544, 'ruining'),
 (0.9961335062980652, 'shut'),
 (0.9961050152778625, 'saf'),
 (0.9960540533065796, 'ventilator'),
 (0.9960523247718811, '245'),
 (0.9960342645645142, 'households'),
 (0.996019184589386, 'dengue'),
 (0.996019184589386, 'malaria'),
 (0.996019184589386, 'filaria'),
 (0.9960169196128845, 'chores')]

In [None]:
model.get_nearest_neighbors("good")