<a href="https://colab.research.google.com/github/abdullah1234-bit/NLP-/blob/main/fasttext_with_supervised_trained_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 Text Classification Using FastText</h3>

##### Dataset Credits: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

We have a dataset of ecommerce item description. Total 4 categories,
1. Household
2. Electronics
3. Clothing and Accessories
4. Books

The task at hand is to classify a product into one of the above 4 categories based on the product description

In [2]:
import pandas as pd

df= pd.read_csv("/content/ecommerceDataset.csv", names=["category", "description"], header=None)
print(df.shape)
df.head(3)

(50425, 2)


Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...


**Drop NA values**

In [3]:
df.dropna(inplace=True)
df.shape

(50424, 2)

In [4]:
df.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [5]:
df.category.replace("Clothing & Accessories", "Clothing_Accessories", inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df.category.replace("Clothing & Accessories", "Clothing_Accessories", inplace=True)


In [6]:
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

When you train a fasttext model, it expects labels to be specified with __label__ prefix. We will just create a third column in the dataframe that has __label__ as well as the product description

In [7]:
df['category'] = '__label__' + df['category'].astype(str)
df.head(5)

Unnamed: 0,category,description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1..."
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...


In [8]:
df['category_description'] = df['category'] + ' ' + df['description']
df.head(3)

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...


**Pre-procesing**
1. Remove punctuation
2. Remove extra space
3. Make the entire sentence lower case

In [11]:
!pip install spacy
!python -m spacy download en_core_web_lg


Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [12]:
import spacy

# Load SpaCy's large pre-trained model
nlp = spacy.load("en_core_web_lg")

# Example text
text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"

# Preprocess the text with spaCy
def clean_text_with_spacy(text):
    # Parse the text
    doc = nlp(text)

    # Keep only alphanumeric words, spaces, and single quotes
    cleaned_tokens = [token.text for token in doc if token.is_alpha or token.is_space or token.text == "'"]

    # Join tokens and process whitespace
    cleaned_text = " ".join(cleaned_tokens).strip().lower()
    return cleaned_text

# Clean the text
cleaned_text = clean_text_with_spacy(text)
print(cleaned_text)


viki bookcase bookshelf shelf shelve white hi


In [13]:
import spacy

# Load SpaCy's large pre-trained model
nlp = spacy.load("en_core_web_lg")

def preprocess_with_spacy(text):
    # Process the text using SpaCy
    doc = nlp(text)

    # Filter tokens: Keep words, spaces, and single quotes
    cleaned_tokens = [token.text for token in doc if token.is_alpha or token.is_space or token.text == "'"]

    # Join tokens, strip extra spaces, and convert to lowercase
    return " ".join(cleaned_tokens).strip().lower()


In [14]:
# Assuming `preprocess_with_spacy` is already defined as in the previous response

# Apply the SpaCy-based preprocessing to the 'category_description' column
df['category_description'] = df['category_description'].apply(lambda x: preprocess_with_spacy(x))

df.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,paper plane design framed wall hanging motivat...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",saf ' floral ' framed painting wood inch x inc...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,saf ' uv textured modern art print framed ' pa...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",saf flower print framed painting synthetic inc...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,incredible gifts india wooden happy birthday u...


**Train Test Split**

In [15]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [None]:
train.shape, test.shape

((40339, 3), (10085, 3))

In [25]:
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)

**Train the model and evaluate performance**

In [17]:
!pip install fasttext


Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/73.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m73.4/73.4 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... [?25l[?25hdone
  Created wheel for fasttext: filename=fasttext-0.9.3-cp310-cp310-linux_x86_64.whl size=4296183 sha256=b49e7583257ac6efff89d19851f7c98ad928d8c45a4060729279692814b0d91c
  Stored in directory: /root/.cache/pip/wheels/0d/a2/00/81db54d3e6a8199b829d58

In [26]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(0, nan, nan)

First parameter (10084) is test size. Second and third parameters are precision and recall respectively. You can see we are getting around 96% precision which is pretty good

**Now let's do prediction for few product descriptions**

In [22]:
model.predict("wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3")

((), array([], dtype=float64))

In [None]:
model.predict("ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric")

(('__label__clothing_accessories',), array([1.00001001]))

In [None]:
model.predict("think and grow rich deluxe edition")

(('__label__books',), array([1.00000966]))

In [None]:
model.get_nearest_neighbors("painting")

[(0.9961369037628174, 'vacuum'),
 (0.9959887862205505, 'guard'),
 (0.9959416389465332, 'alarm'),
 (0.9958606958389282, 'lint'),
 (0.9955978989601135, 'temperature'),
 (0.995332658290863, 'machine'),
 (0.9952465295791626, 'gloss'),
 (0.9952084422111511, 'door'),
 (0.9951741099357605, 'steam'),
 (0.9947825074195862, 'induction')]

In [None]:
model.get_nearest_neighbors("sony")

[(0.9984563589096069, 'tablets'),
 (0.9980253577232361, 'dvd'),
 (0.9979312419891357, 'binocular'),
 (0.9977967739105225, 'colour'),
 (0.9977490305900574, 'external'),
 (0.9976783394813538, 'player'),
 (0.9972286820411682, 'viewing'),
 (0.9971092939376831, 'photos'),
 (0.9967519640922546, 'binoculars'),
 (0.996209979057312, 'graphics')]

In [None]:
model.get_nearest_neighbors("banglore")

[(0.0, 'to'),
 (0.0, 'and'),
 (0.0, 'a'),
 (0.0, 'with'),
 (0.0, 'for'),
 (0.0, 'is'),
 (0.0, '</s>'),
 (0.0, 'spaulding'),
 (0.0, 'audette'),
 (0.0, 'rheumatologist')]