# Project Notebook: Text Identification with FastText
FastText is a free, open-source, lightweight library for learning text representations and training text classifiers efficiently. It was developed by Facebook's AI Research (FAIR) lab and is known for its speed and its ability to handle large datasets.

FastText supports both supervised and unsupervised learning. In supervised mode (`train_supervised`), a classifier is trained on labeled data to make predictions on new, unseen text. In unsupervised mode (`train_unsupervised`), no labels are needed: the model learns word embeddings from raw text using the skip-gram or CBOW objective.

The unsupervised approach typically involves training FastText on a large corpus to learn word representations and then using those embeddings for downstream tasks or analysis.

FastText also distributes pre-trained word vectors trained on Common Crawl and Wikipedia. See https://fasttext.cc/docs/en/crawl-vectors.html for details.
 


## Problem Statement

In this notebook, we carry out two distinct projects with FastText for text identification, one unsupervised and one supervised. The first uses unsupervised learning to capture semantic relationships among words; the second uses supervised learning to predict product categories.

## Dataset

We will be working with the "E-commerce Text Classification" dataset sourced from Kaggle [here](https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification/data). This dataset contains product descriptions and corresponding categories, serving as the foundation for our exploration.

## Projects Overview

### Unsupervised Learning:
Our first project involves training a FastText model in an unsupervised manner to capture word similarities within our dataset. The goal is to generate word embeddings that encode semantic relationships, enabling us to measure the similarity between different words.

### Supervised Learning:
The second project focuses on supervised learning, where we train a FastText model to predict product categories based on their descriptions. We preprocess the data, split it into training and testing sets, and evaluate the model's performance using relevant metrics.

## Steps:

1. **Data Exploration:**
   - Explore the structure and characteristics of the provided dataset.

2. **Unsupervised Learning:**
   - Train a FastText model in an unsupervised manner to capture word similarities.

3. **Supervised Learning:**
   - Prepare the data for supervised learning by preprocessing and splitting into training and testing sets.
   - Train a FastText model for product category prediction.

4. **Model Evaluation:**
   - Assess the performance of the supervised learning model using relevant metrics.

5. **Conclusion and Further Exploration:**
   - Summarize the findings and propose potential avenues for further exploration or improvement.

Through these projects, we aim to demonstrate the versatility of FastText in both unsupervised and supervised contexts for effective text identification and classification.


In [30]:
#pip install fasttext
#pip install fasttext-wheel


# Unsupervised Model Training 
In unsupervised learning with FastText, the model is trained on unlabeled data to learn meaningful representations of words or phrases. FastText employs techniques such as skip-gram to capture semantic relationships and similarities between words. This unsupervised approach is valuable for tasks like word embedding generation, semantic similarity analysis, and clustering, where the model extracts patterns and structures from data without relying on predefined labels.
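
Beyond the skip-gram objective, FastText represents each word as a bag of character n-grams, with `<` and `>` marking word boundaries. The sketch below is a simplified, pure-Python illustration of that idea, not the library's implementation. The function name `char_ngrams` is hypothetical, and the defaults `minn=3` and `maxn=6` are chosen to mirror the library's defaults for unsupervised training.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# A word and a common misspelling share several n-grams, which is one reason
# their vectors can end up close to each other.
shared = set(char_ngrams("table")) & set(char_ngrams("tabel"))
print(sorted(shared))
```

Because a word vector is built from its n-gram vectors, the library can also produce vectors for words it never saw during training.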


In this section we first load a pre-trained FastText model and inspect its word neighbors. We then train a separate unsupervised model from scratch on our e-commerce dataset, so that its embeddings reflect the vocabulary of our data, and compare the two.

### Loading the FastText model

In [1]:
import fasttext

In [2]:
model_en = fasttext.load_model('G:\\2024\\NLP\\cc.en.300.bin\\cc.en.300.bin')  # pre-trained English vectors (Common Crawl + Wikipedia)

To find words that are semantically related to a specific term, such as 'table', we use the word vectors learned by FastText. An unsupervised FastText model trained on a text corpus assigns each word a vector, and these vectors encode semantic information, so the similarity between two words can be quantified.

With a model loaded, the `get_nearest_neighbors` method returns the words closest to the target term 'table' in the embedding space. The output is a list of (similarity score, word) pairs, ordered from most to least similar.

In [60]:
model_en.get_nearest_neighbors('table')  # returns (similarity, word) pairs, 10 by default

[(0.8093202114105225, 'tables'),
 (0.7205384969711304, 'table.This'),
 (0.7094437479972839, 'table.The'),
 (0.7015632390975952, 'table.'),
 (0.6971700191497803, 'table.But'),
 (0.6922351717948914, 'table.So'),
 (0.6912224292755127, 'table-'),
 (0.6758333444595337, 'table.I'),
 (0.6712210774421692, 'table.When'),
 (0.6669521927833557, 'tabel')]


The `get_nearest_neighbors` method ranks words by the cosine similarity between their learned vectors and the vector of the query term. Note that most neighbors returned by the pre-trained model are tokenization variants of 'table' (such as 'table.This') or misspellings ('tabel'), which reflects the web-crawled training data and FastText's use of subword information rather than domain knowledge. Next, we train a new model on our own dataset and check the neighbors again.
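
To make the cosine similarity metric concrete, here is a minimal pure-Python sketch. The three-dimensional vectors are toy values invented for illustration (the loaded model uses 300 dimensions), and `cosine_similarity` is a hypothetical helper, not part of the FastText API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
toy_vectors = {
    "table": [0.9, 0.1, 0.3],
    "tables": [0.8, 0.2, 0.3],
    "novel": [0.1, 0.9, 0.2],
}

query = toy_vectors["table"]
ranked = sorted(
    ((cosine_similarity(query, vec), word)
     for word, vec in toy_vectors.items() if word != "table"),
    reverse=True,
)
print(ranked)  # (similarity, word) pairs, most similar first
```

With a loaded model, the same helper could be applied to vectors returned by `get_word_vector`, for example to compare two specific words directly.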


In [4]:
#dir(model_en)

In [8]:
#help(model_en.get_analogies)

Help on method get_analogies in module fasttext.FastText:

get_analogies(wordA, wordB, wordC, k=10, on_unicode_error='strict') method of fasttext.FastText._FastText instance



In [5]:
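# get_analogies(wordA, wordB, wordC) ranks words by similarity to vec(wordA) - vec(wordB) + vec(wordC).
# Note: the trailing spaces in "ship " and "sailor " are part of the query strings, so they are not
# matched against the vocabulary as 'ship' and 'sailor'; this likely explains the noisy results below.
# Strip the spaces when re-running.
model_en.get_analogies("ship ","sailor ","drive")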

[(0.5573194622993469, 'drives'),
 (0.5201934576034546, 'dirve'),
 (0.49024200439453125, 'drive.I'),
 (0.48620370030403137, 'drive.It'),
 (0.4767036736011505, 'drive.You'),
 (0.4736213684082031, 'drive.This'),
 (0.47208210825920105, 'drive.As'),
 (0.46843627095222473, 'drive.The'),
 (0.46421313285827637, 'drive.So'),
 (0.46390312910079956, 'drive.And')]
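
The arithmetic behind an analogy query can be sketched without the library. The example below assumes the query vector is `vec(A) - vec(B) + vec(C)` and ranks candidates by cosine similarity; the two-dimensional vectors are toy values invented for illustration, and both helper functions are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy_query(vec_a, vec_b, vec_c):
    """Return vec_a - vec_b + vec_c, the query vector for an analogy."""
    return [a - b + c for a, b, c in zip(vec_a, vec_b, vec_c)]


# Toy vectors, invented for illustration only.
toy = {
    "berlin": [1.0, 1.0],
    "germany": [1.0, 0.0],
    "france": [0.8, 0.1],
    "paris": [0.8, 1.1],
    "madrid": [0.3, 0.2],
}

query = analogy_query(toy["berlin"], toy["germany"], toy["france"])
inputs = {"berlin", "germany", "france"}
candidates = {word: vec for word, vec in toy.items() if word not in inputs}
best_word = max(candidates, key=lambda word: cosine_similarity(query, candidates[word]))
print(best_word)
```

The input words are excluded from the candidates, since the query vector is usually close to its own ingredients.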

In [18]:
model_en.get_dimension()

300

## Unsupervised Learning: FastText Model Exploration

In this section, we compare a pre-trained FastText model with one trained on e-commerce data. First, we use the `get_nearest_neighbors` method on the pre-trained model to retrieve the words whose vectors are closest to 'table'. We then repeat the analysis with a model trained on our e-commerce dataset to see whether domain-specific word relationships emerge.


In [61]:
nearest_neighbors_model_en = model_en.get_nearest_neighbors('table')

In [62]:
import pandas as pd  # required for the DataFrames used from here on

df_model_en = pd.DataFrame(nearest_neighbors_model_en, columns=['Similarity', 'Word'])
df_model_en

Unnamed: 0,Similarity,Word
0,0.80932,tables
1,0.720538,table.This
2,0.709444,table.The
3,0.701563,table.
4,0.69717,table.But
5,0.692235,table.So
6,0.691222,table-
7,0.675833,table.I
8,0.671221,table.When
9,0.666952,tabel


In [63]:
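# get_label_id looks up classification labels, not vocabulary words. This pre-trained model has no labels,
# so the negative value below is not a valid id. For a word's vocabulary index, use model_en.get_word_id('table').
model_en.get_label_id('table')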

-1998892

### Data Loading

In [64]:
ecommerce= pd.read_csv("ecommerce_dataset.csv",  names=["category", "description"],header=None)
print(ecommerce.shape)
ecommerce.head(3)

(50425, 2)


Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...


In [74]:
ecommerce.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50425 entries, 0 to 50424
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   category     50425 non-null  object
 1   description  50425 non-null  object
dtypes: object(2)
memory usage: 788.0+ KB


In [76]:
ecommerce.dropna(inplace=True)
ecommerce.shape

(50425, 2)

In [65]:
# inspect the first product description
ecommerce.description[0]

'Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and some for eternal blis

### Preprocessing

In [94]:
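import re

def preprocess(text):
    """Lowercase text and replace punctuation (except apostrophes) with spaces."""
    if isinstance(text, str):
        text = re.sub(r'[^\w\s\']', ' ', text)  # keep word characters, whitespace, and apostrophes
        text = re.sub(' +', ' ', text)          # collapse runs of spaces
        return text.strip().lower()
    else:
        return str(text)  # non-string values (e.g. NaN) are converted to their string form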

Now we apply the `preprocess` function to the description column. FastText expects its input data to be a text file in which each line represents a document or a piece of text.
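
Since each document must occupy exactly one line, embedded newlines or tabs inside a description would split it across lines. The `preprocess` function in this notebook collapses only runs of spaces, so the sketch below shows one possible safeguard. The helper `to_single_line`, the sample descriptions, and the output file name are all hypothetical.

```python
import os
import re
import tempfile


def to_single_line(text):
    """Collapse every run of whitespace (including newlines and tabs) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample descriptions; the second contains an embedded newline.
descriptions = ["wooden dining table", "table lamp\nwith led bulb"]

path = os.path.join(tempfile.mkdtemp(), "description_sample.txt")
with open(path, "w", encoding="utf-8") as f:
    for text in descriptions:
        f.write(to_single_line(text) + "\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Writing the file line by line also avoids any CSV quoting that a DataFrame export might add around a field.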

In [69]:
# Apply preprocessing to the description column
ecommerce['description'] = ecommerce['description'] .map(preprocess)

# Save the preprocessed data to a text file
ecommerce.to_csv("description.txt", columns=["description"], header=None, index=False)

# Train a FastText model on the preprocessed data
model1 = fasttext.train_unsupervised("description.txt")

# Get the nearest neighbors for the word "table" in the trained model
nearest_neighbors = model1.get_nearest_neighbors("table")

# Print the nearest neighbors
print(nearest_neighbors)

[(0.7381492257118225, 'tablecloth'), (0.7243741750717163, 'dining'), (0.7216840982437134, 'tablecloths'), (0.7159408926963806, 'tables'), (0.6671674251556396, 'placemats'), (0.6631715297698975, 'bedside'), (0.6451259255409241, 'placemat'), (0.637414813041687, 'nestable'), (0.6306925415992737, 'tiltable'), (0.6303355693817139, 'dinning')]


For detailed information on the parameters used in the `train_unsupervised` function in FastText, please refer to the official documentation [here](https://fasttext.cc/docs/en/unsupervised-tutorial.html).

When tuning the model for specific requirements, consider the following key parameters:

- **`epoch`**: Number of passes over the training data (default 5). More epochs give the model more opportunities to learn from the data, at the cost of longer training.

- **`lr`**: Learning rate. It controls the step size of each update while minimizing the loss, which affects convergence speed and stability.

- **`thread`**: Number of threads used for training. It controls parallelism and therefore training speed.

Adjusting these parameters lets you adapt the FastText model to the characteristics of a specific dataset.
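
As a sketch of how these parameters might be passed, the configuration below spells them out explicitly. The values for `model`, `dim`, `epoch`, `lr`, `minn`, and `maxn` are assumed to match the library's defaults for unsupervised training; `thread=4` is an arbitrary choice for illustration, and `train_on` is a hypothetical helper.

```python
# Explicit hyperparameters for an unsupervised run (see the assumptions above).
unsupervised_params = {
    "model": "skipgram",  # or "cbow"
    "dim": 100,           # size of the word vectors
    "epoch": 5,           # passes over the training file
    "lr": 0.05,           # learning rate
    "minn": 3,            # shortest character n-gram
    "maxn": 6,            # longest character n-gram
    "thread": 4,          # arbitrary value for illustration
}


def train_on(path, params):
    """Train an unsupervised model on `path`; requires the fasttext package."""
    import fasttext  # imported lazily so the configuration can be inspected without it
    return fasttext.train_unsupervised(path, **params)


# Uncomment to train on the file written earlier in the notebook:
# model1 = train_on("description.txt", unsupervised_params)
```

Keeping the parameters in one dictionary makes it easy to record which settings produced which neighbors when comparing runs.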



In [70]:
# getting the similarities after training
nearest_neighbors_model1 = model1.get_nearest_neighbors('table')
df_model1 = pd.DataFrame(nearest_neighbors_model1, columns=['Similarity', 'Word'])
df_model1

Unnamed: 0,Similarity,Word
0,0.738149,tablecloth
1,0.724374,dining
2,0.721684,tablecloths
3,0.715941,tables
4,0.667167,placemats
5,0.663172,bedside
6,0.645126,placemat
7,0.637415,nestable
8,0.630693,tiltable
9,0.630336,dinning


### Comparing
Comparing the neighbors from the pre-trained model with those from the model trained on our data gives the following:

In [75]:
df_comparison = pd.merge(df_model_en, df_model1, on='Word', suffixes=('_model_en(before)', '_model(after)'), how='outer')

# Print or display the comparison DataFrame
print(df_comparison)

    Similarity_model_en(before)         Word  Similarity_model(after)
0                      0.809320       tables                 0.715941
1                      0.720538   table.This                      NaN
2                      0.709444    table.The                      NaN
3                      0.701563       table.                      NaN
4                      0.697170    table.But                      NaN
5                      0.692235     table.So                      NaN
6                      0.691222       table-                      NaN
7                      0.675833      table.I                      NaN
8                      0.671221   table.When                      NaN
9                      0.666952        tabel                      NaN
10                          NaN   tablecloth                 0.738149
11                          NaN       dining                 0.724374
12                          NaN  tablecloths                 0.721684
13                  

### Analysis of Word Similarity Results

The table compares the nearest neighbors of 'table' from the two models. The 'Similarity_model_en(before)' column holds scores from the pre-trained model, whose neighbors are 'tables' plus punctuation variants and a misspelling of 'table'. The 'Similarity_model(after)' column holds scores from the model trained on our preprocessed e-commerce descriptions, which surfaces related product terms such as 'tablecloth', 'dining', and 'placemats'. Only 'tables' appears in both lists. The difference reflects both the cleaner, preprocessed training text and the domain of the corpus; it does not show that the pre-trained model was updated, because the second model was trained from scratch. Tuning the preprocessing and the training parameters may further improve the neighbors.
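
The overlap between the two neighbor lists can be quantified directly. The sketch below uses the words from the two outputs shown earlier in this notebook and computes their Jaccard similarity; the choice of Jaccard as a summary measure is an assumption made for illustration.

```python
# Neighbors of 'table' copied from the two outputs shown in this notebook.
pretrained_neighbors = [
    "tables", "table.This", "table.The", "table.", "table.But",
    "table.So", "table-", "table.I", "table.When", "tabel",
]
trained_neighbors = [
    "tablecloth", "dining", "tablecloths", "tables", "placemats",
    "bedside", "placemat", "nestable", "tiltable", "dinning",
]

shared = set(pretrained_neighbors) & set(trained_neighbors)
union = set(pretrained_neighbors) | set(trained_neighbors)
jaccard = len(shared) / len(union)

print(shared)
print(round(jaccard, 3))
```

A low overlap is expected here, because the two models were trained on very different corpora.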


# Supervised Model Training
In supervised learning with FastText, the model is trained on a labeled dataset, where each text sample is associated with a predefined category or label. FastText learns a text representation from the words in each sample and can optionally use word n-grams and subword (character n-gram) features, which help it capture local word order and morphology. The trained model can then classify new, unseen text samples into the predefined categories, making it suitable for tasks such as text classification, sentiment analysis, and topic categorization.

Here, we use the e-commerce dataset to predict the category to which an item belongs.

In [88]:
ecommerce= pd.read_csv("ecommerce_dataset.csv",  names=["category", "description"],header=None)
print(ecommerce.shape)
ecommerce.head(3)

(50425, 2)


Unnamed: 0,category,description
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...


There are no NaN values in the dataset.

In [89]:
ecommerce.description[0]

'Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and some for eternal blis

In [90]:
# CHECKING THE LABEL COUNT
ecommerce.category.value_counts()

category
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: count, dtype: int64

## Formatting & Preprocessing
FastText's supervised format requires each line of the training file to contain one or more labels followed by the text, with every label marked by a prefix (`__label__` by default). The prefix lets FastText distinguish labels from ordinary words, so the model can learn the association between each label and its description. Labels cannot contain spaces, which is why 'Clothing & Accessories' is renamed to 'Clothing_Accessories' below.
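
As a minimal sketch of this format, the hypothetical helper below builds one training line from a category and a description. Replacing every run of non-word characters in the category with an underscore is an assumption made here so that labels never contain spaces; the sample description is invented.

```python
import re


def to_fasttext_line(category, text, prefix="__label__"):
    """Build one supervised training line: '<prefix><label> <text>'."""
    # Labels must not contain whitespace, so non-word runs become underscores.
    label = re.sub(r"\W+", "_", category.strip()).strip("_")
    # Keep the whole example on a single line.
    single_line_text = " ".join(text.split())
    return f"{prefix}{label} {single_line_text}"


line = to_fasttext_line("Clothing & Accessories", "Men's slim fit\nblazer")
print(line)
```

The same helper could be mapped over the rows of a DataFrame before writing the training and test files.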

In [91]:
ecommerce['category'] = ecommerce['category'].replace("Clothing & Accessories", "Clothing_Accessories")
ecommerce['category'] = '__label__' + ecommerce['category'].astype(str)
ecommerce['category_description'] = ecommerce['category'] + ' ' + ecommerce['description']
ecommerce.head(3)


Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__Household Paper Plane Design Framed W...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__Household SAF 'Floral' Framed Paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__Household SAF 'UV Textured Modern Art...


In [95]:
def preprocess(text):
    """Lowercase text and replace punctuation (except apostrophes) with spaces."""
    if isinstance(text, str):
        text = re.sub(r'[^\w\s\']', ' ', text)  # underscores count as word characters, so __label__ survives
        text = re.sub(' +', ' ', text)          # collapse runs of spaces
        return text.strip().lower()
    else:
        return str(text)  # non-string values are converted to their string form

In [96]:
ecommerce['category_description'] = ecommerce['category_description'].map(preprocess)
ecommerce.head()

Unnamed: 0,category,description,category_description
0,__label__Household,Paper Plane Design Framed Wall Hanging Motivat...,__label__household paper plane design framed w...
1,__label__Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ...",__label__household saf 'floral' framed paintin...
2,__label__Household,SAF 'UV Textured Modern Art Print Framed' Pain...,__label__household saf 'uv textured modern art...
3,__label__Household,"SAF Flower Print Framed Painting (Synthetic, 1...",__label__household saf flower print framed pai...
4,__label__Household,Incredible Gifts India Wooden Happy Birthday U...,__label__household incredible gifts india wood...


## Training the model

In [97]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(ecommerce, test_size=0.2) # 20% test, 80% train

In [98]:
train.shape, test.shape

((40340, 3), (10085, 3))

In [99]:
# saving the train and test data on the local dir
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)

In [100]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(10084, 0.9686632288774296, 0.9686632288774296)

`model.test` returns a tuple of three values: the number of test examples evaluated (10084), followed by precision and recall at k=1. Precision is the fraction of predicted labels that are correct, and recall is the fraction of true labels that are predicted. Because every example here has exactly one label and the model predicts exactly one label per example, the two values coincide and equal the accuracy, about 96.9%.
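
To show why precision and recall coincide in this setting, here is a pure-Python sketch of precision and recall at k. It assumes precision is computed as correct predictions over all predictions made, and recall as correct predictions over all true labels; the function name and the toy labels are hypothetical.

```python
def precision_recall_at_k(gold_labels, ranked_predictions, k=1):
    """Return (precision, recall) when the top-k predictions are kept per example."""
    n_correct = n_predicted = n_gold = 0
    for gold, ranked in zip(gold_labels, ranked_predictions):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)
        n_gold += len(gold)
    return n_correct / n_predicted, n_correct / n_gold


# Toy data, invented for illustration: one true label per example.
gold = [{"books"}, {"household"}, {"electronics"}, {"books"}]
ranked = [
    ["books", "household"],
    ["household", "electronics"],
    ["household", "electronics"],
    ["books", "electronics"],
]

print(precision_recall_at_k(gold, ranked, k=1))
print(precision_recall_at_k(gold, ranked, k=2))
```

With k=1 the two numbers are equal; with k=2 recall rises while precision falls, because each example still has only one true label.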

Overall, the evaluation metrics reflect a robust performance of the FastText model in classifying product categories based on the provided textual data.

In [102]:
household="Modern LED Table Lamp with Adjustable Brightness and Color Temperature. This sleek and stylish lamp is perfect for illuminating your home office or study area. It features adjustable brightness settings and color temperature control, allowing you to create the perfect lighting environment for any task. The modern design and energy-efficient LED technology make it a great addition to your household."
cloth="Fashionable Men's Casual Slim Fit Blazer Jacket. Elevate your style with this trendy blazer that combines sophistication with comfort. Made from high-quality materials, it offers a slim fit and modern design, making it a perfect choice for various occasions. Stay in vogue with this versatile clothing item that complements your wardrobe and enhances your overall look."
electronics="Smart Home Security Camera System with HD Video Monitoring. Ensure the safety of your home with this advanced security camera system. Featuring high-definition video monitoring and smart connectivity, it allows you to keep an eye on your property remotely. Easy installation and user-friendly controls make it an essential electronic gadget for modern homes focused on security and convenience."
book="Bestselling Mystery Novel - Whispers in the Shadows. Dive into the captivating world of suspense and intrigue with this bestselling mystery novel. Written by a renowned author, the book weaves a gripping tale of unexpected twists and turns. Immerse yourself in the thrilling narrative that keeps you on the edge of your seat, making it a must-read for book enthusiasts and mystery lovers alike."



In [104]:
model.predict("Lord of the rings")


(('__label__books',), array([0.99954009]))

In [105]:
model.predict(household)

(('__label__household',), array([0.9987278]))

In [106]:
model.predict(cloth)

(('__label__clothing_accessories',), array([0.85875785]))

In [107]:
model.predict(electronics)

(('__label__household',), array([0.9804548]))

In [108]:
model.predict(book)

(('__label__books',), array([0.9899534]))

# Conclusion

In this project, we explored FastText for both unsupervised and supervised machine learning tasks in the context of text identification. We inspected a pre-trained FastText model built on a general-purpose corpus, and separately trained a new unsupervised model on an e-commerce dataset to learn domain-specific word representations.

### Unsupervised Learning:
We first examined the nearest neighbors of 'table' in the pre-trained FastText model. It returned 'tables' along with punctuation variants and a misspelling of the query word. The model trained on our e-commerce dataset instead returned terms such as 'tablecloth', 'dining', and 'placemats', which reflects the domain-specific language of the product descriptions.

### Supervised Learning:
In the supervised learning phase, we employed FastText to predict product categories based on their descriptions. The model achieved precision and recall of approximately 96.9% on the held-out test set. On the five hand-written examples, it assigned the expected label in four cases; the security camera description was classified as household rather than electronics, which shows that a high-confidence prediction can still be wrong.

## Areas for Improvement:
1. **Hyperparameter Tuning:** Experimenting with different hyperparameter settings during model training can significantly impact performance. Parameters such as epochs, learning rate, and the number of dimensions in word vectors could be fine-tuned for optimal results.

2. **Dataset Size:** Increasing the size of the dataset for supervised learning could potentially improve model generalization. A larger and more diverse dataset may capture a broader range of language nuances and improve the model's ability to handle various product categories.

3. **Model Evaluation:** The aggregate precision and recall reported above hide per-class differences. Per-class precision, recall, and F1 score (for example via `model.test_label`), a confusion matrix, or cross-validation can provide a more comprehensive understanding of the model's overall performance.

4. **Domain-Specific Embeddings:** Consider exploring domain-specific pre-trained embeddings or training embeddings on a larger and more relevant e-commerce dataset. This may further enhance the model's understanding of domain-specific language.
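
One way to look beyond a single aggregate score is per-class precision and recall. The sketch below computes them from (intended label, predicted label) pairs, using the four named example descriptions from this notebook; the intended labels are taken from the variable names, and `per_class_metrics` is a hypothetical helper.

```python
from collections import Counter


def per_class_metrics(pairs):
    """Return {label: (precision, recall)}; a value is None when undefined."""
    true_positive, predicted, actual = Counter(), Counter(), Counter()
    for true_label, predicted_label in pairs:
        actual[true_label] += 1
        predicted[predicted_label] += 1
        if true_label == predicted_label:
            true_positive[true_label] += 1
    metrics = {}
    for label in sorted(set(actual) | set(predicted)):
        precision = true_positive[label] / predicted[label] if predicted[label] else None
        recall = true_positive[label] / actual[label] if actual[label] else None
        metrics[label] = (precision, recall)
    return metrics


# (intended label, predicted label) for the four named examples above.
pairs = [
    ("household", "household"),
    ("clothing_accessories", "clothing_accessories"),
    ("electronics", "household"),
    ("books", "books"),
]

metrics = per_class_metrics(pairs)
for label, (precision, recall) in metrics.items():
    print(label, precision, recall)
```

Four examples are far too few for a reliable estimate; the same function would be more informative when applied to predictions on the full test set.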

Continuing to iterate on these aspects and experimenting with different configurations will likely lead to an even more robust and accurate text identification model.