# Introduction
FastText is a popular open-source, free, lightweight library that allows users to learn text representations and perform text classification tasks efficiently. It was developed by Facebook's AI Research (FAIR) lab. FastText is known for its speed and ability to handle large datasets.<p>

FastText supports both supervised and unsupervised learning. In supervised learning, a model is trained on labeled data to make predictions on new, unseen data. In unsupervised learning, the algorithm has no labels and aims to find patterns or representations within the data.<p>

For unsupervised tasks, FastText learns word embeddings through its `train_unsupervised` function. The usual approach is to train FastText on a large corpus to learn word representations and then use those embeddings for downstream tasks or analysis.<p>

In this notebook, we will explore two small projects with FastText: unsupervised learning of word vectors and supervised text classification. Pre-trained FastText word vectors, trained on Common Crawl and Wikipedia, are available for download. You can learn more at https://fasttext.cc/docs/en/crawl-vectors.html

In [30]:
#pip install fasttext
#pip install fasttext-wheel


# Unsupervised Model Training 
In unsupervised learning with FastText, the model is trained on unlabeled data to learn meaningful representations of words or phrases. FastText employs techniques such as skip-gram to capture semantic relationships and similarities between words. This unsupervised approach is valuable for tasks like word embedding generation, semantic similarity analysis, and clustering, where the model extracts patterns and structures from data without relying on predefined labels.
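To make the skip-gram idea concrete, the sketch below generates (center, context) training pairs from a tokenized sentence. It is a simplified illustration rather than FastText's implementation: the function name `skipgram_pairs` and the fixed `window` size are assumptions made for this example, and the real library adds further steps such as subsampling frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence, chosen only for illustration
tokens = "add cumin seeds and let it sizzle".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:4])
```

A skip-gram model is trained to predict the context word from the center word in each pair, which is why words that appear in similar contexts end up with similar vectors.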


<p>

In this section we first load the pre-trained FastText model, then train a model on a South Asian food recipe dataset so that it better captures regional food names and ingredients.

In [1]:
import fasttext

In [2]:
model_en = fasttext.load_model('G:\\2024\\NLP\\cc.en.300.bin\\cc.en.300.bin') # pre-trained English vectors (Common Crawl + Wikipedia)

Let's say we want to find the words most closely related to a specific word, say 'kheer'. Kheer is a South Asian sweet dish, a kind of rice pudding.

In [19]:
model_en.get_nearest_neighbors('kheer') # kheer is a south asian sweet 

[(0.856397807598114, 'halwa'),
 (0.8488601446151733, 'payasam'),
 (0.8202382326126099, 'Kheer'),
 (0.7866618633270264, 'rabdi'),
 (0.7792357802391052, 'burfi'),
 (0.7742787599563599, 'kesari'),
 (0.7710885405540466, 'khichdi'),
 (0.7663241028785706, 'laddoo'),
 (0.7605330348014832, 'Payasam'),
 (0.7598305344581604, 'phirni')]

The `get_nearest_neighbors` method returns the words whose learned vectors are closest to the vector of a given word, with closeness measured by cosine similarity. Most of the words returned here are names of other South Asian dishes, mainly sweets.
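The cosine similarity behind these scores can be sketched in a few lines. The vectors below are made-up 3-dimensional toy values (the real model uses 300 dimensions), and `cosine_similarity` is a helper written for this example, not a FastText function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only
toy_vectors = {
    "kheer": [0.9, 0.1, 0.0],
    "payasam": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

query = toy_vectors["kheer"]
# Score every other word against the query and sort from most to least similar
ranked = sorted(
    ((cosine_similarity(query, vec), word)
     for word, vec in toy_vectors.items() if word != "kheer"),
    reverse=True,
)
print(ranked)
```

The result has the same `(similarity, word)` shape as the list returned by `get_nearest_neighbors`.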

In [4]:
#dir(model_en)

In [8]:
#help(model_en.get_analogies)

Help on method get_analogies in module fasttext.FastText:

get_analogies(wordA, wordB, wordC, k=10, on_unicode_error='strict') method of fasttext.FastText._FastText instance



In [5]:
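# get_analogies(wordA, wordB, wordC) ranks words by similarity to vec(wordA) - vec(wordB) + vec(wordC).
# Note: the trailing spaces in "ship " and "sailor " are part of the query strings, which may explain the noisy results below.
model_en.get_analogies("ship ","sailor ","drive")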

[(0.5573194622993469, 'drives'),
 (0.5201934576034546, 'dirve'),
 (0.49024200439453125, 'drive.I'),
 (0.48620370030403137, 'drive.It'),
 (0.4767036736011505, 'drive.You'),
 (0.4736213684082031, 'drive.This'),
 (0.47208210825920105, 'drive.As'),
 (0.46843627095222473, 'drive.The'),
 (0.46421313285827637, 'drive.So'),
 (0.46390312910079956, 'drive.And')]
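The vector arithmetic behind an analogy query can be sketched with toy data. This assumes the query is built as A - B + C and ranked by cosine similarity, with the three input words excluded from the results. The 2-dimensional vectors and the `analogy` helper are invented for illustration, and details such as the library's vector normalization are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, word_a, word_b, word_c, k=1):
    """Rank words by similarity to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [a - b + c for a, b, c in
             zip(vectors[word_a], vectors[word_b], vectors[word_c])]
    candidates = [w for w in vectors if w not in {word_a, word_b, word_c}]
    scored = sorted(((cosine(query, vectors[w]), w) for w in candidates), reverse=True)
    return scored[:k]


# Toy vectors: first axis ~ "is a city", second axis ~ "which country"
toy = {
    "berlin": [1.0, 1.0],
    "germany": [0.0, 1.0],
    "france": [0.0, -1.0],
    "paris": [1.0, -1.0],
    "rome": [1.0, 0.0],
}

result = analogy(toy, "berlin", "germany", "france")
print(result)
```

With real vectors, clean input strings (no stray whitespace) matter, because each query word has to match a vocabulary entry exactly to retrieve its learned vector.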

In [18]:
model_en.get_dimension()

300

## Unsupervised Learning
In this section, we train a FastText model on the South Asian food recipe dataset and check whether it finds more domain-specific related words. First, let's examine the pre-trained FastText model's nearest neighbors for 'payesh', which is also a type of sweet dish, by calling <code>.get_nearest_neighbors</code> on it. We will run the same query again after training a model on our dataset.

In [27]:
nearest_neighbors_model_en = model_en.get_nearest_neighbors('payesh')

In [30]:
import pandas as pd

df_model_en = pd.DataFrame(nearest_neighbors_model_en, columns=['Similarity', 'Word'])
df_model_en

| | Similarity | Word |
|---|---|---|
| 0 | 0.782398 | Payesh |
| 1 | 0.74383 | kheer |
| 2 | 0.7379 | payasa |
| 3 | 0.726549 | luchi |
| 4 | 0.724554 | mawa |
| 5 | 0.724183 | rabdi |
| 6 | 0.723011 | payasam |
| 7 | 0.722865 | rasmalai |
| 8 | 0.717786 | bhaat |
| 9 | 0.713453 | halwa |


In [20]:
# get_label_id looks up labels, which only exist in supervised models;
# for a word's vocabulary index, use get_word_id instead
model_en.get_label_id('payesh')

-816025

In [22]:
import pandas as pd

# Load the recipe dataset; later cells refer to this DataFrame as `df`
df = pd.read_csv("Cleaned_Indian_Food_Dataset.csv")
print(df.shape)
df.head(5)

(5938, 9)


| | TranslatedRecipeName | TranslatedIngredients | TotalTimeInMins | Cuisine | TranslatedInstructions | URL | Cleaned-Ingredients | image-url | Ingredient-count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Masala Karela Recipe | 1 tablespoon Red Chilli powder,3 tablespoon Gr... | 45 | Indian | To begin making the Masala Karela Recipe,de-se... | https://www.archanaskitchen.com/masala-karela-... | salt,amchur (dry mango powder),karela (bitter ... | https://www.archanaskitchen.com/images/archana... | 10 |
| 1 | Spicy Tomato Rice (Recipe) | 2 teaspoon cashew - or peanuts, 1/2 Teaspoon ... | 15 | South Indian Recipes | To make tomato puliogere, first cut the tomato... | https://www.archanaskitchen.com/spicy-tomato-r... | tomato,salt,chickpea lentils,green chilli,rice... | https://www.archanaskitchen.com/images/archana... | 12 |
| 2 | Ragi Semiya Upma Recipe - Ragi Millet Vermicel... | 1 Onion - sliced,1 teaspoon White Urad Dal (Sp... | 50 | South Indian Recipes | To begin making the Ragi Vermicelli Recipe, fi... | https://www.archanaskitchen.com/ragi-vermicell... | salt,rice vermicelli noodles (thin),asafoetida... | https://www.archanaskitchen.com/images/archana... | 12 |
| 3 | Gongura Chicken Curry Recipe - Andhra Style Go... | 1/2 teaspoon Turmeric powder (Haldi),1 tablesp... | 45 | Andhra | To begin making Gongura Chicken Curry Recipe f... | https://www.archanaskitchen.com/gongura-chicke... | tomato,salt,ginger,sorrel leaves (gongura),fen... | https://www.archanaskitchen.com/images/archana... | 15 |
| 4 | Andhra Style Alam Pachadi Recipe - Adrak Chutn... | oil - as per use, 1 tablespoon coriander seed... | 30 | Andhra | To make Andhra Style Alam Pachadi, first heat ... | https://www.archanaskitchen.com/andhra-style-a... | tomato,salt,ginger,red chillies,curry,asafoeti... | https://www.archanaskitchen.com/images/archana... | 12 |


In [32]:
# see the instructions of the recipe
df.TranslatedInstructions[0]

'To begin making the Masala Karela Recipe,de-seed the karela and slice.\nDo not remove the skin as the skin has all the nutrients.\nAdd the karela to the pressure cooker with 3 tablespoon of water, salt and turmeric powder and pressure cook for three whistles.\nRelease the pressure immediately and open the lids.\nKeep aside.Heat oil in a heavy bottomed pan or a kadhai.\nAdd cumin seeds and let it sizzle.Once the cumin seeds have sizzled, add onions and saute them till it turns golden brown in color.Add the karela, red chilli powder, amchur powder, coriander powder and besan.\nStir to combine the masalas into the karela.Drizzle a little extra oil on the top and mix again.\nCover the pan and simmer Masala Karela stirring occasionally until everything comes together well.\nTurn off the heat.Transfer Masala Karela into a serving bowl and serve.Serve Masala Karela along with Panchmel Dal and Phulka for a weekday meal with your family.\n'

### Preprocessing

In [23]:
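import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Requires the NLTK tokenizer and 'stopwords' resources (download them once with nltk.download)
STOP_WORDS = set(stopwords.words('english'))  # build the set once, not on every call

def preprocess(text):
    """Strip URLs and punctuation, tokenize, lowercase, and drop English stop words."""
    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove punctuation (deleted outright, so "aside.Heat" becomes "asideHeat")
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Lowercase and remove stop words
    tokens = [token.lower() for token in tokens if token.lower() not in STOP_WORDS]

    return ' '.join(tokens)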

Now we apply the `preprocess` function to the `TranslatedInstructions` column. FastText expects its input as a text file in which each line represents one document or piece of text.
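As an alternative to exporting through a CSV writer, the one-document-per-line format can be produced with plain file I/O. This is a minimal sketch under assumed names: `write_corpus`, the toy `documents`, and the temporary file path are all made up for illustration.

```python
import os
import tempfile

# Toy documents standing in for preprocessed recipe instructions
documents = [
    "begin making masala karela recipe",
    "heat oil heavy bottomed pan\nadd cumin seeds",  # contains an embedded newline
]


def write_corpus(docs, path):
    """Write one document per line, collapsing internal whitespace and newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(" ".join(doc.split()) + "\n")


corpus_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.txt")
write_corpus(documents, corpus_path)

# Read the file back to confirm each document occupies exactly one line
with open(corpus_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Collapsing whitespace before writing guarantees that a newline inside a document cannot split it into two training lines.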

In [24]:
# Apply preprocessing to TranslatedInstructions column
df['TranslatedInstructions'] = df['TranslatedInstructions'].map(preprocess)

# Save the preprocessed data to a text file
df.to_csv("food_recipes.txt", columns=["TranslatedInstructions"], header=None, index=False)

# Train a FastText model on the preprocessed data
model = fasttext.train_unsupervised("food_recipes.txt")

# Get the nearest neighbors for the word "payesh" in the trained model
nearest_neighbors = model.get_nearest_neighbors("payesh")

# Print the nearest neighbors
print(nearest_neighbors)

[(0.878330409526825, 'nolen'), (0.8532389998435974, 'dudh'), (0.8402368426322937, 'gilefirdaus'), (0.8105514645576477, 'mishti'), (0.8076498508453369, 'paal'), (0.8022317290306091, 'pradesh'), (0.7958285212516785, 'paan'), (0.7909544706344604, 'ganesh'), (0.7895087599754333, 'paani'), (0.7852368950843811, 'pradhaman')]


See [https://fasttext.cc/docs/en/unsupervised-tutorial.html](https://fasttext.cc/docs/en/unsupervised-tutorial.html) for details on the parameters of the `train_unsupervised` function. Depending on your needs, you can tune the following parameters:

- `epoch`: Number of passes over the training dataset. The default is 5.
- `lr`: Learning rate.
- `thread`: Number of threads used for training.


In [31]:
nearest_neighbors_model = model.get_nearest_neighbors('payesh')
df_model = pd.DataFrame(nearest_neighbors_model, columns=['Similarity', 'Word'])
df_model

| | Similarity | Word |
|---|---|---|
| 0 | 0.87833 | nolen |
| 1 | 0.853239 | dudh |
| 2 | 0.840237 | gilefirdaus |
| 3 | 0.810551 | mishti |
| 4 | 0.80765 | paal |
| 5 | 0.802232 | pradesh |
| 6 | 0.795829 | paan |
| 7 | 0.790954 | ganesh |
| 8 | 0.789509 | paani |
| 9 | 0.785237 | pradhaman |


### Comparing
Comparing the neighbors from the pre-trained model (before) with those from the model trained on the recipe dataset (after), we get:

In [38]:
df_comparison = pd.merge(df_model_en, df_model, on='Word', suffixes=('_model_en(before)', '_model(after)'), how='outer')

# Print or display the comparison DataFrame
print(df_comparison)

    Similarity_model_en(before)         Word  Similarity_model(after)
0                      0.782398       Payesh                      NaN
1                      0.743830        kheer                      NaN
2                      0.737900       payasa                      NaN
3                      0.726549        luchi                      NaN
4                      0.724554         mawa                      NaN
5                      0.724183        rabdi                      NaN
6                      0.723011      payasam                      NaN
7                      0.722865     rasmalai                      NaN
8                      0.717786        bhaat                      NaN
9                      0.713453        halwa                      NaN
10                          NaN        nolen                 0.878330
11                          NaN         dudh                 0.853239
12                          NaN  gilefirdaus                 0.840237
13                  
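Each row of the outer merge has a value on one side only, because the two neighbor lists share no words. A quick way to confirm this is a set intersection over the words reported in the two tables above (the variable names here are chosen for illustration):

```python
# Neighbors of 'payesh' from the pre-trained model (before)
before = ["Payesh", "kheer", "payasa", "luchi", "mawa",
          "rabdi", "payasam", "rasmalai", "bhaat", "halwa"]

# Neighbors of 'payesh' from the model trained on the recipe dataset (after)
after = ["nolen", "dudh", "gilefirdaus", "mishti", "paal",
         "pradesh", "paan", "ganesh", "paani", "pradhaman"]

# Compare case-insensitively so 'Payesh' and 'payesh' would count as the same word
shared = {w.lower() for w in before} & {w.lower() for w in after}
print(shared)
```

An empty intersection means the merge can only stack the two lists, so every row has a `NaN` in one of the two similarity columns.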

# Supervised Model Training
In supervised learning with FastText, the model is trained on a labeled dataset, where each text sample is associated with a predefined category or label. FastText learns text representations efficiently and can optionally use subword (character n-gram) information, which helps it capture morphological as well as semantic features. The trained model can then classify new, unseen text samples into the predefined categories based on the learned representations, making it suitable for tasks such as text classification, sentiment analysis, and topic categorization.<p>
    
Here, we use an e-commerce dataset to predict the category (label) that an item belongs to.
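Subword information in FastText refers to character n-grams of each word, taken after wrapping the word in `<` and `>` boundary markers. The sketch below shows the idea only: `char_ngrams` is a hypothetical helper, its `minn` and `maxn` arguments borrow the names of the library's parameters, and the default values used here are an assumption for the example. The library's own defaults may differ between training modes, so check the documentation before relying on subwords in a classifier.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of length minn..maxn for '<word>'."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


print(char_ngrams("kheer", minn=3, maxn=3))
```

Because a word's vector can be composed from its n-gram vectors, spelling variants such as 'payesh' and 'payasam' share some pieces, and even unseen words can receive a vector.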

In [39]:
ecommerce= pd.read_csv("ecommerce_dataset.csv",  names=["category", "description"],header=None)
print(ecommerce.shape)
ecommerce.head(3)

(50425, 2)


| | category | description |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |


In [50]:
ecommerce.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50425 entries, 0 to 50424
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   category     50425 non-null  object
 1   description  50424 non-null  object
dtypes: object(2)
memory usage: 788.0+ KB


The `description` column has one missing value (50424 non-null out of 50425 rows), which we drop below.

In [51]:
ecommerce.description[0]

'Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and some for eternal blis

In [52]:
ecommerce.dropna(inplace=True)
ecommerce.shape

(50424, 2)

In [53]:
ecommerce.category.unique()

array(['Household', 'Books', 'Clothing & Accessories', 'Electronics'],
      dtype=object)

In [40]:
# CHECKING THE LABEL COUNT
ecommerce.category.value_counts()

category
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
Name: count, dtype: int64

## Formatting & Preprocessing
In FastText, prefixing each category with `__label__` and placing it before the text is a formatting requirement for supervised text classification. The prefix lets FastText distinguish category labels from the actual text during training. Each line of the training file therefore holds the label followed by its corresponding description, which is how the model learns the association between the two.
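The line format can be sketched with a small helper. `to_fasttext_line` is a hypothetical function written for this example. It assumes a label must be a single whitespace-free token, which is why spaces and `&` in a category name are replaced with underscores.

```python
def to_fasttext_line(label, text):
    """Format one training example as '__label__<label> <text>' on a single line."""
    # Turn 'Clothing & Accessories' into 'Clothing_Accessories'
    safe_label = "_".join(label.replace("&", " ").split())
    # Collapse whitespace so the example cannot span multiple lines
    clean_text = " ".join(text.split())
    return f"__label__{safe_label} {clean_text}"


line = to_fasttext_line("Clothing & Accessories", "Men's cotton  shirt\nblue")
print(line)
```

The cells below achieve the same result with pandas string operations on whole columns.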

In [54]:
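# A label must be a single whitespace-free token, so normalize this category name first
ecommerce['category'] = ecommerce['category'].replace("Clothing & Accessories", "Clothing_Accessories")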
ecommerce['category'] = '__label__' + ecommerce['category'].astype(str)
ecommerce['category_description'] = ecommerce['category'] + ' ' + ecommerce['description']
ecommerce.head(3)


| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__Household Paper Plane Design Framed W... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__Household SAF 'Floral' Framed Paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__Household SAF 'UV Textured Modern Art... |


In [55]:
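def preprocess(text):
    # Replace punctuation (except apostrophes) with spaces; '_' counts as \w, so '__label__' survives
    text = re.sub(r'[^\w\s\']',' ', text)
    # Collapse runs of spaces
    text = re.sub(' +', ' ', text)
    # Lowercasing also lowercases the label, e.g. '__label__household'
    return text.strip().lower()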

In [56]:
ecommerce['category_description'] = ecommerce['category_description'].map(preprocess)
ecommerce.head()

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__household paper plane design framed w... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__household saf 'floral' framed paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__household saf 'uv textured modern art... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... | __label__household saf flower print framed pai... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... | __label__household incredible gifts india wood... |


### Training the model

In [57]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(ecommerce, test_size=0.2) # 20% test, 80% train

In [58]:
train.shape, test.shape

((40339, 3), (10085, 3))

In [59]:
# saving the train and test data on the local dir
train.to_csv("ecommerce.train", columns=["category_description"], index=False, header=False)
test.to_csv("ecommerce.test", columns=["category_description"], index=False, header=False)

In [60]:
import fasttext

model = fasttext.train_supervised(input="ecommerce.train")
model.test("ecommerce.test")

(10084, 0.9684648948829829, 0.9684648948829829)

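The first value (10084) is the number of test examples that FastText evaluated. The second and third values are precision and recall, respectively. Because each example has exactly one label and the model predicts one label per example, the two are equal. Both are about 96.8% here, which is quite good.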
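The reason precision and recall coincide can be checked with a small calculation. This sketch uses made-up labels and a hypothetical helper, `precision_recall`. It assumes precision divides the number of correct predictions by the number of predicted labels, while recall divides it by the number of true labels.

```python
def precision_recall(true_labels, predicted_labels):
    """Compute precision and recall over per-example lists of labels."""
    correct = sum(len(set(t) & set(p)) for t, p in zip(true_labels, predicted_labels))
    n_predicted = sum(len(p) for p in predicted_labels)  # precision denominator
    n_true = sum(len(t) for t in true_labels)            # recall denominator
    return correct / n_predicted, correct / n_true


# Toy data: one true label and one predicted label per example
true_labels = [["books"], ["electronics"], ["household"], ["books"]]
predicted = [["books"], ["electronics"], ["books"], ["books"]]

precision, recall = precision_recall(true_labels, predicted)
print(precision, recall)
```

With one true label and one prediction per example, both denominators equal the number of examples, so the two metrics reduce to plain accuracy.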

In [61]:
model.predict("corei 5 macbook samsung voltage sound microphone smart")


(('__label__electronics',), array([0.99999654]))

In [62]:
model.predict("Lord of the rings")


(('__label__books',), array([0.99891174]))

In [74]:
model.predict("nivea men deodorant")


(('__label__clothing_accessories',), array([0.98994315]))