# FastText
- FastText, developed by Facebook AI Research (FAIR), extends Word2Vec by incorporating subword information into word embeddings.
- It represents each word as a bag of character n-grams, so it can build embeddings for out-of-vocabulary words and handles morphologically rich languages better than embeddings that treat every word as an indivisible unit.

## Key Concepts of FastText
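- **Subword Information:** Represents each word through its character n-grams (for example, `<wh`, `whe`, `her`, `ere`, `re>` for "where" with n = 3), which lets the model capture morphological information such as shared prefixes, stems, and suffixes.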
- **Out-of-Vocabulary (OOV) Words:** Can generate embeddings for words not seen during training by using their character n-grams.
- **Morphologically Rich Languages:** Particularly effective for languages with complex morphology, where words share common roots or affixes.
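The subword idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: the helper name `char_ngrams` is hypothetical, and the real FastText additionally keeps the whole word as its own token and hashes n-grams into a fixed number of buckets. The defaults `min_n=3` and `max_n=6` mirror the commonly used n-gram range.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers."""
    # Wrapping the word in '<' and '>' lets prefixes and suffixes be told
    # apart from n-grams that occur in the middle of a word.
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Related words share n-grams, which is what ties their embeddings together.
shared = sorted(set(char_ngrams("cat", 3, 3)) & set(char_ngrams("cats", 3, 3)))
print(shared)
# ['<ca', 'cat']
```

Because "cat" and "cats" share n-grams, a model built on these units gives them overlapping representations even if one of the two never appears in the training data.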

## Implementing FastText in Python

```python
from gensim.models import FastText
import nltk

# Sample documents
documents = [
    "Cats are beautiful animals.",
    "Dogs are loyal and friendly animals.",
    "Cats and dogs are popular pets.",
    "I love my dog.",
    "My cat is very playful."
]

# Tokenize the documents
# (newer NLTK releases may require the 'punkt_tab' resource instead of 'punkt')
nltk.download('punkt')
tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in documents]

# Train the FastText model
fasttext_model = FastText(sentences=tokenized_docs, vector_size=100, window=5, min_count=1, workers=4)

# Access the vector for a specific word
cat_vector_fasttext = fasttext_model.wv['cat']
print("FastText Vector for 'cat':\n", cat_vector_fasttext)

# Find the most similar words to 'cat'
similar_words_fasttext = fasttext_model.wv.most_similar('cat')
print("Most similar words to 'cat' (FastText):\n", similar_words_fasttext)
```


```text
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

FastText Vector for 'cat':
 [-1.8754165e-03 -2.3678362e-03 -1.0529048e-03  1.8630450e-03
  2.0640636e-04  5.0583123e-05  2.9747063e-04  4.3057506e-03
  2.4392866e-03  3.8923461e-03  5.1995553e-04 -3.2677448e-03
 -3.4944762e-03  1.9666620e-03 -2.8560491e-05  3.1827211e-03
 -4.0411949e-04  6.2053686e-04 -1.5469597e-03 -6.7263137e-04
 -9.2421717e-04  4.8000477e-03 -1.4299202e-03  1.4487315e-03
 -7.3017151e-04 -2.1995048e-03 -2.8899012e-03 -2.9135381e-03
 -7.1256474e-04  9.2400465e-04  2.2049967e-04  3.8951454e-03
  1.0249827e-03  3.3887629e-03  1.5330126e-03 -1.0979780e-03
 -1.9355581e-04 -1.3542820e-03 -7.1068265e-04 -4.0643113e-03
 -6.0755352e-04 -4.9040904e-03 -4.2875302e-03  2.0273456e-03
 -4.6635056e-03 -1.9128687e-03 -1.4119400e-03  2.0373610e-03
 -2.4908867e-03 -2.1145806e-04 -1.7797448e-03 -4.3296893e-03
  1.8520225e-03  2.0005328e-04  1.3031012e-04  3.2806215e-03
 -1.2613607e-03  1.4536155e-03  1.1557203e-03 -3.3751929e-03
  1.2583283e-03  1.5714315e-03 -1.6367539e-03  2.0632986e
```

## Advantages of FastText
- **Subword Information:** Captures morphological and semantic similarities between words more effectively than whole-word models such as Word2Vec, because related word forms share character n-grams.
- **OOV Word Handling:** Can generate embeddings for out-of-vocabulary words based on their character n-grams.
- **Efficient Training:** Despite the additional subword information, training FastText remains efficient.

## Considerations
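- **Training Data:** Quality and size of the training corpus significantly affect the quality of embeddings. The five-sentence corpus used above is far too small to produce meaningful similarities; it only demonstrates the API.
- **Hyperparameters:** Vector size, window size, minimum count, and the character n-gram length range (`min_n` and `max_n` in gensim) should be tuned for the corpus and the task.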
- **Preprocessing:** Proper preprocessing (tokenization, lowercasing) is still important for good results.