# FastText

Facebook's AI Research (FAIR) lab developed `FastText`, an open-source word-embedding library designed to produce accurate, scalable results quickly on large text corpora. Like `GloVe`, it is a successor to `Word2Vec`; more precisely, `FastText` extends the `Word2Vec` model with subword information.

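Unlike Word2Vec, which feeds whole words to the neural network, FastText breaks each word into character n-grams and feeds those n-grams to the network. The word is first wrapped in the boundary markers `<` and `>` so that prefixes and suffixes can be told apart from n-grams in the middle of a word. For instance, the character trigrams of the word `fasttext` are: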

`<fa`, `fas`, `ast`, `stt`, `tte`, `tex`, `ext`, `xt>`
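
The extraction step above can be reproduced in a few lines of Python. This is a minimal sketch, not FastText's actual implementation: the helper name `char_ngrams` is made up for illustration, and it only produces n-grams of a single length, whereas the real library typically combines several lengths.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, padded with '<' and '>'."""
    padded = f"<{word}>"  # boundary markers separate prefixes and suffixes
    # Slide a window of width n over the padded word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("fasttext"))
# ['<fa', 'fas', 'ast', 'stt', 'tte', 'tex', 'ext', 'xt>']
```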

Training the neural network yields an embedding vector for each of these n-grams. The n-gram vectors are then summed to obtain the embedding vector of the original word `fasttext`.
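
The summation step can be sketched with a toy lookup table. Everything here is an assumption made for illustration: the `ngram_vectors` table holds random values rather than trained ones, the dimensionality is 4 instead of a realistic size, and the helper names are hypothetical.

```python
import random


def char_ngrams(word, n=3):
    """Character n-grams of `word`, padded with '<' and '>' (illustrative helper)."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


DIM = 4  # toy dimensionality, chosen only for readability
rng = random.Random(0)

# Hypothetical n-gram embedding table; a trained model would learn these values
ngram_vectors = {
    gram: [rng.uniform(-1, 1) for _ in range(DIM)]
    for gram in char_ngrams("fasttext")
}


def word_vector(word):
    """Sum the vectors of every known n-gram of `word`, component by component."""
    total = [0.0] * DIM
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is None:
            continue  # n-gram not in the toy table
        for i, value in enumerate(vec):
            total[i] += value
    return total


print(word_vector("fasttext"))
```

Because the word vector is assembled from n-gram vectors, any word that shares n-grams with the table gets a non-zero vector, even if the word itself was never seen.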

**How is FastText better than Word2Vec?**

- Compound words like `fasttext` can still be represented reasonably well even if the training data never contains `fasttext` itself, because words such as `fast` and `text` share many of its n-grams.
- Although words like `fast`, `faster`, and `fastest` share the same root, Word2Vec learns a separate vector for each of them, with no parameters shared between them. FastText, on the other hand, shares n-gram parameters among such words and so makes efficient use of their morphological structure.
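
Both points come down to overlapping n-gram sets, which is easy to check directly. The sketch below uses trigrams only and a hypothetical `char_ngrams` helper, so the exact counts would differ for a model that mixes several n-gram lengths.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of `word`, padded with '<' and '>' (illustrative helper)."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


# Point 1: how much of the compound word is covered by its parts?
compound = char_ngrams("fasttext")
parts = char_ngrams("fast") | char_ngrams("text")
covered = compound & parts
print(f"{len(covered)} of {len(compound)} trigrams of 'fasttext' also occur in 'fast' or 'text'")

# Point 2: which trigrams do the related forms have in common?
shared = char_ngrams("fast") & char_ngrams("faster") & char_ngrams("fastest")
print(sorted(shared))
```

Here 6 of the 8 trigrams of `fasttext` are covered by `fast` and `text`, and the three related forms share `<fa`, `fas`, and `ast`, which are the parameters they would have in common.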

Let's try it for real. The open-source Python library `gensim` makes working with FastText easy. We will also use `nltk` for preprocessing, so let's install both libraries.

```
pip3 install nltk
pip3 install gensim
```

In [1]:
from gensim.models import FastText
from nltk.tokenize import word_tokenize

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/arun/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
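# Read the corpus and treat each line as one "sentence"
with open('./../data/shakespeare.txt') as file:
    data = file.read().split('\n')

# Tokenize each line into a list of word tokens
data = [word_tokenize(sentence) for sentence in data]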

In [4]:
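# Train a skip-gram (sg=1) FastText model with 128-dimensional vectors;
# words that occur fewer than 5 times are dropped from the vocabulary
model = FastText(data, vector_size=128, window=5, min_count=5, workers=4, sg=1)
model.save('./../models/fasttext.ft')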

In [5]:
# Keep only the trained word vectors for lookups and similarity queries
fmodel = model.wv

In [6]:
fmodel['king']

array([-0.11217783,  0.20577465,  0.21235843, -0.05980067,  0.03176756,
       -0.14845176,  0.2166336 ,  0.32104605, -0.01718167, -0.22685342,
        0.23097259,  0.05221527, -0.18078251, -0.09492142, -0.11748819,
       -0.27803648,  0.04973472,  0.17366092, -0.20142494, -0.09581104,
       -0.1061855 ,  0.01365055,  0.3170049 , -0.3091896 , -0.00823896,
       -0.02725019, -0.1531331 ,  0.19494449, -0.09518225, -0.07013408,
       -0.3015556 ,  0.5219829 , -0.01059348, -0.02373229, -0.00484098,
        0.02774051,  0.01390303,  0.22808442,  0.2127559 , -0.09040981,
       -0.10189486,  0.15964822,  0.13023494,  0.05941271,  0.01029881,
        0.39144734, -0.27903053, -0.09481051,  0.01477321,  0.13662024,
       -0.04695304,  0.16299796, -0.15678835,  0.14295399,  0.31410298,
       -0.00992163, -0.17167176, -0.01808323, -0.07789619, -0.10428111,
        0.12600031, -0.07776643,  0.10116425, -0.13384894, -0.42259532,
        0.05552647,  0.1492907 ,  0.07554794,  0.06587178, -0.28

In [7]:
fmodel.similar_by_word('king', topn=5)

[('Bring', 0.9997567534446716),
 ('spring', 0.9997466802597046),
 ('Being', 0.9997231960296631),
 ('night', 0.999680757522583),
 ('bring', 0.9996647238731384)]
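
The scores above are cosine similarities between the query vector and each neighbor's vector. As a rough illustration of what that number measures, here is a minimal pure-Python sketch; the function name `cosine_similarity` is chosen for this example and is not gensim's internal implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

A value close to 1 means the two vectors point in almost the same direction. Scores as tightly clustered near 1 as those above suggest that the vectors of this model are not yet well separated.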