In [1]:
%matplotlib inline


FastText Model
==============

Introduces Gensim's fastText model and demonstrates its use on the Lee Corpus.



In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Here, we'll learn to work with the fastText library for training word-embedding
models, saving and loading them, and performing similarity operations and vector
lookups analogous to Word2Vec.



When to use FastText?
---------------------

The morphological structure of a word carries important information about its meaning. Traditional word embeddings ignore this structure because they train a unique embedding for every individual word.

This is especially significant for morphologically rich languages (German, Turkish), in which a single word can have a large number of morphological forms, each of which may occur rarely, making it hard to train good word embeddings.

fastText treats each word as the aggregation of its subwords. Subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams.
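
As a rough illustration of the subword decomposition described above, the sketch below extracts the character ngrams of a word. The function name `char_ngrams` and its parameters are hypothetical. Wrapping the word in `<` and `>` boundary markers follows the convention described for fastText, and Gensim's internal implementation may differ in its details.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character ngrams of `word`, from length min_n to max_n."""
    # Boundary markers let prefixes and suffixes get their own ngrams.
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, to keep the output short.
print(char_ngrams("night", min_n=3, max_n=3))
```

For `night` with trigrams only, this yields `<ni`, `nig`, `igh`, `ght` and `ht>`. If the maximum length is smaller than the minimum length, the outer loop never runs and no ngrams are produced.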

fastText does significantly better than the original Word2Vec on syntactic tasks, especially on smaller training sets. Word2Vec slightly outperforms fastText on semantic tasks. The differences shrink as the size of the training corpus increases.

Training fastText takes longer than training the Gensim version of Word2Vec.

fastText can be used to obtain vectors for out-of-vocabulary (OOV) words by summing up the vectors of their component char-ngrams, provided that at least one of the char-ngrams was present in the training data.
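
The OOV lookup described above can be sketched in plain Python. This is a simplified illustration, not Gensim's implementation: the function `oov_vector`, the dictionary `toy_ngram_vectors` and its values are all made up, and a real model stores ngram vectors in hashed buckets rather than in a dictionary keyed by ngram.

```python
def oov_vector(word, ngram_vectors, min_n=3, max_n=6):
    """Sum the vectors of the word's known character ngrams."""
    wrapped = "<" + word + ">"
    known = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngram = wrapped[start:start + n]
            if ngram in ngram_vectors:
                known.append(ngram_vectors[ngram])
    if not known:
        # None of the word's ngrams was seen, so no vector can be built.
        raise KeyError("no known character ngrams for " + repr(word))
    # Element-wise sum across all matching ngram vectors.
    return [sum(components) for components in zip(*known)]


# Toy 2-dimensional ngram table (made-up values, not trained vectors).
toy_ngram_vectors = {
    "<ni": [1.0, 0.0],
    "igh": [0.0, 2.0],
    "ts>": [0.5, 0.5],
}
print(oov_vector("nights", toy_ngram_vectors, min_n=3, max_n=3))
```

Three of the six trigrams of `nights` are in the toy table, so the result is the sum of those three vectors. A word with no known ngrams raises `KeyError`.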

Training models
---------------




For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model.






In [3]:
from pprint import pprint as print
from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath

# Set the file name for the training data
corpus_file = datapath('lee_background.cor')

model = FT_gensim(size=100)

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words
)

print(model)

2020-04-21 17:41:52,446 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-21 17:41:52,447 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
2020-04-21 17:41:52,448 : INFO : resetting layer weights
2020-04-21 17:41:58,614 : INFO : collecting all words and their counts
2020-04-21 17:41:58,672 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-21 17:41:58,689 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2020-04-21 17:41:58,690 : INFO : Loading a fresh vocabulary
2020-04-21 17:41:58,696 : INFO : effective_min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
2020-04-21 17:41:58,697 : INFO : effective_min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
2020-04-21 17:41:58,702 : INFO : deleting the raw counts dictionary of 10781 items
2020-04-21 17:41:58,703 

<gensim.models.fasttext.FastText object at 0x7f93e20a2c18>


### Training hyperparameters

Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec:

- model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)
- size: Size of embeddings to be learnt (Default 100)
- alpha: Initial learning rate (Default 0.025)
- window: Context window size (Default 5)
- min_count: Ignore words with number of occurrences below this (Default 5)
- loss: Training objective. Allowed values: `ns`, `hs`, `softmax` (Default `ns`)
- sample: Threshold for downsampling higher-frequency words (Default 0.001)
- negative: Number of negative words to sample, for `ns` (Default 5)
- iter: Number of epochs (Default 5)
- sorted_vocab: Sort vocab by descending frequency (Default 1)
- threads: Number of threads to use (Default 12)

FastText has three additional parameters:

- min_n: min length of char ngrams (Default 3)
- max_n: max length of char ngrams (Default 6)
- bucket: number of buckets used for hashing ngrams (Default 2000000)


``min_n`` and ``max_n`` control the lengths of the character ngrams that each word is broken down into while training and looking up embeddings. If ``max_n`` is set to 0, or is less than ``min_n``, no character ngrams are used, and the model effectively reduces to Word2Vec.

To bound the memory requirements of the model being trained, a hashing function maps ngrams to integers in 1 to K. For hashing these character sequences, the `Fowler-Noll-Vo hashing function <http://www.isthe.com/chongo/tech/comp/fnv>`_ (FNV-1a variant) is employed.
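
A minimal sketch of the bucketing idea, assuming the standard 32-bit FNV-1a constants and a simple modulo to pick the bucket. The names `fnv1a_32` and `ngram_bucket` are hypothetical, the bucket indices here are zero-based, and the reference fastText code handles some byte-level details (relevant for non-ASCII text) that this sketch ignores.

```python
FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619


def fnv1a_32(data):
    """Compute the 32-bit FNV-1a hash of a bytes object."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte                             # FNV-1a: XOR first...
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF   # ...then multiply, kept to 32 bits
    return h


def ngram_bucket(ngram, num_buckets=2000000):
    """Map an ngram to a bucket index in [0, num_buckets)."""
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets


print(ngram_bucket("<ni"), ngram_bucket("igh"))
```

Because the number of buckets is fixed, the ngram matrix has a fixed size no matter how many distinct ngrams the corpus contains. The trade-off is that different ngrams can collide and share a vector.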

**Note:** As in the case of Word2Vec, you can continue to train your model while using Gensim's native implementation of fastText.




Saving/loading models
---------------------




Models can be saved and loaded via the ``load`` and ``save`` methods.




In [4]:
# saving a model trained via Gensim's fastText implementation
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])

loaded_model = FT_gensim.load(tmp.name)
print(loaded_model)

os.unlink(tmp.name)

2020-04-21 17:46:47,504 : INFO : saving FastText object under /tmp/saved_model_gensim-zik128cz, separately []
2020-04-21 17:46:47,505 : INFO : storing np array 'vectors_ngrams' to /tmp/saved_model_gensim-zik128cz.wv.vectors_ngrams.npy
2020-04-21 17:46:47,909 : INFO : not storing attribute vectors_norm
2020-04-21 17:46:47,910 : INFO : not storing attribute vectors_vocab_norm
2020-04-21 17:46:47,910 : INFO : not storing attribute vectors_ngrams_norm
2020-04-21 17:46:47,911 : INFO : not storing attribute buckets_word
2020-04-21 17:46:47,912 : INFO : storing np array 'vectors_ngrams_lockf' to /tmp/saved_model_gensim-zik128cz.trainables.vectors_ngrams_lockf.npy
2020-04-21 17:46:48,274 : INFO : saved /tmp/saved_model_gensim-zik128cz
2020-04-21 17:46:48,275 : INFO : loading FastText object from /tmp/saved_model_gensim-zik128cz
2020-04-21 17:46:48,293 : INFO : loading wv recursively from /tmp/saved_model_gensim-zik128cz.wv.* with mmap=None
2020-04-21 17:46:48,294 : INFO : loading vectors_ngram

<gensim.models.fasttext.FastText object at 0x7f9376861c50>


The ``save_word2vec_format`` method causes the vectors for ngrams to be lost. As a result, a model loaded in this way will behave as a regular word2vec model.




Word vector lookup
------------------


**Note:** Word vector lookups and similarity queries work in exactly the same way for both implementations of fastText, so they are demonstrated here using only the native fastText implementation.



FastText models support vector lookups for out-of-vocabulary words by summing up the vectors of the character ngrams belonging to the word.




In [5]:
print('night' in model.wv.vocab)

True


In [6]:
print('nights' in model.wv.vocab)

False


In [7]:
print(model.wv['night'])

array([ 0.10632875,  0.01255457, -0.5669992 ,  0.52888554,  0.57501566,
       -0.34746435, -0.20394236, -0.00705642,  0.42954668,  0.33390266,
       -0.618622  , -0.0305591 , -0.64079636,  0.40977755,  0.3122134 ,
       -0.03818988, -0.14270179,  0.1782077 ,  0.24284893, -0.37909204,
       -0.23392743,  0.2787959 , -0.4048896 , -0.01705266, -0.81833357,
        0.71984917,  0.10700478,  0.14917165,  0.40857723, -0.0035423 ,
       -0.6586291 ,  0.21239302,  0.07544171, -0.46844804,  0.4637418 ,
        0.13856308, -0.16215467, -0.10881548,  0.44739377,  0.25178418,
       -0.02010422, -0.07718582,  0.3947144 , -0.06683552,  0.13483599,
        0.20669809, -0.16199008,  0.22898626, -0.04825592, -0.372121  ,
       -0.5384312 , -0.55837435,  0.11150873,  0.00262361,  0.4515773 ,
       -0.78977716, -0.11445688, -0.14899562,  0.02899424, -0.02960676,
        0.2014096 , -0.1658334 , -0.48369506, -0.05303399, -0.54290277,
        0.41445872, -0.02554515,  0.1757904 ,  0.03581964,  0.48



In [8]:
print(model.wv['nights'])

array([ 0.09313107,  0.01144097, -0.4920465 ,  0.4582299 ,  0.49806002,
       -0.30292475, -0.17764547, -0.00533584,  0.37240922,  0.29038686,
       -0.5388981 , -0.02793436, -0.55630517,  0.35698062,  0.2669267 ,
       -0.03287829, -0.12595409,  0.15291354,  0.20911163, -0.32940567,
       -0.20298642,  0.2428599 , -0.35204935, -0.01412793, -0.71082264,
        0.62443584,  0.09284382,  0.12989873,  0.35468996, -0.00346226,
       -0.56945795,  0.1855444 ,  0.06412946, -0.408463  ,  0.4020396 ,
        0.11991416, -0.13922669, -0.09567924,  0.38805512,  0.21922219,
       -0.01653666, -0.06709076,  0.3418985 , -0.05627145,  0.11587152,
        0.1795545 , -0.14408827,  0.20041627, -0.04250341, -0.32138348,
       -0.46961102, -0.48358548,  0.09788522,  0.00322538,  0.3923521 ,
       -0.68530947, -0.09937615, -0.12812673,  0.02282381, -0.02539116,
        0.174292  , -0.14470427, -0.41939905, -0.04661592, -0.47152534,
        0.35898882, -0.02202755,  0.1523797 ,  0.03281567,  0.42



The ``in`` operation works slightly differently from the original word2vec. It tests whether a vector for the given word exists, not whether the word is present in the word vocabulary. To test whether a word is present in the training vocabulary, check ``model.wv.vocab`` instead.

Test whether the word is present in the vocabulary:



In [9]:
print("word" in model.wv.vocab)

False


Test whether a vector is available for the word:



In [10]:
print("word" in model.wv)

True




Similarity operations
---------------------




Similarity operations work the same way as word2vec. **Out-of-vocabulary words can also be used, provided they have at least one character ngram present in the training data.**
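
The similarity score reported by these operations is the cosine similarity between two word vectors. The sketch below computes it in plain Python; the function name `cosine_similarity` and the input vectors are illustrative and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Parallel vectors score close to 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the score depends only on direction, two vectors that differ mainly in length, as a word and its close morphological variant often do, can still score near 1.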




In [11]:
print("nights" in model.wv.vocab)

False


In [12]:
print("night" in model.wv.vocab)

True


In [15]:
print(model.wv.similarity("night", "nights"))

0.9999927


Syntactically similar words generally have high similarity in fastText models, since a large number of the component char-ngrams will be the same. As a result, fastText generally does better at syntactic tasks than Word2Vec. A detailed comparison is provided `here <Word2Vec_FastText_Comparison.ipynb>`_.
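
To make the overlap concrete, the sketch below counts the character ngrams that `night` and `nights` have in common. It assumes ngram lengths 3 to 6 and `<`/`>` boundary markers, and the helper name `char_ngram_set` is hypothetical.

```python
def char_ngram_set(word, min_n=3, max_n=6):
    """Return the set of character ngrams of `word`, with boundary markers."""
    wrapped = "<" + word + ">"
    return {
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    }


night = char_ngram_set("night")
nights = char_ngram_set("nights")
shared = night & nights

print(len(shared), "shared ngrams;", len(night), "in 'night',", len(nights), "in 'nights'")
print(sorted(shared))
```

Most of the ngrams of `night` also occur in `nights`, so the two summed vectors share most of their terms, which is consistent with the very high similarity score above.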

In [16]:
print(model.wv.most_similar("nights"))

2020-04-21 17:50:07,319 : INFO : precomputing L2-norms of word weight vectors
2020-04-21 17:50:07,321 : INFO : precomputing L2-norms of ngram weight vectors


[('study', 0.9982695579528809),
 ('boat', 0.9982630610466003),
 ('often', 0.998261570930481),
 ('Arafat', 0.9982523918151855),
 ('north.', 0.9982521533966064),
 ('Arafat,', 0.9982487559318542),
 ('"That', 0.9982447028160095),
 ('stage', 0.9982336759567261),
 ('heard', 0.9982335567474365),
 ('Commonwealth', 0.9982331395149231)]


In [17]:
print(model.wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))

0.99995166




In [18]:
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

'breakfast'




In [19]:
print(model.wv.most_similar(positive=['baghdad', 'england'], negative=['london']))

[('1', 0.24227944016456604),
 ('40', 0.23731181025505066),
 ('2', 0.23469692468643188),
 ('20', 0.2326391488313675),
 ('UN', 0.23258794844150543),
 ('26', 0.23256193101406097),
 ('blaze', 0.23193609714508057),
 ('keep', 0.23132377862930298),
 ('...', 0.23056435585021973),
 ('As', 0.23053264617919922)]




In [20]:
print(model.wv.accuracy(questions=datapath('questions-words.txt')))

2020-04-21 17:50:22,762 : INFO : family: 0.0% (0/2)
2020-04-21 17:50:22,799 : INFO : gram3-comparative: 25.0% (3/12)
2020-04-21 17:50:22,818 : INFO : gram4-superlative: 25.0% (3/12)
2020-04-21 17:50:22,860 : INFO : gram5-present-participle: 20.0% (4/20)
2020-04-21 17:50:22,905 : INFO : gram6-nationality-adjective: 40.0% (8/20)
2020-04-21 17:50:22,940 : INFO : gram7-past-tense: 5.0% (1/20)
2020-04-21 17:50:22,983 : INFO : gram8-plural: 8.3% (1/12)
2020-04-21 17:50:22,991 : INFO : total: 20.4% (20/98)


[{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},
 {'correct': [], 'incorrect': [], 'section': 'capital-world'},
 {'correct': [], 'incorrect': [], 'section': 'currency'},
 {'correct': [], 'incorrect': [], 'section': 'city-in-state'},
 {'correct': [],
  'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')],
  'section': 'family'},
 {'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},
 {'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},
 {'correct': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
              ('GREAT', 'GREATER', 'LOW', 'LOWER'),
              ('LONG', 'LONGER', 'GREAT', 'GREATER')],
  'incorrect': [('GOOD', 'BETTER', 'LONG', 'LONGER'),
                ('GOOD', 'BETTER', 'LOW', 'LOWER'),
                ('GREAT', 'GREATER', 'LONG', 'LONGER'),
                ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
                ('LONG', 'LONGER', 'LOW', 'LOWER'),
                ('LONG', 'LONGER', 'GOOD', 'BETTER'),
  

### Word Mover's Distance

Let's start with two sentences:

In [22]:
sentence_obama     = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

Remove their stopwords.




In [25]:
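import nltk
from nltk.corpus import stopwords

# Requires the NLTK stopword list; fetch it once with nltk.download('stopwords').
# A separate name avoids shadowing the imported `stopwords` module.
stop_words = stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stop_words]
sentence_president = [w for w in sentence_president if w not in stop_words]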

Compute the Word Mover's Distance between the two sentences. It builds on the [earth mover's distance](https://en.wikipedia.org/wiki/Earth_mover%27s_distance), which is computed by the [pyemd](https://pypi.org/project/pyemd/) package:



In [28]:
import pyemd  # not used directly; wmdistance needs it to be installed
distance = model.wv.wmdistance(sentence_obama, sentence_president)
print(distance)

2020-04-21 17:55:21,396 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-04-21 17:55:21,397 : INFO : built Dictionary(8 unique tokens: ['illinois', 'media', 'obama', 'speaks', 'chicago']...) from 2 documents (total 8 corpus positions)


1.3918907387763262
