In [1]:
%matplotlib inline


FastText Model
==============

Introduces Gensim's fastText model and demonstrates its use on the Lee Corpus.



In [1]:
import pprint
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

When to use FastText?
---------------------

* The morphological structure of a word carries information about its meaning. Traditional word embeddings ignore this structure because they train a separate embedding for every word.

* This matters especially for morphologically rich languages (e.g. German, Turkish), in which a single word can have a large number of morphological forms, each of which might occur rarely, making it hard to train good word embeddings.

* _fastText treats each word as the aggregation of its subwords_. Subwords are taken to be the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams.

* fastText does significantly better on **syntactic tasks** vs the original Word2Vec, especially on smaller training sets. Word2Vec slightly outperforms FastText on **semantic tasks**. The differences grow smaller as the size of training corpus increases.

* Training a fastText model takes longer than training Gensim's Word2Vec.

* fastText can be used to obtain vectors for out-of-vocabulary (OOV) words by summing the vectors of their component char-ngrams, provided at least one of the char-ngrams was present in the training data.
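
As a rough illustration of the subword idea above, the sketch below enumerates the character ngrams of a word. The helper name `char_ngrams` is hypothetical, and wrapping the word in `<` and `>` boundary markers is an assumption taken from the fastText paper rather than from this tutorial; Gensim's internal routine may differ in detail.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character ngrams of `word`, shortest first."""
    # Assumption: the word is wrapped in boundary markers before slicing,
    # so prefixes and suffixes get their own distinct ngrams.
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


trigrams = char_ngrams("night", min_n=3, max_n=3)
print(trigrams)  # ['<ni', 'nig', 'igh', 'ght', 'ht>']
```

With the default lengths of 3 to 6, the same call yields 14 ngrams for "night".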

Training models
---------------




For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model.






In [2]:
from pprint import pprint as print
from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath

# Path to the training corpus bundled with Gensim
corpus_file = datapath('lee_background.cor')
model       = FT_gensim(size=100)

model.build_vocab(corpus_file=corpus_file)

model.train(
    corpus_file    = corpus_file, 
    epochs         = model.epochs,
    total_examples = model.corpus_count, 
    total_words    = model.corpus_total_words
)

print(model)



<gensim.models.fasttext.FastText object at 0x7fa3c26d1128>


### Training hyperparameters

Hyperparameters for training the model follow the same pattern as Word2Vec. FastText supports the following parameters from the original word2vec:

- model: Training architecture. Allowed values: `cbow`, `skipgram` (Default `cbow`)
- size: Size of embeddings to be learnt (Default 100)
- alpha: Initial learning rate (Default 0.025)
- window: Context window size (Default 5)
- min_count: Ignore words with number of occurrences below this (Default 5)
- loss: Training objective. Allowed values: `ns`, `hs`, `softmax` (Default `ns`)
- sample: Threshold for downsampling higher-frequency words (Default 0.001)
- negative: Number of negative words to sample, for `ns` (Default 5)
- iter: Number of epochs (Default 5)
- sorted_vocab: Sort vocab by descending frequency (Default 1)
- threads: Number of threads to use (Default 12)

FastText has three additional parameters:

- _min_n_: min length of char ngrams (Default 3)
- _max_n_: max length of char ngrams (Default 6)
- _bucket_: number of buckets used for hashing ngrams (Default 2000000)

* _min_n_ and _max_n_ control the lengths of character ngrams that each word is broken down into while training and looking up embeddings. If _max_n_ is set to 0, or is less than _min_n_, no character ngrams are used and the model effectively reduces to Word2Vec.

* A hashing function maps ngrams to integers in the range 1 to K, where K is the number of buckets. The [Fowler-Noll-Vo hashing function](http://www.isthe.com/chongo/tech/comp/fnv) (FNV-1a variant) is used.
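
To make the hashing step concrete, here is a minimal sketch of the textbook 32-bit FNV-1a hash followed by a modulo into `bucket` slots. The function names are hypothetical and the sketch returns 0-based bucket indices. The exact byte handling and index offset used by fastText and Gensim may differ, so treat this as an illustration rather than a reimplementation.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Textbook 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the result within 32 bits
    return h


def ngram_bucket(ngram: str, bucket: int = 2000000) -> int:
    """Map an ngram to a 0-based bucket index (hypothetical helper)."""
    return fnv1a_32(ngram.encode("utf-8")) % bucket


print(hex(fnv1a_32(b"a")))  # 0xe40c292c
print(ngram_bucket("<ni"))
```

Because many distinct ngrams can land in the same bucket, a smaller `bucket` value saves memory at the cost of more collisions.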

**Note:** As in the case of Word2Vec, you can continue to train your model while using Gensim's native implementation of fastText.




Saving/loading models
---------------------




Models can be saved and loaded via the ``load`` and ``save`` methods.




In [3]:
# saving a model trained via Gensim's fastText implementation
import tempfile
import os
with tempfile.NamedTemporaryFile(prefix='saved_model_gensim-', delete=False) as tmp:
    model.save(tmp.name, separately=[])

loaded_model = FT_gensim.load(tmp.name)
print(loaded_model)

os.unlink(tmp.name)

<gensim.models.fasttext.FastText object at 0x7fa3c26d10b8>


* Saving with ``save_word2vec_format`` causes the ngram vectors to be lost, so models loaded this way will behave as regular word2vec models.

Word vector lookup
------------------
**Note:** Word vector lookups and similarity queries are demonstrated here using only Gensim's native fastText implementation.
* FastText models support vector lookups for out-of-vocabulary words by summing up character ngrams belonging to the word.
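
A minimal sketch of that lookup, using a toy table of made-up trigram vectors rather than a trained model. The names `NGRAM_VECTORS` and `oov_vector` are hypothetical, and plain summation is used because that is how this tutorial describes it; a real implementation may also normalise the result.

```python
# Toy 2-dimensional vectors for trigrams seen during training (made-up values).
NGRAM_VECTORS = {
    "<ni": [0.5, 0.0],
    "nig": [0.25, 0.25],
    "igh": [0.0, 0.5],
    "ght": [0.25, 0.0],
}


def oov_vector(ngrams, table, dim=2):
    """Sum the vectors of all known ngrams; fail if none are known."""
    total = [0.0] * dim
    known = 0
    for ngram in ngrams:
        vec = table.get(ngram)
        if vec is None:
            continue  # ngram never seen in training: contributes nothing
        known += 1
        for i in range(dim):
            total[i] += vec[i]
    if known == 0:
        raise KeyError("none of the ngrams were seen in training")
    return total


# Trigrams of the out-of-vocabulary word "nights"; the last two are unseen.
nights_trigrams = ["<ni", "nig", "igh", "ght", "hts", "ts>"]
print(oov_vector(nights_trigrams, NGRAM_VECTORS))  # [1.0, 0.75]
```

A word whose ngrams are all unseen raises `KeyError` here, mirroring the condition that at least one char-ngram must have been present in the training data.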

In [4]:
print('night' in model.wv.vocab)

True


In [5]:
print('nights' in model.wv.vocab)

False


In [6]:
print(model.wv['night'])

array([ 0.0951848 ,  0.00439684, -0.5730612 ,  0.48200205,  0.59976846,
       -0.31887755, -0.18905407, -0.04429929,  0.42820224,  0.33637044,
       -0.6341972 , -0.01148262, -0.67034054,  0.41099748,  0.279039  ,
       -0.0609754 , -0.19455384,  0.18634835,  0.2372648 , -0.3728167 ,
       -0.2411645 ,  0.27350628, -0.37734354,  0.02136977, -0.82818276,
        0.7267074 ,  0.11816929,  0.17816381,  0.4048582 ,  0.0064444 ,
       -0.6489303 ,  0.23216677,  0.08447264, -0.47895163,  0.46647942,
        0.10955972, -0.16104989, -0.0692726 ,  0.41848487,  0.21457395,
       -0.00909338, -0.07110799,  0.38920757, -0.06391519,  0.12494167,
        0.18315583, -0.13085683,  0.21250576, -0.01720706, -0.3876734 ,
       -0.568277  , -0.55089724,  0.06869221,  0.01142327,  0.40094298,
       -0.79913646, -0.10639761, -0.1799068 ,  0.01203778, -0.02949148,
        0.21544454, -0.12062592, -0.48480937, -0.09911016, -0.53330535,
        0.35012123,  0.00529964,  0.15234514,  0.03122097,  0.47


In [7]:
print(model.wv['nights'])

array([ 0.08390839,  0.00439281, -0.49997562,  0.4197932 ,  0.52237827,
       -0.2796327 , -0.16560148, -0.03786143,  0.373268  ,  0.29412517,
       -0.55540466, -0.01145034, -0.5850891 ,  0.35996655,  0.23944286,
       -0.05291162, -0.17186531,  0.1608732 ,  0.20536552, -0.32568985,
       -0.21039045,  0.23954262, -0.32993677,  0.01930434, -0.72327334,
        0.63382936,  0.10306088,  0.15591006,  0.35340077,  0.0052096 ,
       -0.5640832 ,  0.20378862,  0.07240348, -0.4200235 ,  0.40685552,
        0.09530769, -0.13907899, -0.06173637,  0.3651373 ,  0.18804902,
       -0.00701293, -0.06219959,  0.33914077, -0.05404743,  0.10795779,
        0.16007127, -0.11776429,  0.18721513, -0.01566308, -0.33687627,
       -0.49844742, -0.47994205,  0.06110583,  0.01093521,  0.35052204,
       -0.6975338 , -0.09294295, -0.15586945,  0.00817089, -0.02542602,
        0.18758976, -0.10613206, -0.42290136, -0.08708555, -0.46591488,
        0.3049851 ,  0.00474294,  0.13284114,  0.02900123,  0.41


* The _in_ operation works slightly differently from the original word2vec implementation. It tests whether a vector can be produced for the given word, not whether the word is present in the vocabulary.

In [8]:
print("word" in model.wv.vocab)

False


In [9]:
print("word" in model.wv)

True


Similarity operations
---------------------




* Similarity operations work the same way as word2vec. **Out-of-vocabulary words can also be used, provided they have at least one character ngram present in the training data.**

In [10]:
print("nights" in model.wv.vocab)

False


In [11]:
print("night" in model.wv.vocab)

True


In [12]:
print(model.wv.similarity("night", "nights"))

0.9999927


* Syntactically similar words will have high similarity scores in fastText models, since a large number of the component char-ngrams will be the same. Therefore fastText usually does better at syntactic tasks than Word2Vec.
* See the [Word2Vec and FastText comparison notebook](Word2Vec_FastText_Comparison.ipynb) for a detailed comparison.
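
The ngram overlap behind this claim can be checked directly. The sketch below counts the character ngrams (lengths 3 to 6, with assumed `<` and `>` boundary markers) shared by "night" and "nights"; the helper name `ngram_set` is hypothetical.

```python
def ngram_set(word, min_n=3, max_n=6):
    """Return the set of character ngrams of `word` (boundary markers assumed)."""
    wrapped = "<" + word + ">"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


night, nights = ngram_set("night"), ngram_set("nights")
shared = night & nights

print(len(night), len(nights), len(shared))  # 14 18 10
print(len(shared) / len(night | nights))     # Jaccard overlap of the two sets
```

Ten of the fourteen ngrams of "night" also occur in "nights", so the two summed vectors are built largely from the same components.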

In [13]:
print(model.wv.most_similar("nights"))


[('Arafat', 0.9982719421386719),
 ('study', 0.9982718825340271),
 ('"That', 0.9982710480690002),
 ('boat', 0.9982693791389465),
 ('Arafat,', 0.9982660412788391),
 ('often', 0.9982610940933228),
 ('north.', 0.9982523918151855),
 ('Endeavour', 0.998251736164093),
 ("Arafat's", 0.9982442855834961),
 ('details', 0.9982432126998901)]


In [14]:
print(model.wv.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant']))

0.9999513


In [15]:
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

'breakfast'


In [16]:
print(model.wv.most_similar(positive=['baghdad', 'england'], negative=['london']))

[('1', 0.2420959174633026),
 ('40', 0.2376674860715866),
 ('2', 0.23438993096351624),
 ('26', 0.23292623460292816),
 ('20', 0.23248399794101715),
 ('UN', 0.23195061087608337),
 ('blaze', 0.23188075423240662),
 ('keep', 0.2311323881149292),
 ('...', 0.23067551851272583),
 ('As', 0.23064441978931427)]


In [17]:
print(model.wv.accuracy(questions=datapath('questions-words.txt')))


[{'correct': [], 'incorrect': [], 'section': 'capital-common-countries'},
 {'correct': [], 'incorrect': [], 'section': 'capital-world'},
 {'correct': [], 'incorrect': [], 'section': 'currency'},
 {'correct': [], 'incorrect': [], 'section': 'city-in-state'},
 {'correct': [],
  'incorrect': [('HE', 'SHE', 'HIS', 'HER'), ('HIS', 'HER', 'HE', 'SHE')],
  'section': 'family'},
 {'correct': [], 'incorrect': [], 'section': 'gram1-adjective-to-adverb'},
 {'correct': [], 'incorrect': [], 'section': 'gram2-opposite'},
 {'correct': [('GOOD', 'BETTER', 'GREAT', 'GREATER'),
              ('LONG', 'LONGER', 'GREAT', 'GREATER')],
  'incorrect': [('GOOD', 'BETTER', 'LONG', 'LONGER'),
                ('GOOD', 'BETTER', 'LOW', 'LOWER'),
                ('GREAT', 'GREATER', 'LONG', 'LONGER'),
                ('GREAT', 'GREATER', 'LOW', 'LOWER'),
                ('GREAT', 'GREATER', 'GOOD', 'BETTER'),
                ('LONG', 'LONGER', 'LOW', 'LOWER'),
                ('LONG', 'LONGER', 'GOOD', 'BETTER'),


### Word Mover's Distance
* Start with two sentences.

In [18]:
sentence_obama     = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

* Remove their stopwords.

In [19]:
from nltk import download
from nltk.corpus import stopwords

download('stopwords')  # fetch the stopword list if it is not already installed
stop_words         = stopwords.words('english')
sentence_obama     = [w for w in sentence_obama if w not in stop_words]
sentence_president = [w for w in sentence_president if w not in stop_words]

* Compute the Word Mover's Distance between them. It is based on the [earth mover's distance](https://en.wikipedia.org/wiki/Earth_mover%27s_distance), which is computed by the [pyemd](https://pypi.org/project/pyemd/) package:



In [20]:
import pyemd  # needed by wmdistance; importing it here surfaces a missing install early
distance = model.wv.wmdistance(sentence_obama, sentence_president)
print(distance)

1.389856288520813
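
For intuition about what this distance measures, the sketch below computes a relaxed lower bound on the Word Mover's Distance, in which every word simply travels to its nearest counterpart in the other sentence. This is not what `wmdistance` computes (that solves the full transport problem via pyemd); the two-dimensional vectors and the function names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
TOY_VECTORS = {
    "obama": (1.0, 0.0), "president": (0.9, 0.1),
    "speaks": (0.0, 1.0), "greets": (0.1, 0.9),
    "media": (0.5, 0.5), "press": (0.6, 0.4),
    "illinois": (1.0, 1.0), "chicago": (0.9, 1.0),
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_sided_cost(source, target, vectors):
    """Move each source word's share of mass to its nearest target word."""
    weights = Counter(source)
    total = sum(weights.values())
    cost = 0.0
    for word, count in weights.items():
        nearest = min(euclidean(vectors[word], vectors[t]) for t in target)
        cost += (count / total) * nearest
    return cost


def relaxed_wmd(doc_a, doc_b, vectors):
    """Take the tighter (larger) of the two one-sided lower bounds."""
    return max(one_sided_cost(doc_a, doc_b, vectors),
               one_sided_cost(doc_b, doc_a, vectors))


doc_a = ["obama", "speaks", "media", "illinois"]
doc_b = ["president", "greets", "press", "chicago"]

print(relaxed_wmd(doc_a, doc_b, TOY_VECTORS))
print(relaxed_wmd(doc_a, doc_a, TOY_VECTORS))  # 0.0
```

Sentences that share no words can still be close under this measure, because the cost depends on how far apart the word vectors are rather than on exact matches.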
