# Word2Vec Embeddings vs. fastText

In [15]:
from gensim.models.word2vec import LineSentence
from gensim.models.word2vec import Word2Vec

fastText embeddings are an extension of Word2Vec embeddings.

In addition to whole words, fastText takes word morphology into account while learning the word representations.

In this notebook, we will try to get an idea of the pros and cons of both techniques.

For testing purposes, we will use the skip-gram architecture for the word embeddings.
From the training perspective, note the following:
1. Gensim Word2Vec: skip-gram
2. fastText: skip-gram + character-level n-grams

Create a directory called `models/` where we will persist our trained models.

In [17]:
MODELS = 'models/'
!mkdir -p {MODELS}

# Training parameters

Define the training parameters and the function that trains the models.

We will compare three different methods:

1. Gensim's Word2Vec embeddings
2. Facebook's fastText with character n-grams
3. fastText without character n-grams

The main purpose is to understand the difference between Facebook's fastText and Gensim's Word2Vec.

Both represent words as vectors; however, fastText also adds character n-grams to the word representation.
This is beneficial when morphological nuances matter, for example with inflected word forms.

Plain Word2Vec treats each whitespace-separated token as an atomic unit and learns a single vector for it.

The following image explains character-level n-grams:

<img src="./files/29443.jpg">
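As a rough illustration of the idea in the image, the sketch below extracts character n-grams from a word in the way the fastText paper describes: the word is wrapped in `<` and `>` boundary markers and every substring of length `minn` to `maxn` is collected. The function name `char_ngrams` is made up for this example, and the defaults of 3 and 6 mirror the `-minn`/`-maxn` defaults of the fastText command-line tool. It is a sketch only; the real implementation additionally hashes the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, including boundary markers."""
    token = '<' + word + '>'  # mark the start and end of the word
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(token) - n + 1):
            ngrams.append(token[start:start + n])
    return ngrams


print(char_ngrams('where', minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Setting `-maxn 0`, as done for the third model in this notebook, disables these subword units so that only whole-word vectors are learned.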



In [37]:
import os
import time

lr = 0.05
dim = 100
ws = 5
epoch = 5
minCount = 5
neg = 5
loss = 'ns'
t = 1e-4

params = {
    'alpha': lr,
    'size': dim,
    'window': ws,
    'iter': epoch,
    'min_count': minCount,
    'sample': t,
    'sg': 1,
    'hs': 0,
    'negative': neg
}



def train_models(corpus, output_name):
    out_file = '{0}_ft'.format(output_name)
    if not os.path.isfile(os.path.join(MODELS, '{:s}.vec'.format(out_file))):
        print('Training fasttext on {:s} corpus..'.format(corpus))
        %time !/home/abhishek/fastText/fasttext skipgram -input {corpus} -output {MODELS+out_file} -lr {lr} -dim {dim} -ws {ws} -epoch {epoch} -minCount {minCount} -neg {neg} -loss {loss} -t {t}
    else:
        print('\nUsing existing model file {:s}.vec'.format(out_file))

    print('\n')
    
    out_file  = '{0}_ft_no_char_ng'.format(output_name)
    if not os.path.isfile(os.path.join(MODELS, '{:s}.vec'.format(out_file))):
        print('Training fasttext with 0 char n-grams on {:s} corpus..'.format(corpus))
        %time !/home/abhishek/fastText/fasttext skipgram -input {corpus} -output {MODELS+out_file} -lr {lr} -dim {dim} -ws {ws} -epoch {epoch} -minCount {minCount} -neg {neg} -loss {loss} -t {t} -maxn 0
    else:
        print('\nUsing existing model file {:s}.vec'.format(out_file))
        
    print('\n')
        
    out_file = '{0}_word2vec'.format(output_name)
    if not os.path.isfile(os.path.join(MODELS, '{:s}.vec'.format(out_file))):
        print('Training Gensim word2vector on {:s} corpus..'.format(corpus))
        start = time.time()
        model = Word2Vec(LineSentence(corpus), **params)
        end = time.time()
        print('Time  : {}'.format(end - start))
        model.save(MODELS+out_file+'.vec')
    else:
        print('\nUsing existing model file {:s}.vec'.format(out_file))
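
The `params` dictionary above uses the keyword names of the gensim release this notebook was written against. In gensim 4.x, `size` was renamed to `vector_size` and `iter` to `epochs`. If you run the notebook on a newer install, a small helper along these lines could translate the dictionary. The name `to_gensim4_params` is hypothetical and not part of gensim.

```python
def to_gensim4_params(legacy_params):
    """Rename pre-4.0 Word2Vec keyword arguments to their gensim 4.x names."""
    renames = {'size': 'vector_size', 'iter': 'epochs'}
    return {renames.get(key, key): value for key, value in legacy_params.items()}


legacy = {'alpha': 0.05, 'size': 100, 'window': 5, 'iter': 5, 'sg': 1}
print(to_gensim4_params(legacy))
```

The remaining keys used in this notebook (`alpha`, `window`, `min_count`, `sample`, `sg`, `hs`, `negative`) pass through unchanged.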

# Training 

For training the models, I have used the Brown corpus (it can be easily downloaded using the NLTK Python library).
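
The notebook does not show how `brown_corp.txt` was produced. One plausible way, sketched below, is to write NLTK's tokenized Brown sentences one per line, which is the layout that both `LineSentence` and the fastText command-line tool read. The helper names are made up for this sketch, and details such as lowercasing or punctuation handling are assumptions that may differ from the file used here.

```python
import io


def write_line_sentences(sentences, path):
    """Write one whitespace-joined sentence per line; return the sentence count."""
    count = 0
    with io.open(path, 'w', encoding='utf-8') as out:
        for tokens in sentences:
            out.write(u' '.join(tokens) + u'\n')
            count += 1
    return count


def export_brown(path='brown_corp.txt'):
    """Fetch the Brown corpus through NLTK (if needed) and dump it to `path`."""
    import nltk                      # imported lazily: only needed for the export
    from nltk.corpus import brown
    nltk.download('brown')
    return write_line_sentences(brown.sents(), path)
```

Calling `export_brown()` once before `train_models('brown_corp.txt', 'brown')` would create the input file.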

Below, you will see the time taken by each of the three types of embeddings.

As you can see, fastText with character-level n-grams takes significantly longer to train:

1. fastText (with character-level n-grams): 54.5s
2. fastText (without character-level n-grams): 19s
3. Gensim's Word2Vec: 17s

fastText without character-level n-grams should theoretically be the same as Gensim's Word2Vec, so the time taken is almost the same.
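
To put the timings listed above in relative terms, the snippet below divides each one by the Gensim time. The numbers are copied from the list; the variable names are only for illustration.

```python
timings = {
    'fastText (char n-grams)': 54.5,
    'fastText (no char n-grams)': 19.0,
    'gensim Word2Vec': 17.0,
}
baseline = timings['gensim Word2Vec']

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print('{:<28s} {:5.1f}s  {:.1f}x'.format(name, seconds, seconds / baseline))
```

On this run, fastText with character n-grams is roughly 3.2 times slower than Gensim, while fastText without them is within about 10% of it.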

In [38]:
train_models('brown_corp.txt', 'brown')

Training fasttext on brown_corp.txt corpus..
Read 1M words
Number of words:  15173
Number of labels: 0
Progress: 100.0%  words/sec/thread: 50818  lr: 0.000000  loss: 2.395198  eta: 0h0m

# Results

To test the performance of the word embeddings, we will use the word analogy questions in `questions-words.txt`.

In [28]:
!wget https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt

--2017-05-16 12:38:50--  https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.64.133, 151.101.128.133, 151.101.192.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.64.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 603955 (590K) [text/plain]
Saving to: ‘questions-words.txt’


2017-05-16 12:38:51 (1.96 MB/s) - ‘questions-words.txt’ saved [603955/603955]



# Accuracy Scores 

Accuracy is reported separately for the semantic and the syntactic analogy tasks in `questions-words.txt`.
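
Each question in `questions-words.txt` has the form `a b c d`, read as "a is to b as c is to d", and it counts as correct when the model predicts `d` from the other three words. A minimal sketch of the usual vector-offset prediction is shown below. It uses made-up 2-D toy vectors rather than trained embeddings, and gensim's own evaluation differs in details such as vector normalization and vocabulary restriction.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors chosen by hand so that the man->woman offset equals king->queen.
toy = {
    'man': [0.3, 0.9], 'woman': [0.3, 0.3],
    'king': [0.9, 0.8], 'queen': [0.9, 0.2],
    'apple': [-0.8, 0.1],
}
print(solve_analogy(toy, 'man', 'woman', 'king'))  # queen
```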

In [41]:
import gensim 

def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    # `accuracy` returns one dict per section of the questions file, followed by
    # a final 'total' entry. Newer gensim releases replace this method with
    # `evaluate_word_analogies`.
    acc = model.accuracy(questions_file)

    # The first five sections of questions-words.txt hold the semantic analogies.
    sem_correct = sum(len(acc[i]['correct']) for i in range(5))
    sem_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5))
    sem_acc = 100*float(sem_correct)/sem_total
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, sem_acc))

    # The remaining sections are syntactic; len(acc)-1 skips the 'total' entry.
    syn_correct = sum(len(acc[i]['correct']) for i in range(5, len(acc)-1))
    syn_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5, len(acc)-1))
    syn_acc = 100*float(syn_correct)/syn_total
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, syn_acc))
    return (sem_acc, syn_acc)

word_analogies_file = 'questions-words.txt'
accuracies = []
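
The index ranges in `print_accuracy` rely on the layout of `questions-words.txt`: section headers start with `:`, the first five sections are semantic, and the syntactic sections have names beginning with `gram`. The sketch below groups the questions by section so that the split can be checked directly. `read_sections` is a hypothetical helper, not part of gensim, and the sample lines only imitate the file's format.

```python
from collections import OrderedDict


def read_sections(lines):
    """Group analogy questions by their ': section-name' header, in file order."""
    sections = OrderedDict()
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            current = line[1:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(tuple(line.split()))
    return sections


sample = [
    ': family',
    'boy girl brother sister',
    ': gram1-adjective-to-adverb',
    'amazing amazingly apparent apparently',
    'calm calmly cheerful cheerfully',
]
for name, questions in read_sections(sample).items():
    kind = 'syntactic' if name.startswith('gram') else 'semantic'
    print('{:s}: {:d} questions ({:s})'.format(name, len(questions), kind))
```

Passing an open file handle, for example `read_sections(open('questions-words.txt'))`, would list the real sections in order.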

# Accuracy for Word2Vec (Gensim):

In [42]:
print('\nLoading Gensim embeddings')
myCorpus_gs = Word2Vec.load(MODELS + 'brown_word2vec.vec')
print('Accuracy for Word2Vec:')
accuracies.append(print_accuracy(myCorpus_gs, word_analogies_file))


Loading Gensim embeddings
Accuracy for Word2Vec:
Evaluating...


Semantic: 30/813, Accuracy: 3.69%
Syntactic: 97/5255, Accuracy: 1.85%



# Accuracy for FastText (with n-grams):

In [43]:
print('\nLoading FastText embeddings')
ft_model = gensim.models.KeyedVectors.load_word2vec_format(MODELS + 'brown_ft.vec')
print('Accuracy for FastText (with n-grams):')
accuracies.append(print_accuracy(ft_model, word_analogies_file))


Loading FastText embeddings
Accuracy for FastText (with n-grams):
Evaluating...


Semantic: 47/813, Accuracy: 5.78%
Syntactic: 2629/5255, Accuracy: 50.03%
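
The `.vec` file written by fastText is plain text in the word2vec format, which is why `load_word2vec_format` can read it: a header line with the vocabulary size and the dimension, then one line per word followed by its vector components. Because such a file holds only finished word vectors, the character n-grams influence the loaded model only through what they contributed during training. A minimal reader, with a tiny made-up file standing in for `brown_ft.vec`, could look like this:

```python
import io


def read_vec(handle):
    """Parse word2vec text format into (dim, {word: [floats]})."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(count):
        # rstrip() drops the newline and any trailing space after the last value
        parts = handle.readline().rstrip().split(' ')
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError('expected {} values for {!r}'.format(dim, word))
        vectors[word] = values
    return dim, vectors


# Made-up content: 2 words, 3 dimensions.
sample = io.StringIO(u'2 3\nthe 0.1 0.2 0.3 \nof -0.4 0.5 0.6 \n')
dim, vectors = read_vec(sample)
print(dim, sorted(vectors))
```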



# Accuracy for FastText (without char n-grams):

In [45]:
print('\nLoading FastText embeddings without char n-grams')
ft_model = gensim.models.KeyedVectors.load_word2vec_format(MODELS + 'brown_ft_no_char_ng.vec')
print('Accuracy for FastText (without char n-grams):')
accuracies.append(print_accuracy(ft_model, word_analogies_file))


Loading FastText embeddings without char n-grams
Accuracy for FastText (without char n-grams):
Evaluating...


Semantic: 43/813, Accuracy: 5.29%
Syntactic: 119/5255, Accuracy: 2.26%



# Summary

For syntactic tasks, `fastText` with character-level n-grams performs significantly better than Gensim's `word2vec` (50.03% vs. 1.85% accuracy).

For semantic tasks, the results are close for all three models: 3.69% for `word2vec`, 5.78% for `fastText` with n-grams, and 5.29% for `fastText` without n-grams.
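
One intuition for the syntactic gain is that morphological variants share many character n-grams, so their vectors share parameters. The sketch below measures that overlap with a Jaccard score. The helper names and the word choices are illustrative assumptions, not taken from fastText's implementation.

```python
def ngram_set(word, minn=3, maxn=6):
    """Set of character n-grams of `word`, with '<' and '>' boundary markers."""
    token = '<' + word + '>'
    return {token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)}


def ngram_overlap(w1, w2):
    """Jaccard similarity between the n-gram sets of two words."""
    a, b = ngram_set(w1), ngram_set(w2)
    return len(a & b) / float(len(a | b))


for pair in [('quick', 'quickly'), ('quickly', 'slowly'), ('quick', 'slowly')]:
    print(pair, round(ngram_overlap(*pair), 2))
```

`quick` and `quickly` share the whole `<quick` prefix, `quickly` and `slowly` share only the `ly>` suffix, and `quick` and `slowly` share nothing, which is the kind of structure the syntactic analogy questions test.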