Word embedding is the collective name for a set of language modeling and feature learning techniques that map the individual words of a text into a real-valued vector space of fixed dimension. It is also often called distributed word representation.

Why do we need to encode words?
Many machine learning algorithms and almost all deep learning architectures cannot process strings or plain text in raw form. They require numbers as inputs to perform any task, be it classification, regression, and so on.

One way to encode words is one-hot encoding. A one-hot vector has 0 in every element except one, which is 1. The main drawback of this representation is that the vector dimension equals the vocabulary size, so the approach consumes too much memory. In addition, it carries no information about similarity between words: every pair of distinct one-hot vectors is equally far apart, so comparing them tells us nothing.
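
To make both drawbacks concrete, here is a minimal sketch using a toy three-word vocabulary (the words are made up for illustration and are not taken from SQuAD). It shows that the vector length grows with the vocabulary and that any two distinct words have a dot product of zero.

```python
# Toy vocabulary; the words are illustrative only.
vocabulary = ["king", "queen", "apple"]

def one_hot(word, vocabulary):
    """Return a list with a single 1 at the index of `word`."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def dot(u, v):
    """Plain dot product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

king, queen, apple = (one_hot(w, vocabulary) for w in vocabulary)

# The vector length equals the vocabulary size.
print(len(king))
# Every pair of distinct words has dot product 0, so "king" is
# exactly as (dis)similar to "queen" as it is to "apple".
print(dot(king, queen), dot(king, apple))
```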

There is a better way to encode words: semantic word embedding. Vector representations of words (word embeddings) try to capture relationships between words as distances or angles. The simplest methods use word vectors that explicitly represent co-occurrence statistics. Neural network language models offer another way to construct an embedding: the word vector is simply the neural network's internal representation of the word.
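
As an illustration of the co-occurrence idea, the sketch below builds count vectors from a tiny made-up corpus and compares words by the cosine of the angle between their vectors. The corpus, the window size and the helper names are assumptions chosen for this example only; real methods work on far larger corpora and usually reweight the raw counts.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative corpus (made up for this sketch).
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

def cooccurrence_vectors(sentences, window=1):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(corpus)
# Words that appear in similar contexts get a higher similarity.
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["milk"]))
```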

In this lab notebook, we will compare standard pretrained word embeddings (GloVe, Word2Vec, fastText, ELMo) on SQuAD data. In particular, we are interested in the word coverage, that is, the percentage of SQuAD words that are present in the word embedding vocabulary. We will analyse the impact of different sentence normalization methods on the word coverage level.
Every word embedding algorithm will be tested on all of its openly available pretrained models.

Every pretrained model will be described by the following set of features:
- name: a brief description of the source of tokens
- number of words in vocabulary: the number of unique tokens in the vocabulary
- missing words (missing words percent): the number (percentage) of words from the SQuAD vocabulary that are absent from the pretrained model's vocabulary
- first loading time: the time needed to load the pretrained model for the first time

###  Testing sentence processing methods

As a first step in testing word embedding methods, we should divide the SQuAD data into separate words.

In [1]:
import sys
sys.path.append('../source_code/')
import utils as utils
import nltk
import spacy
spacy_nlp = spacy.load('en')
from textblob import TextBlob
from collections import Counter
import pandas as pd
import time
import string
import numpy as np
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tag import StanfordNERTagger
import os
java_path = "C:/ProgramData/Oracle/Java/javapath/java.exe"
os.environ['JAVAHOME'] = java_path
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
%load_ext autoreload
%autoreload 2



[nltk_data] Downloading package words to
[nltk_data]     C:\Users\ForMaxwell\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ForMaxwell\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ForMaxwell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\ForMaxwell\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


In [28]:
#load SQuAD training set
training_set = pd.read_json("../data/squad/train-v2.0.json", encoding='utf-8')

#### Test named entities recognition

To decide correctly which words may be lowercased and which must keep their capitalization, we will use named entity recognition (NER) tools.

Approaches to named entity recognition are described here:
https://towardsdatascience.com/named-entity-recognition-3fad3f53c91e

In [3]:
#test data
text =  utils.get_squad_sentences(training_set[10:11])[0] + \
        utils.get_squad_sentences(training_set[10:11])[2] + \
        utils.get_squad_sentences(training_set[2:3])[0] + \
        utils.get_squad_sentences(training_set[2:3])[2] + \
        utils.get_squad_sentences(training_set[0:1])[0]
print(text)

#entities that must keep their capitalization
named_entities_list = ['Kanye', 'Omari', 'West', 'American', 'Chicago', 'Roc-A-Fella','Records','Jay-Z','Alicia', \
                       'Keys', 'Tibet', 'Ming', 'China', 'Mainland' , 'Chinese', 'Wang' , 'Jiawei', 'Nyima', 'Gyaincain', \
                       'Tibetan', 'Tibetans', 'Beyoncé', 'Giselle' , 'Knowles-Carter', 'September'];

words_count = len(text.split())
print('total test word count: {0}'.format(words_count))

NE_df = pd.DataFrame(columns=['name', 'FPR', 'Recall(TPR)', 'FNR', 'Precision'])
NE_df_number = 0

Kanye Omari West (/ˈkɑːnjeɪ/; born June 8, 1977) is an American hip hop recording artist, record producer, rapper, fashion designer, and entrepreneur.Raised in Chicago, West briefly attended art school before becoming known as a producer for Roc-A-Fella Records in the early 2000s, producing hit singles for artists such as Jay-Z and Alicia Keys.The exact nature of relations between Tibet and the Ming dynasty of China (1368–1644) is unclear.Some Mainland Chinese scholars, such as Wang Jiawei and Nyima Gyaincain, assert that the Ming dynasty had unquestioned sovereignty over Tibet, pointing to the Ming court's issuing of various titles to Tibetan leaders, Tibetans' full acceptance of these titles, and a renewal process for successors of these titles that involved traveling to the Ming capital.Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.
total test word count: 139


In [4]:
#using nltk ne_chunk
from nltk import Tree, pos_tag, ne_chunk
tagged_sent = ne_chunk(pos_tag(word_tokenize(text)))
named_entities = []

from nltk.sem.relextract import NE_CLASSES
ace_tags = NE_CLASSES['ace']

for node in tagged_sent:
    if isinstance(node, Tree) and node.label() in ace_tags:
        words, tags = zip(*node.leaves())
        named_entities += words

named_entities = utils.remove_duplicates(named_entities)
print(named_entities)   

utils.get_accuracy(named_entities, named_entities_list, words_count, NE_df, NE_df_number, 'nltk')
NE_df_number += 1

['Chicago', 'Kanye', 'Tibet', 'Jiawei', 'Nyima', 'Chinese', 'Wang', 'Ming', 'Omari', 'Tibetans', 'American', 'China', 'Mainland', 'Gyaincain', 'Tibetan', 'West', 'Alicia', 'Giselle']
FPR: 0.000000
Recall(TPR): 0.720000
FNR: 0.280000
Precision: 1.000000


In [5]:
#using Stanford NER
ner_directory = 'C:/MRC/squad/stanford-ner-2018-02-27/'

stanford_ner_tagger = StanfordNERTagger(
    ner_directory + 'classifiers/english.all.3class.distsim.crf.ser.gz',
    ner_directory + 'stanford-ner-3.9.1.jar'
)

results = stanford_ner_tagger.tag(text.split())

named_entities = []
for result in results:
    tag_value = result[0]
    tag_type = result[1]
    if tag_type != 'O':
        named_entities.append(tag_value)
        
print(named_entities)

utils.get_accuracy(named_entities, named_entities_list, words_count, NE_df, NE_df_number, 'Stanford NER')
NE_df_number += 1

The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
  super(StanfordNERTagger, self).__init__(*args, **kwargs)


['Kanye', 'Omari', 'West', 'Chicago,', 'West', 'Roc-A-Fella', 'Records', 'Jay-Z', 'Alicia', 'Tibet', 'China', 'Wang', 'Jiawei']
FPR: 0.008772
Recall(TPR): 0.480000
FNR: 0.520000
Precision: 0.923077


In [6]:
#using spacy
document = spacy_nlp(text)

named_entities = []
NE_taggs = ['PERSON' , 'FAC', 'ORG', 'NORP', 'GPE', 'LOC', 'PRODUCT', 'LAW', 'LANGUAGE'] 
for element in document.ents:
    if (element.label_ in NE_taggs):
        named_entities += ([word.strip(string.punctuation) for word in element.text.split()])
    
print(named_entities)

utils.get_accuracy(named_entities, named_entities_list, words_count, NE_df, NE_df_number, 'spacy')
NE_df_number += 1

['Kanye', 'Omari', 'West', 'American', 'Chicago', 'Roc', 'Jay-Z', 'Alicia', 'Keys', 'Tibet', 'China', 'Chinese', 'Wang', 'Jiawei', 'Nyima', 'Gyaincain', 'Tibet', 'Ming', 'Tibetan', 'Tibetans', 'Ming', 'Giselle', 'Knowles-Carter', 'American']
FPR: 0.008772
Recall(TPR): 0.920000
FNR: 0.080000
Precision: 0.958333


In [7]:
NE_df

| | name | FPR | Recall(TPR) | FNR | Precision |
|---|---|---|---|---|---|
| 0 | nltk | 0.0 | 0.72 | 0.28 | 1.0 |
| 1 | Stanford NER | 0.008772 | 0.48 | 0.52 | 0.923077 |
| 2 | spacy | 0.008772 | 0.92 | 0.08 | 0.958333 |


#### Conclusion

spaCy showed the best result on the test text: it has the highest recall (0.92), and therefore the lowest FNR (0.08), together with a fairly high precision (0.96).
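
The helper `utils.get_accuracy` is not shown in this notebook. The sketch below is one way the four metrics could be computed, under the assumption that every predicted word is counted (repeats included) and that the negatives are all words of the text outside the reference list. The function name `ner_metrics` is hypothetical. With the Stanford NER output above, these assumptions reproduce the reported figures.

```python
def ner_metrics(predicted, gold, total_words):
    """Return FPR, recall, FNR and precision for a list of predicted entity words.

    Assumption: every predicted word counts, including repeats, and the
    negatives are all words of the text that are not in the gold list.
    """
    gold = set(gold)
    true_pos = sum(1 for word in predicted if word in gold)
    false_pos = len(predicted) - true_pos
    recall = true_pos / len(gold)
    return {
        "FPR": false_pos / (total_words - len(gold)),
        "Recall(TPR)": recall,
        "FNR": 1 - recall,
        "Precision": true_pos / len(predicted) if predicted else 0.0,
    }

# Reference list and Stanford NER predictions copied from the cells above.
gold_entities = ['Kanye', 'Omari', 'West', 'American', 'Chicago', 'Roc-A-Fella',
                 'Records', 'Jay-Z', 'Alicia', 'Keys', 'Tibet', 'Ming', 'China',
                 'Mainland', 'Chinese', 'Wang', 'Jiawei', 'Nyima', 'Gyaincain',
                 'Tibetan', 'Tibetans', 'Beyoncé', 'Giselle', 'Knowles-Carter',
                 'September']
stanford_predicted = ['Kanye', 'Omari', 'West', 'Chicago,', 'West', 'Roc-A-Fella',
                      'Records', 'Jay-Z', 'Alicia', 'Tibet', 'China', 'Wang', 'Jiawei']

metrics = ner_metrics(stanford_predicted, gold_entities, total_words=139)
for name, value in metrics.items():
    print('{}: {:.6f}'.format(name, value))
```

If this reading is right, repeated predictions such as the second 'Tibet', 'Ming' and 'American' in the spaCy output each count as a hit, so recall computed over unique words would be somewhat lower than the value in the table.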

#### Test splitting text into separate sentences

In [8]:
print(text)

Kanye Omari West (/ˈkɑːnjeɪ/; born June 8, 1977) is an American hip hop recording artist, record producer, rapper, fashion designer, and entrepreneur.Raised in Chicago, West briefly attended art school before becoming known as a producer for Roc-A-Fella Records in the early 2000s, producing hit singles for artists such as Jay-Z and Alicia Keys.The exact nature of relations between Tibet and the Ming dynasty of China (1368–1644) is unclear.Some Mainland Chinese scholars, such as Wang Jiawei and Nyima Gyaincain, assert that the Ming dynasty had unquestioned sovereignty over Tibet, pointing to the Ming court's issuing of various titles to Tibetan leaders, Tibetans' full acceptance of these titles, and a renewal process for successors of these titles that involved traveling to the Ming capital.Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.


In [9]:
#regex tokenizer
import re
# raw strings keep regex escapes such as \s intact
alphabets= r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

In [10]:
#textBlob
sentences = TextBlob(text).sentences
print('text blob found %i sentences in text' % len(sentences))

#nltk tokenizer
from nltk import tokenize
sentences = tokenize.sent_tokenize(text)
print('nltk tokenizer found %i sentences in text' % len(sentences))

#spacy 
tokens = spacy_nlp(text)
sentences = []
for sent in tokens.sents:
    sentences.append(sent.text.strip())
    
print('spacy tokenizer found %i sentences in text' % len(sentences))

sentences = split_into_sentences(text)
print('regex tokenizer found %i sentences in text' % len(sentences))

text blob found 1 sentences in text
nltk tokenizer found 1 sentences in text
spacy tokenizer found 6 sentences in text
regex tokenizer found 5 sentences in text


#### Conclusion

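The correct number of sentences in the text is 5, so the regex tokenizer gives the best result and we will use it. TextBlob and nltk each return a single sentence, presumably because the sentences in the test text are joined without a space after the full stop.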
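
The effect of the missing space can be reproduced with the standard library alone. The sketch below is not what TextBlob or nltk do internally; it only contrasts a splitter that requires whitespace after the terminator with the marker-based approach used in the final steps of `split_into_sentences`. The sample string is shortened from the test text.

```python
import re

# Two sentences joined without a space, as in the test text above.
sample = "He is an entrepreneur.Raised in Chicago, West attended art school."

# A splitter that expects whitespace after the terminator sees one sentence.
whitespace_split = re.split(r"(?<=[.!?])\s+", sample)

# Marking every terminator and splitting on the marker separates the
# two sentences regardless of spacing.
marked = re.sub(r"([.!?])", r"\1<stop>", sample)
marker_split = [s.strip() for s in marked.split("<stop>") if s.strip()]

print(len(whitespace_split), len(marker_split))
print(marker_split)
```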

#### Test splitting sentence into separate words

In [11]:
print(sentences)

['Kanye Omari West (/ˈkɑːnjeɪ/; born June 8, 1977) is an American hip hop recording artist, record producer, rapper, fashion designer, and entrepreneur.', 'Raised in Chicago, West briefly attended art school before becoming known as a producer for Roc-A-Fella Records in the early 2000s, producing hit singles for artists such as Jay-Z and Alicia Keys.', 'The exact nature of relations between Tibet and the Ming dynasty of China (1368–1644) is unclear.', "Some Mainland Chinese scholars, such as Wang Jiawei and Nyima Gyaincain, assert that the Ming dynasty had unquestioned sovereignty over Tibet, pointing to the Ming court's issuing of various titles to Tibetan leaders, Tibetans' full acceptance of these titles, and a renewal process for successors of these titles that involved traveling to the Ming capital.", 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.']


In [12]:
#nltk
nltk_words = []
for i in range(len(sentences)):
    words_tmp = [word.strip(string.punctuation) for word in nltk.word_tokenize(sentences[i])]
    nltk_words += list(filter(None, words_tmp))
print('nltk found %i words' % len(nltk_words))

nltk found 144 words


In [13]:
#string
string_words = []
for i in range(len(sentences)):
    string_words += [word.strip(string.punctuation) for word in sentences[i].split()]
print('string found %i words' % len(string_words))

string found 143 words


In [14]:
#shlex 
import shlex

shlex_words = []
for i in range(len(sentences)):
    shlex_words += shlex.split(sentences[i])
print('shlex found %i words' % len(shlex_words))

shlex found 135 words


In [15]:
#sta
import sta

sta_words = []
for i in range(len(sentences)):
    sta_words += sta(sentences[i])
print('sta found %i words' % len(sta_words))

sta found 143 words


In [16]:
#textblob
textblob_words = []
for i in range(len(sentences)):
    textblob_words += TextBlob(sentences[i]).words
print('textblob found %i words' % len(textblob_words))

textblob found 144 words


In [17]:
#compare textblob_words and sta_words
print(sentences)
from difflib import SequenceMatcher
for tag, i, j, k, l in SequenceMatcher(None, textblob_words, sta_words).get_opcodes():
    if tag == 'equal': print('both have', textblob_words[i:j])
    if tag in ('delete', 'replace'): print('  1st has', textblob_words[i:j])
    if tag in ('insert', 'replace'): print('  2nd has', sta_words[k:l])

['Kanye Omari West (/ˈkɑːnjeɪ/; born June 8, 1977) is an American hip hop recording artist, record producer, rapper, fashion designer, and entrepreneur.', 'Raised in Chicago, West briefly attended art school before becoming known as a producer for Roc-A-Fella Records in the early 2000s, producing hit singles for artists such as Jay-Z and Alicia Keys.', 'The exact nature of relations between Tibet and the Ming dynasty of China (1368–1644) is unclear.', "Some Mainland Chinese scholars, such as Wang Jiawei and Nyima Gyaincain, assert that the Ming dynasty had unquestioned sovereignty over Tibet, pointing to the Ming court's issuing of various titles to Tibetan leaders, Tibetans' full acceptance of these titles, and a renewal process for successors of these titles that involved traveling to the Ming capital.", 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.']
both have ['Kanye', 'Omari', 'We

As we can see, textblob splits better than sta because textblob excludes useless punctuation.

In [18]:
#compare textblob_words and shlex_words
print(sentences)
from difflib import SequenceMatcher
for tag, i, j, k, l in SequenceMatcher(None, textblob_words, shlex_words).get_opcodes():
    if tag == 'equal': print('both have', textblob_words[i:j])
    if tag in ('delete', 'replace'): print('  1st has', textblob_words[i:j])
    if tag in ('insert', 'replace'): print('  2nd has', shlex_words[k:l])

['Kanye Omari West (/ˈkɑːnjeɪ/; born June 8, 1977) is an American hip hop recording artist, record producer, rapper, fashion designer, and entrepreneur.', 'Raised in Chicago, West briefly attended art school before becoming known as a producer for Roc-A-Fella Records in the early 2000s, producing hit singles for artists such as Jay-Z and Alicia Keys.', 'The exact nature of relations between Tibet and the Ming dynasty of China (1368–1644) is unclear.', "Some Mainland Chinese scholars, such as Wang Jiawei and Nyima Gyaincain, assert that the Ming dynasty had unquestioned sovereignty over Tibet, pointing to the Ming court's issuing of various titles to Tibetan leaders, Tibetans' full acceptance of these titles, and a renewal process for successors of these titles that involved traveling to the Ming capital.", 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.']
both have ['Kanye', 'Omari', 'We

As we can see, textblob splits better than shlex because textblob excludes useless punctuation.

In [19]:
#compare textblob_words and string_words
print(sentences)
from difflib import SequenceMatcher
for tag, i, j, k, l in SequenceMatcher(None, textblob_words, string_words).get_opcodes():
    if tag == 'equal': print('both have', textblob_words[i:j])
    if tag in ('delete', 'replace'): print('  1st has', textblob_words[i:j])
    if tag in ('insert', 'replace'): print('  2nd has', string_words[k:l])

['Kanye Omari West (/ˈkɑːnjeɪ/; born June 8, 1977) is an American hip hop recording artist, record producer, rapper, fashion designer, and entrepreneur.', 'Raised in Chicago, West briefly attended art school before becoming known as a producer for Roc-A-Fella Records in the early 2000s, producing hit singles for artists such as Jay-Z and Alicia Keys.', 'The exact nature of relations between Tibet and the Ming dynasty of China (1368–1644) is unclear.', "Some Mainland Chinese scholars, such as Wang Jiawei and Nyima Gyaincain, assert that the Ming dynasty had unquestioned sovereignty over Tibet, pointing to the Ming court's issuing of various titles to Tibetan leaders, Tibetans' full acceptance of these titles, and a renewal process for successors of these titles that involved traveling to the Ming capital.", 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.']
both have ['Kanye', 'Omari', 'We

As we can see, the string split differs from textblob in that it keeps the word "court's" together instead of separating it.

In [20]:
#compare string_words and nltk_words
print(sentences)
from difflib import SequenceMatcher
for tag, i, j, k, l in SequenceMatcher(None, string_words, nltk_words).get_opcodes():
    if tag == 'equal': print('both have', string_words[i:j])
    if tag in ('delete', 'replace'): print('  1st has', string_words[i:j])
    if tag in ('insert', 'replace'): print('  2nd has', nltk_words[k:l])

['Kanye Omari West (/ˈkɑːnjeɪ/; born June 8, 1977) is an American hip hop recording artist, record producer, rapper, fashion designer, and entrepreneur.', 'Raised in Chicago, West briefly attended art school before becoming known as a producer for Roc-A-Fella Records in the early 2000s, producing hit singles for artists such as Jay-Z and Alicia Keys.', 'The exact nature of relations between Tibet and the Ming dynasty of China (1368–1644) is unclear.', "Some Mainland Chinese scholars, such as Wang Jiawei and Nyima Gyaincain, assert that the Ming dynasty had unquestioned sovereignty over Tibet, pointing to the Ming court's issuing of various titles to Tibetan leaders, Tibetans' full acceptance of these titles, and a renewal process for successors of these titles that involved traveling to the Ming capital.", 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress.']
both have ['Kanye', 'Omari', 'We

As we can see, the only difference between the nltk and string splitters is how they split the word "court's".

#### Conclusion

The best way to split a sentence into separate words is to use nltk. If the "'s" is separated, a neural network trained on this dictionary can learn its semantics, namely that it indicates possession. If "court's" stays together, then every noun in the English dictionary needs a second token for its possessive form, like apple -> apple's. Moreover, if words like apple's never occur in the training data, a neural network trained on this dictionary will not generalize to them.
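
The vocabulary argument can be checked on a toy example. The sketch below uses a small regular expression as a stand-in for the possessive handling of nltk (it is not the nltk tokenizer itself), and the sentences are made up for illustration.

```python
import re

# Made-up sentences in which nouns appear with and without a possessive.
sentences = ["the court's ruling", "the court ruled", "the apple's core", "an apple"]

def whole_word_tokens(sentence):
    # Keep "court's" as one token (plain whitespace split).
    return sentence.split()

def clitic_tokens(sentence):
    # Split the possessive marker off as its own "'s" token,
    # a simplified stand-in for what nltk does with possessives.
    return re.findall(r"'s|[^\s']+", sentence)

def vocabulary(tokenizer):
    return {token for s in sentences for token in tokenizer(s)}

whole_vocab = vocabulary(whole_word_tokens)
clitic_vocab = vocabulary(clitic_tokens)

# Each possessive form is a separate entry in the first vocabulary,
# while the second one shares a single "'s" token across all nouns.
print(len(whole_vocab), sorted(whole_vocab))
print(len(clitic_vocab), sorted(clitic_vocab))

# An unseen possessive still maps onto known tokens when "'s" is split off.
print(clitic_tokens("the pear's core"))
```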

#### Checking nltk string splitting features, improving results

In [21]:
#new test text
text =  utils.get_squad_sentences(training_set[0:1])[28:29] + \
        utils.get_squad_sentences(training_set[0:1])[28:29] + \
        utils.get_squad_sentences(training_set[3:4])[28:29] + \
        utils.get_squad_sentences(training_set[7:8])[0:1] + \
        utils.get_squad_sentences(training_set[10:11])[10:11] + \
        utils.get_squad_sentences(training_set[0:1])[0:2]

print('Text for testing nltk string split results: ', text)   

nltk_words = []

for i in range(len(text)):
    words_tmp = [word.strip(string.punctuation) for word in nltk.word_tokenize(text[i])]
    nltk_words += list(filter(None, words_tmp))
print('nltk found %i words: ' % len(nltk_words), nltk_words)

Text for testing nltk string split results:  ['Beyoncé\'s interest in music and performing continued after winning a school talent show at age seven, singing John Lennon\'s "Imagine" to beat 15/16-year-olds.', 'Beyoncé\'s interest in music and performing continued after winning a school talent show at age seven, singing John Lennon\'s "Imagine" to beat 15/16-year-olds.', "Later iPods switched fonts again to Podium Sans—a font similar to Apple's corporate font, Myriad.", 'New York—often called New York City or the City of New York to distinguish it from the State of New York, of which it is a part—is the most populous city in the United States and the center of the New York metropolitan area, the premier gateway for legal immigration to the United States and one of the most populous urban agglomerations in the world.', 'Three of his albums rank on Rolling Stone\'s 2012 "500 Greatest Albums of All Time" list; two of his albums feature at first and eighth, respectively, in Pitchfork Media

Conclusion. Some words that contain non-letter characters were not split: '15/16-year-olds', 'Sans—a', 'York—often', 'part—is', '2010–2014'. None of these tokens would be covered by pretrained word embeddings. To avoid this behavior, let's try using a regex to split words in conjunction with nltk.

In [22]:
import re
nltk_words_re = []
non_letter_regex = r'[\(\)\[\]:;–—/\\]+'
for i in range(len(text)):
    words_tmp = [word.strip(string.punctuation) for word in nltk.word_tokenize(text[i])]
    words_tmp_regex = []
    for j in range(len(words_tmp)):
        words_tmp_regex += re.split(non_letter_regex, words_tmp[j])
    nltk_words_re += list(filter(None, words_tmp_regex))
print('nltk+regex founded %i words: ' % len(nltk_words_re), nltk_words_re)

nltk+regex founded 227 words:  ['Beyoncé', 's', 'interest', 'in', 'music', 'and', 'performing', 'continued', 'after', 'winning', 'a', 'school', 'talent', 'show', 'at', 'age', 'seven', 'singing', 'John', 'Lennon', 's', 'Imagine', 'to', 'beat', '15', '16-year-olds', 'Beyoncé', 's', 'interest', 'in', 'music', 'and', 'performing', 'continued', 'after', 'winning', 'a', 'school', 'talent', 'show', 'at', 'age', 'seven', 'singing', 'John', 'Lennon', 's', 'Imagine', 'to', 'beat', '15', '16-year-olds', 'Later', 'iPods', 'switched', 'fonts', 'again', 'to', 'Podium', 'Sans', 'a', 'font', 'similar', 'to', 'Apple', 's', 'corporate', 'font', 'Myriad', 'New', 'York', 'often', 'called', 'New', 'York', 'City', 'or', 'the', 'City', 'of', 'New', 'York', 'to', 'distinguish', 'it', 'from', 'the', 'State', 'of', 'New', 'York', 'of', 'which', 'it', 'is', 'a', 'part', 'is', 'the', 'most', 'populous', 'city', 'in', 'the', 'United', 'States', 'and', 'the', 'center', 'of', 'the', 'New', 'York', 'metropolitan', 'a

In [23]:
from difflib import SequenceMatcher
for tag, i, j, k, l in SequenceMatcher(None, nltk_words, nltk_words_re).get_opcodes():
    if tag == 'equal': print('both have', nltk_words[i:j])
    if tag in ('delete', 'replace'): print('  1st has', nltk_words[i:j])
    if tag in ('insert', 'replace'): print('  2nd has', nltk_words_re[k:l])

both have ['Beyoncé', 's', 'interest', 'in', 'music', 'and', 'performing', 'continued', 'after', 'winning', 'a', 'school', 'talent', 'show', 'at', 'age', 'seven', 'singing', 'John', 'Lennon', 's', 'Imagine', 'to', 'beat']
  1st has ['15/16-year-olds']
  2nd has ['15', '16-year-olds']
both have ['Beyoncé', 's', 'interest', 'in', 'music', 'and', 'performing', 'continued', 'after', 'winning', 'a', 'school', 'talent', 'show', 'at', 'age', 'seven', 'singing', 'John', 'Lennon', 's', 'Imagine', 'to', 'beat']
  1st has ['15/16-year-olds']
  2nd has ['15', '16-year-olds']
both have ['Later', 'iPods', 'switched', 'fonts', 'again', 'to', 'Podium']
  1st has ['Sans—a']
  2nd has ['Sans', 'a']
both have ['font', 'similar', 'to', 'Apple', 's', 'corporate', 'font', 'Myriad', 'New']
  1st has ['York—often']
  2nd has ['York', 'often']
both have ['called', 'New', 'York', 'City', 'or', 'the', 'City', 'of', 'New', 'York', 'to', 'distinguish', 'it', 'from', 'the', 'State', 'of', 'New', 'York', 'of', 'whic

Conclusion. For splitting sentences into words, we will use nltk + regex. This approach divides a sentence correctly even if it contains uncommon separators such as [], (), / and so on.
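
The regex step can also be looked at in isolation, without nltk. The sketch below applies the same separator class to the problematic tokens listed earlier; the helper name `split_on_separators` is introduced here for illustration only.

```python
import re

# Same separator class as in the cell above, written as a raw string.
non_letter_regex = r"[\(\)\[\]:;–—/\\]+"

def split_on_separators(tokens):
    """Split each token on brackets, colons, dashes and slashes; drop empties."""
    pieces = []
    for token in tokens:
        pieces += [p for p in re.split(non_letter_regex, token) if p]
    return pieces

problem_tokens = ["15/16-year-olds", "Sans—a", "York—often", "part—is", "2010–2014"]
print(split_on_separators(problem_tokens))
```

The hyphen is deliberately not part of the separator class, so compounds such as '16-year-olds' and 'best-selling' stay whole.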

#### Split SQuAD data to separate words

As a first step, we will split the text into separate sentences with the regex tokenizer.

In [29]:
sentences = utils.get_squad_sentences(training_set)
print("The number of sentences in SQuAD context and question sections is : {}".format(len(sentences)))

The number of sentences in SQuAD context and question sections is : 185169


In [30]:
start_time = time.time()
squad_words_and_freqs = utils.get_words_and_freqs(sentences)
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

100%|████████████████████████████████| 185169/185169 [1:28:51<00:00, 34.73it/s]


Done. Time: 5335.73 sec.


In [31]:
print("The vocabulary contains {} unique tokens".format(len(squad_words_and_freqs.keys())))

The vocabulary contains 109584 unique tokens


In [99]:
squad_words_and_freqs_uncased = utils.get_uncased_word_map(squad_words_and_freqs)
print("The vocabulary contains {} unique uncased words".format(len(squad_words_and_freqs_uncased.keys())))

The vocabulary contains 95465 unique uncased words
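
The implementation of `utils.get_uncased_word_map` is not shown here. A minimal sketch of what such a helper might do, assuming it merges the frequencies of words that differ only in letter case, could look like this (the function name and example counts are hypothetical):

```python
from collections import Counter

def uncased_word_map(word_freqs):
    """Merge the counts of words that differ only in letter case."""
    merged = Counter()
    for word, freq in word_freqs.items():
        merged[word.lower()] += freq
    return dict(merged)

# Made-up frequencies for illustration.
example = {"Child": 3, "child": 5, "Texas": 2}
print(uncased_word_map(example))
```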


In [32]:
print('200 separate words from SQuAD')
utils.get_word_list(squad_words_and_freqs)[0:200]

200 separate words from SQuAD


['Giselle',
 'Knowles-Carter',
 'American',
 'beyoncé',
 'biːˈjɒnseɪ',
 'bee-yon-say',
 'born',
 'september',
 '4',
 '1981',
 'is',
 'an',
 'singer',
 'songwriter',
 'record',
 'producer',
 'and',
 'actress',
 'Houston',
 'Texas',
 'R',
 '',
 'B',
 'Child',
 'raised',
 'in',
 'she',
 'performed',
 'various',
 'singing',
 'dancing',
 'competitions',
 'as',
 'a',
 'child',
 'rose',
 'to',
 'fame',
 'the',
 'late',
 '1990s',
 'lead',
 'of',
 'girl-group',
 'destiny',
 's',
 'Mathew',
 'Knowles',
 'managed',
 'by',
 'her',
 'father',
 'group',
 'became',
 'one',
 'world',
 'best-selling',
 'girl',
 'groups',
 'all',
 'time',
 'Grammy',
 'Awards',
 'their',
 'hiatus',
 'saw',
 'release',
 'debut',
 'album',
 'dangerously',
 'love',
 '2003',
 'which',
 'established',
 'solo',
 'artist',
 'worldwide',
 'earned',
 'five',
 'featured',
 'billboard',
 'hot',
 '100',
 'number-one',
 'singles',
 'crazy',
 'baby',
 'boy',
 "B'Day",
 'following',
 'disbandment',
 'june',
 '2005',
 'released',
 'seco

In [101]:
#keep only words written with latin letters
squad_words_and_freqs_eng = utils.remove_non_latin_words_from_map(squad_words_and_freqs)

In [34]:
print('200 separate words with latin letters from SQuAD')
utils.get_word_list(squad_words_and_freqs_eng)[0:200]

200 separate words with latin letters from SQuAD


['Giselle',
 'Knowles-Carter',
 'American',
 'bee-yon-say',
 'born',
 'september',
 '4',
 '1981',
 'is',
 'an',
 'singer',
 'songwriter',
 'record',
 'producer',
 'and',
 'actress',
 'Houston',
 'Texas',
 'R',
 '',
 'B',
 'Child',
 'raised',
 'in',
 'she',
 'performed',
 'various',
 'singing',
 'dancing',
 'competitions',
 'as',
 'a',
 'child',
 'rose',
 'to',
 'fame',
 'the',
 'late',
 '1990s',
 'lead',
 'of',
 'girl-group',
 'destiny',
 's',
 'Mathew',
 'Knowles',
 'managed',
 'by',
 'her',
 'father',
 'group',
 'became',
 'one',
 'world',
 'best-selling',
 'girl',
 'groups',
 'all',
 'time',
 'Grammy',
 'Awards',
 'their',
 'hiatus',
 'saw',
 'release',
 'debut',
 'album',
 'dangerously',
 'love',
 '2003',
 'which',
 'established',
 'solo',
 'artist',
 'worldwide',
 'earned',
 'five',
 'featured',
 'billboard',
 'hot',
 '100',
 'number-one',
 'singles',
 'crazy',
 'baby',
 'boy',
 "B'Day",
 'following',
 'disbandment',
 'june',
 '2005',
 'released',
 'second',
 '2006',
 'contained',

In [35]:
print("The vocabulary contains {} unique words with latin letters".format(len(squad_words_and_freqs_eng.keys())))

The vocabulary contains 104455 unique words with latin letters
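
Comparing the two listings above, 'beyoncé' and 'biːˈjɒnseɪ' are dropped while digits, hyphenated words and "B'Day" stay, which is consistent with an ASCII-only filter. Since `utils.remove_non_latin_words_from_map` is not shown, the sketch below is only an assumption about its behavior; the function name is hypothetical and `str.isascii` requires Python 3.7 or later.

```python
def remove_non_ascii_words(word_freqs):
    """Keep only the entries whose word consists of ASCII characters."""
    return {word: freq for word, freq in word_freqs.items() if word.isascii()}

# A few words from the listing above with made-up frequencies.
example = {"Giselle": 1, "beyoncé": 4, "biːˈjɒnseɪ": 1, "bee-yon-say": 1, "1981": 1}
print(list(remove_non_ascii_words(example)))
```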


In [103]:
squad_words_and_freqs_eng_uncased = utils.remove_non_latin_words_from_map(squad_words_and_freqs_uncased)

In [104]:
utils.save_dictionary_to_excel(squad_words_and_freqs_eng_uncased, "../data/squad/word_freqs_eng_uncased1.xlsx")

In [38]:
print("The vocabulary contains {} unique uncased words with latin letters".format(len(squad_words_and_freqs_eng_uncased.keys())))

The vocabulary contains 90601 unique uncased words with latin letters


In [37]:
#save all results to excel
utils.save_dictionary_to_excel(squad_words_and_freqs, "../data/squad/word_freqs.xlsx")
utils.save_dictionary_to_excel(squad_words_and_freqs_uncased, "../data/squad/word_freqs_uncased.xlsx")
utils.save_dictionary_to_excel(squad_words_and_freqs_eng, "../data/squad/word_freqs_eng.xlsx")
utils.save_dictionary_to_excel(squad_words_and_freqs_eng_uncased, "../data/squad/word_freqs_eng_uncased.xlsx")

# Embeddings

Embeddings is a python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file to query for embeddings, embeddings is backed by a database and fast to load and query.

Upon first use, the embeddings are downloaded to disk in the form of a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent queries are made directly against the database.

https://github.com/vzhong/embeddings
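
To illustrate why a database-backed store is fast to open and query, here is a minimal sketch of the general idea with the standard `sqlite3` module. The table layout, the JSON encoding and the toy vectors are assumptions made for this example and do not reflect the package's actual schema.

```python
import json
import sqlite3

# In-memory database for the sketch; a real cache would live on disk.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE embeddings (word TEXT PRIMARY KEY, vector TEXT)")

# Made-up 3-dimensional vectors, stored as JSON text.
toy_vectors = {"canada": [0.1, 0.2, 0.3], "toronto": [0.2, 0.1, 0.4]}
connection.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [(word, json.dumps(vector)) for word, vector in toy_vectors.items()],
)

def lookup(word):
    """Return the vector for `word`, or None if the word is out of vocabulary."""
    row = connection.execute(
        "SELECT vector FROM embeddings WHERE word = ?", (word,)
    ).fetchone()
    return json.loads(row[0]) if row else None

print(lookup("canada"))
print(lookup("beyoncé"))
```

Only the requested rows are read, so nothing has to be loaded into memory up front; an out-of-vocabulary word simply returns no row.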

In [40]:
from embeddings import GloveEmbedding, FastTextEmbedding

The results of the word embedding comparison are collected in the results_df table, which has the following columns:

- name: short name of the pretrained word embedding dataset 
- size(Mb): size of the downloaded pretrained model 
- number of words in training: total number of tokens in the training corpus of the pretrained model 
- number of unique tokens: total number of words in the vocabulary of the pretrained model 
- cased: whether the words in the vocabulary are cased 
- source of words: short name of the source 
- made by: short name of the author/company 
- missing words %: percentage of words from the SQuAD dataset that are not in the pretrained dataset 
- embedding coverage %: percentage of words from the SQuAD dataset that are in the pretrained dataset 
- missing words with latin letters %: percentage of words with latin letters from the SQuAD dataset that are not in the pretrained dataset 
- embedding coverage with latin letters %: percentage of words with latin letters from the SQuAD dataset that are in the pretrained dataset 
- first loading time hours/min/sec: time needed to download the pretrained model and fill the SQLite database 

When a pretrained model is loaded for the first time, the loading time is the time to download the model plus the time to create the database. When a model is reloaded, the loading time is much shorter because only the database of vocabulary and, optionally, the embedded vectors has to be loaded.
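The loading times in this notebook are measured in seconds and reported as hours:min:sec. A small helper along the following lines could do the conversion; `format_hms` and `timed` are hypothetical names and are not part of the notebook's `utils` module.

```python
import time


def format_hms(seconds):
    """Format a duration in seconds as h:mm:ss."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return "{}:{:02d}:{:02d}".format(hours, minutes, secs)


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


# Stand-in workload; in the notebook this would be the model constructor.
result, elapsed = timed(sum, range(1000))
print("Done. Time: {} ({} sec.)".format(format_hms(elapsed), round(elapsed, 2)))
```

For example, `format_hms(6398)` gives `1:46:38`, the first loading time recorded for GloVe common_crawl_840.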

In [41]:
results_df = pd.DataFrame(columns=['name', 'size(Mb)', 'number of words in training', 'number of unique tokens',
                                   'cased', 'source of words', 'made by', 'missing words %', 'embedding coverage %',
                                   'missing words with latin letters %', 'embedding coverage with latin letters %',
                                   'first loading time hours/min/sec'])

We fill the following dataframes with each word, its frequency, and whether the word exists in each word embedding (WE) model.

In [38]:
squad_words_and_freqs_df = pd.DataFrame(list(squad_words_and_freqs.items()), columns=['word', 'frequency'])
squad_words_and_freqs_uncased_df =  pd.DataFrame(list(squad_words_and_freqs_uncased.items()), columns=['word', 'frequency'])
squad_words_and_freqs_eng_df =  pd.DataFrame(list(squad_words_and_freqs_eng.items()), columns=['word', 'frequency'])
squad_words_and_freqs_eng_uncased_df =  pd.DataFrame(list(squad_words_and_freqs_eng_uncased.items()), columns=['word', 'frequency'])

## GloVe

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

https://nlp.stanford.edu/projects/glove/

Pre-trained word vectors:

- Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip
- Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
- Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip
- Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): glove.6B.zip

### Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.126 GB download): glove.840B.300d.zip

https://resources.wolframcloud.com/NeuralNetRepository/resources/GloVe-300-Dimensional-Word-Vectors-Trained-on-Common-Crawl-840B

This model encodes 2,196,016 tokens as unique vectors, with all tokens outside the vocabulary encoded as the zero-vector. It was released in 2014 by the computer science department at Stanford University. The model was trained on 840 billion tokens of Web data from Common Crawl and is case-sensitive. All vectors are 300-dimensional.
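The zero-vector convention for out-of-vocabulary tokens can be sketched as follows. This is only an illustration with a toy table; `embed_or_zero` and `is_covered` are hypothetical helpers and do not show how the Embeddings package or `utils` implement the lookup.

```python
def embed_or_zero(word, table, dim):
    """Return the vector for word, or a zero-vector of length dim if the word is unknown."""
    return table.get(word, [0.0] * dim)


def is_covered(word, table):
    """A word counts as covered only if it is present in the vocabulary."""
    return word in table


# Toy 3-dimensional vocabulary (real GloVe vectors here are 300-dimensional).
toy_table = {"piano": [0.1, 0.2, 0.3]}

print(embed_or_zero("piano", toy_table, 3))
print(embed_or_zero("bee-yon-say", toy_table, 3))
print(is_covered("bee-yon-say", toy_table))
```

Since every unknown word maps to the same zero-vector, uncovered words are indistinguishable to a downstream model, which is why the coverage percentages measured below matter.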

In [41]:
#initial load time= 1:46:38
start_time = time.time()
glove_common_crawl_840 = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 121.72 sec.


In [46]:
words_out_of_model_glove_common_crawl_840 = utils.get_missing_words('GloVe common_crawl_840', glove_common_crawl_840, 
                                                                          utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109584/109584 [15:08<00:00, 120.64it/s]


words out of GloVe common_crawl_840 : 20171, percentage of successful word embedding 81.59311578332603%, unsuccessful 18.406884216673966%
Done. Time: 908.6 sec.
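The percentages printed above follow directly from the number of missing words. A minimal sketch of such a computation is shown below; the function `get_missing_words_sketch` is hypothetical and only mirrors what the output of `utils.get_missing_words` suggests, with a plain Python set standing in for the embedding vocabulary.

```python
def get_missing_words_sketch(vocab, words):
    """Return (missing words, coverage %, missing %) for a list of words and a vocabulary set."""
    missing = [w for w in words if w not in vocab]
    total = len(words)
    missing_pct = 100.0 * len(missing) / total if total else 0.0
    return missing, 100.0 - missing_pct, missing_pct


# Toy vocabulary and word list.
toy_vocab = {"the", "cat"}
toy_words = ["the", "cat", "bee-yon-say", "Darlette"]
missing, covered_pct, missing_pct = get_missing_words_sketch(toy_vocab, toy_words)
print("words out of model: {}, successful {}%, unsuccessful {}%".format(
    len(missing), covered_pct, missing_pct))

# The same arithmetic applied to the counts reported above.
print(100.0 * 20171 / 109584)
```

The last line reproduces the roughly 18.41% of SQuAD words that are missing from GloVe common_crawl_840.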


In [42]:
squad_words_and_freqs_df['GloVe common_crawl_840'] = utils.get_word_existence_for_WE(glove_common_crawl_840,
                                                                                     utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109584/109584 [13:01<00:00, 140.25it/s]


In [50]:
eng_words_out_of_model_glove_common_crawl_840 = utils.get_missing_words('GloVe common_crawl_840', glove_common_crawl_840, 
                                                                          utils.get_word_list(squad_words_and_freqs_eng))

100%|████████████████████████████████| 104455/104455 [00:11<00:00, 9030.70it/s]


words out of GloVe common_crawl_840 : 16374, percentage of successful word embedding 84.32435019865014%, unsuccessful 15.675649801349865%
Done. Time: 11.57 sec.


In [43]:
squad_words_and_freqs_eng_df['GloVe common_crawl_840'] = utils.get_word_existence_for_WE(glove_common_crawl_840,
                                                                        utils.get_word_list(squad_words_and_freqs_eng))

100%|████████████████████████████████| 104455/104455 [00:11<00:00, 9153.36it/s]


In [52]:
print('uncovered word examples: {0}'.format(eng_words_out_of_model_glove_common_crawl_840[0:30]))

uncovered word examples: ['bee-yon-say', 'Darlette', 'elsik', 'best-charting', '663000', '317000', '541000', '482000', 'number-ones', 'romance-themed', 'electro-r', 'Lifeandtimescom', 'TEDxEuston', 'goose-bump-inducing', 'diva-roars', 'irreemplazable', 'female-empowerment', 'man-tending', '11-motivated', 'mini-hula', 'Cooper-Donnell', 'beyontourage', 'Llewyn-Smith', 'Scaptia', 'beyonceae', 'CSPINET', 'GateFive', 'food-donation', "itwasannouncedthatdestiny'swoulddisbaninwhatcity", 'pareles']


In [54]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng))
results_df.loc[len(results_df)]= ['GloVe common_crawl_840', 2126, '840 billion', '2 196 016', 
                                  'cased', 'Common Crawl', 'Stanford Univ',
                                   (len(words_out_of_model_glove_common_crawl_840) / squad_words_cnt * 100),
                                   (100 -  len(words_out_of_model_glove_common_crawl_840) / squad_words_cnt * 100), 
                                   (len(eng_words_out_of_model_glove_common_crawl_840) / squad_words_en_cnt * 100),
                                   (100 -  len(eng_words_out_of_model_glove_common_crawl_840) / squad_words_en_cnt * 100), 
                                   '1:46:38']

In [44]:
#save results with word existence
squad_words_and_freqs_df.to_excel("../data/squad/word_freqs.xlsx")
squad_words_and_freqs_eng_df.to_excel("../data/squad/word_freqs_eng.xlsx")

### Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip

https://resources.wolframcloud.com/NeuralNetRepository/resources/GloVe-300-Dimensional-Word-Vectors-Trained-on-Common-Crawl-42B

This model encodes 1,917,495 tokens as unique vectors, with all tokens outside the vocabulary encoded as the zero-vector. It was released in 2014 by the computer science department at Stanford University. The model was trained on 42 billion tokens of Web data from Common Crawl. All tokens are uncased. All vectors are 300-dimensional.

In [45]:
#reload time ('common_crawl_48' is the name passed to GloveEmbedding here for the 42B-token model)
start_time = time.time()
glove_common_crawl_42 = GloveEmbedding('common_crawl_48', d_emb=300, show_progress=True)
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 152.65 sec.


In [57]:
words_out_of_model_common_crawl_42 = utils.get_missing_words('GloVe common_crawl_42', glove_common_crawl_42, 
                                                                   utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████████| 95465/95465 [12:26<00:00, 127.95it/s]


words out of GloVe common_crawl_42 : 15809, percentage of successful word embedding 83.44000419001728%, unsuccessful 16.559995809982716%
Done. Time: 746.12 sec.


In [46]:
squad_words_and_freqs_uncased_df['GloVe common_crawl_42'] = utils.get_word_existence_for_WE(glove_common_crawl_42,
                                                                         utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████████| 95465/95465 [12:13<00:00, 130.07it/s]


In [60]:
eng_words_out_of_model_common_crawl_42 = utils.get_missing_words('GloVe common_crawl_42', glove_common_crawl_42, 
                                                                   utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|██████████████████████████████████| 90601/90601 [00:09<00:00, 9694.58it/s]


words out of GloVe common_crawl_42 : 12306, percentage of successful word embedding 86.41736846171676%, unsuccessful 13.582631538283241%
Done. Time: 9.35 sec.


In [47]:
squad_words_and_freqs_eng_uncased_df['GloVe common_crawl_42'] = utils.get_word_existence_for_WE(glove_common_crawl_42,
                                                                         utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|██████████████████████████████████| 90601/90601 [00:09<00:00, 9813.24it/s]


In [62]:
print('uncovered word examples: {0}'.format(eng_words_out_of_model_common_crawl_42[0:200]))

uncovered word examples: ['bee-yon-say', '', 'darlette', 'best-charting', '663000', '317000', '541000', '482000', 'number-ones', 'electro-r', 'lifeandtimescom', 'tedxeuston', 'goose-bump-inducing', 'diva-roars', 'female-empowerment', 'man-tending', '11-motivated', 'mini-hula', 'cooper-donnell', 'beyontourage', 'llewyn-smith', 'scaptia', 'beyonceae', 'cspinet', 'gatefive', 'food-donation', "itwasannouncedthatdestiny'swoulddisbaninwhatcity", 'wholistedheratnumber17intheirlistoftop20hot100', 'advetisments', 'bayonce', 'whenwasitannouncedthatwasaco-ownerin', 'skarbeks', 'belweder', 'ursyn', 'niemcewicz', 'przebiegi', 'eolomelodicon', 'szafarnia', 'dziewanowski', 'salonik', 'mieroszewski', 'woyciechowski', 'nepomucen', 'witwicki', 'konstancja', 'piano-bashing', 'jachimecki', 'concert-giving', 'bendemann', 'amantine', 'mallefille', 'canuts', 'gestirne', 'fortune-hunting', 'bozzolini', 'prosseda', 'blanchar', 'methuen-campbell', 'obreskoff', 'czartoryska', 'cruveilhier', 'jeanne-anais', 'fior

Conclusion: 
- even after removing words that contain non-latin letters, some non-English words written with latin letters remain, like 'literaturna' (Bulgarian) and 'hongzhu' (Chinese). 
- some words contain typos, like 'culturural' (probably 'cultural'), 'desagreements' (probably 'disagreements'), 'percentsge' (probably 'percentage'). 
- besides this, it seems that the pretrained models do not cover float numbers like '638,817' or big numbers with ',' as a separator like '1,698,465'; maybe such numbers should be split on ','. 
- some custom constructions, like '50-xxxx', 'emf2', '20-cwt', are not in the vocabulary. 
- some rare English words, like 'intendancy', 'anti-monarchists', are not covered. 
- some names were not marked as named entities, like 'Vseslav', 'Demira'. 
- some words were parsed incorrectly, like '520-485', 'n=279', '9,192,631,770'.
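The groups listed above could be separated automatically with a few regular expressions. The sketch below is one rough way to do it; the category names and the `categorize` function are assumptions for illustration, and typos, rare words, names and foreign words all fall into the last bucket because they cannot be told apart by pattern alone.

```python
import re

# Digits grouped in threes by commas, e.g. '1,698,465'.
NUMBER_WITH_COMMAS = re.compile(r"^\d{1,3}(,\d{3})+$")
# Letters-only parts joined by hyphens, e.g. 'anti-monarchists'.
HYPHENATED = re.compile(r"^[^\W\d_]+(-[^\W\d_]+)+$")


def categorize(token):
    """Assign an uncovered token to a coarse category (hypothetical labels)."""
    if NUMBER_WITH_COMMAS.match(token):
        return "number with ',' separators"
    if HYPHENATED.match(token):
        return "hyphenated compound"
    if any(ch.isdigit() for ch in token):
        return "mixed digits, letters or symbols"
    return "other"


for token in ["1,698,465", "anti-monarchists", "50-xxxx", "n=279", "culturural"]:
    print(token, "->", categorize(token))

# Splitting a comma-separated number into parts, as suggested in the list above.
print("1,698,465".split(","))
```

Counting tokens per category would show how much coverage could be gained by better number handling compared with spelling correction.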


In [63]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs_uncased))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng_uncased))
results_df.loc[len(results_df)] = ['GloVe common_crawl_42', 1833, '42 billion', '1 917 495', 
                                    'uncased', 'Common Crawl', 'Stanford Univ',
                                    len(words_out_of_model_common_crawl_42) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_common_crawl_42) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_common_crawl_42) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_common_crawl_42) / squad_words_en_cnt * 100,
                                    '1:30:10']

In [48]:
#save results with word existence
squad_words_and_freqs_uncased_df.to_excel("../data/squad/word_freqs_uncased.xlsx")
squad_words_and_freqs_eng_uncased_df.to_excel("../data/squad/word_freqs_eng_uncased.xlsx")

### Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

https://resources.wolframcloud.com/NeuralNetRepository/resources/GloVe-200-Dimensional-Word-Vectors-Trained-on-Tweets

This model encodes 1,193,515 tokens as unique vectors, with all tokens outside the vocabulary encoded as the zero-vector. It was released in 2014 by the computer science department at Stanford University. The model was trained on 2 billion tweets (27 billion tokens). All tokens are uncased. All vectors are 200-dimensional.

In [49]:
#reload time
start_time = time.time()
glove_twitter = GloveEmbedding('twitter', d_emb=200, show_progress=True)
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 57.21 sec.


In [67]:
words_out_of_model_glove_twitter = utils.get_missing_words('GloVe twitter', glove_twitter, 
                                                           utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████████| 95465/95465 [07:33<00:00, 210.68it/s]


words out of GloVe twitter : 44375, percentage of successful word embedding 53.516995757607496%, unsuccessful 46.483004242392504%
Done. Time: 453.14 sec.


In [50]:
squad_words_and_freqs_uncased_df['GloVe twitter'] = utils.get_word_existence_for_WE(glove_twitter,
                                                                         utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████████| 95465/95465 [07:47<00:00, 204.39it/s]


In [69]:
eng_words_out_of_model_glove_twitter = utils.get_missing_words('GloVe twitter', glove_twitter, 
                                                               utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|█████████████████████████████████| 90601/90601 [00:07<00:00, 11582.18it/s]


words out of GloVe twitter : 40749, percentage of successful word embedding 55.02367523537268%, unsuccessful 44.97632476462732%
Done. Time: 7.83 sec.


In [51]:
squad_words_and_freqs_eng_uncased_df['GloVe twitter'] = utils.get_word_existence_for_WE(glove_twitter,
                                                                     utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|█████████████████████████████████| 90601/90601 [00:07<00:00, 11981.99it/s]


In [71]:
print('uncovered word examples: {0}'.format(eng_words_out_of_model_glove_twitter[0:30]))

uncovered word examples: ['bee-yon-say', '4', '1981', '1990s', 'girl-group', '2003', '100', 'disbandment', '2005', '2006', '2009', '2008', '2010', '2011', 'mellower', '1970s', '1980s', '2013', '19', '118', '60', '20', '2000s', '2014', '2015', 'darlette', '15', '16-year-olds', '1990', 'frager']


The list of uncovered words contains "creek),[citation". This means that the text could be split into separate words better if we added some regular expressions to the string tokenizer.
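As a rough sketch of such a tokenizer, the regular expression below keeps words (including inner hyphens and apostrophes) together and emits every punctuation character as its own token. The pattern `TOKEN_RE` and the function `tokenize` are assumptions for illustration, not the tokenizer used to build the SQuAD vocabulary in this notebook.

```python
import re

# A word, optionally continued by '-' or "'" plus more word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("creek),[citation"))
print(tokenize("the best-charting single"))
```

With this pattern, "creek),[citation" yields the words 'creek' and 'citation' plus separate punctuation tokens, so both words can be looked up in the embedding vocabulary.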

In [72]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs_uncased))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng_uncased))
results_df.loc[len(results_df)] = ['GloVe twitter', 1484, '27 billion', '1 193 515', 
                                    'uncased', 'Tweets', 'Stanford Univ',
                                    len(words_out_of_model_glove_twitter) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_glove_twitter) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_glove_twitter) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_glove_twitter) / squad_words_en_cnt * 100,
                                    '00:27:06']

In [52]:
#save results with word existence
squad_words_and_freqs_uncased_df.to_excel("../data/squad/word_freqs_uncased.xlsx")
squad_words_and_freqs_eng_uncased_df.to_excel("../data/squad/word_freqs_eng_uncased.xlsx")

### Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): glove.6B.zip

https://resources.wolframcloud.com/NeuralNetRepository/resources/GloVe-300-Dimensional-Word-Vectors-Trained-on-Wikipedia-and-Gigaword-5-Data

This model encodes 400,000 tokens as unique vectors, with all tokens outside the vocabulary encoded as the zero-vector. It was released in 2014 by the computer science department at Stanford University. The model was trained on 6 billion tokens from a combination of the Wikipedia 2014 dump and the Gigaword 5 corpus. All tokens are uncased. The vectors loaded below are 300-dimensional.

In [53]:
#reload time
start_time = time.time()
glove_wikipedia_gigaword = GloveEmbedding('wikipedia_gigaword', d_emb=300, show_progress=True)
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 16.14 sec.


In [75]:
words_out_of_model_glove_wikipedia_gigaword = utils.get_missing_words('GloVe wikipedia_gigaword', 
                                                glove_wikipedia_gigaword, utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████████| 95465/95465 [06:37<00:00, 240.11it/s]


words out of GloVe wikipedia_gigaword : 24175, percentage of successful word embedding 74.67658304090504%, unsuccessful 25.323416959094956%
Done. Time: 397.6 sec.


In [54]:
squad_words_and_freqs_uncased_df['GloVe wikipedia_gigaword'] = utils.get_word_existence_for_WE(glove_wikipedia_gigaword,
                                                         utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████████| 95465/95465 [08:53<00:00, 178.95it/s]


In [77]:
eng_words_out_of_model_glove_wikipedia_gigaword = utils.get_missing_words('GloVe wikipedia_gigaword', 
                                              glove_wikipedia_gigaword, utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|█████████████████████████████████| 90601/90601 [00:08<00:00, 11014.69it/s]


words out of GloVe wikipedia_gigaword : 20576, percentage of successful word embedding 77.28943389145815%, unsuccessful 22.710566108541848%
Done. Time: 8.27 sec.


In [55]:
squad_words_and_freqs_eng_uncased_df['GloVe wikipedia_gigaword'] = utils.get_word_existence_for_WE(glove_wikipedia_gigaword,
                                                                 utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|█████████████████████████████████| 90601/90601 [00:07<00:00, 11398.57it/s]


In [79]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs_uncased))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng_uncased))
results_df.loc[len(results_df)] = ['GloVe wikipedia_gigaword', 841, '6 billion', '400 000', 
                                    'uncased', 'Wikipedia 2014 + Gigaword 5', 'Stanford Univ',
                                    len(words_out_of_model_glove_wikipedia_gigaword) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_glove_wikipedia_gigaword) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_glove_wikipedia_gigaword) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_glove_wikipedia_gigaword) / squad_words_en_cnt * 100,
                                    '00:03:58']

In [80]:
results_df

| | name | size(Mb) | number of words in training | number of unique tokens | cased | source of words | made by | missing words % | embedding coverage % | missing words with latin letters % | embedding coverage with latin letters % | first loading time hours/min/sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GloVe common_crawl_840 | 2126 | 840 billion | 2 196 016 | cased | Common Crawl | Stanford Univ | 18.406884 | 81.593116 | 15.67565 | 84.32435 | 1:46:38 |
| 1 | GloVe common_crawl_42 | 1833 | 42 billion | 1 917 495 | uncased | Common Crawl | Stanford Univ | 16.559996 | 83.440004 | 13.582632 | 86.417368 | 1:30:10 |
| 2 | GloVe twitter | 1484 | 27 billion | 1 193 515 | uncased | Tweets | Stanford Univ | 46.483004 | 53.516996 | 44.976325 | 55.023675 | 00:27:06 |
| 3 | GloVe wikipedia_gigaword | 841 | 6 billion | 400 000 | uncased | Wikipedia 2014 + Gigaword 5 | Stanford Univ | 25.323417 | 74.676583 | 22.710566 | 77.289434 | 00:03:58 |


In [56]:
#save results with word existence
squad_words_and_freqs_uncased_df.to_excel("../data/squad/word_freqs_uncased.xlsx")
squad_words_and_freqs_eng_uncased_df.to_excel("../data/squad/word_freqs_eng_uncased.xlsx")

The GloVe Wikipedia + Gigaword model had the shortest first loading time because its download is smaller (less than 1 GB) and its vocabulary is smaller: only 400 000 records have to be written to the database, compared with more than 1 million for the other datasets.

In [355]:
test_words = ['123', '3', 'cyclotron', 'Knowles-Carter', 'New', 'York', 'New York']
test_words_lowercased = ['123', '3', 'cyclotron', 'knowles-carter', 'new', 'york', 'new york']
glove_model_names = ['crawl_840','crawl_42','twitter','wikipedia']
glove_model_case = ['cased','uncased','uncased','uncased']
glove_models = [glove_common_crawl_840, glove_common_crawl_42, glove_twitter, glove_wikipedia_gigaword]

test_df = utils.process_test_words(glove_model_names, glove_model_case, test_words, test_words_lowercased, glove_models)
test_df

| | name | 123 | 3 | cyclotron | Knowles-Carter | New | York | New York |
|---|---|---|---|---|---|---|---|---|
| 0 | crawl_840 | in dict | in dict | in dict | in dict | in dict | in dict | not in dict |
| 1 | crawl_42 | in dict | in dict | in dict | in dict | in dict | in dict | not in dict |
| 2 | twitter | not in dict | not in dict | not in dict | in dict | in dict | in dict | not in dict |
| 3 | wikipedia | in dict | in dict | in dict | not in dict | in dict | in dict | not in dict |
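In the test above, 'New York' is missing from every model even though 'New' and 'York' are both present, which suggests that the vocabularies hold single tokens only. The sketch below shows one way to make that distinction explicit. It is not the implementation of `utils.process_test_words`: the function `lookup_status`, the extra label 'parts in dict' and the toy vocabulary are assumptions for illustration.

```python
def lookup_status(phrase, vocab, cased):
    """Report whether a phrase, or at least each of its space-separated parts, is in the vocabulary."""
    # Uncased models are queried with the lowercased form.
    key = phrase if cased else phrase.lower()
    if key in vocab:
        return "in dict"
    parts = key.split()
    if len(parts) > 1 and all(part in vocab for part in parts):
        return "parts in dict"
    return "not in dict"


# Toy uncased vocabulary.
toy_vocab = {"new", "york", "cyclotron", "3"}

for phrase in ["New", "York", "New York", "Knowles-Carter"]:
    print(phrase, "->", lookup_status(phrase, toy_vocab, cased=False))
```

A phrase reported as 'parts in dict' can still be embedded token by token, whereas 'not in dict' marks a real gap in coverage.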


#### Conclusion

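The best pretrained model among the GloVe models is GloVe common_crawl_42, as it has the lowest percentage of uncovered SQuAD words (16.56%, or 13.58% for words with latin letters). The lowest coverage, for the Twitter model, is consistent with the test above, where the numbers '123' and '3' and the rare word 'cyclotron' are missing only from that model; the Wikipedia model in turn misses the hyphenated name 'Knowles-Carter'.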

## Fasttext

The fastText word embedding comes from the same group of people who created word2vec. It extends word2vec by introducing subword modeling.

fastText is an approach based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and a word is represented as the sum of these representations.
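The bag of character n-grams can be sketched in a few lines. fastText wraps each word in the boundary symbols `<` and `>` before extracting n-grams; the function name `char_ngrams` and its default range of 3 to 6 characters are written here as an illustration rather than taken from this notebook.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in '<' and '>' boundary symbols."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Only 3-grams, to keep the output short.
print(char_ngrams("where", min_n=3, max_n=3))
```

For 'where' with n = 3 this gives '<wh', 'whe', 'her', 'ere' and 're>'. The boundary symbols distinguish the n-gram 'her' inside 'where' from the whole word '<her>'.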

For .bin files, use load_fasttext_format() (such a file typically contains a full model with parameters, n-grams, etc.).

For .vec files, use load_word2vec_format() (such a file contains ONLY word vectors: there are no n-grams and the model cannot be updated).
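A .vec file is plain text: the first line holds the number of words and the vector dimension, and each following line holds a word followed by its vector components. The sketch below reads such a file without any third-party library; `load_vec` and the toy file are assumptions for illustration and are not the loader used by this notebook.

```python
import os
import tempfile


def load_vec(path, limit=None):
    """Read a .vec text file into a {word: [float, ...]} dict and return it with the dimension."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # Header line: "<number of words> <dimension>".
        _count, dim = map(int, f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip("\n").split(" ")
            values = [float(x) for x in parts[1:] if x]
            if len(values) != dim:
                continue  # skip malformed lines
            vectors[parts[0]] = values
    return vectors, dim


# Write a toy file with two 3-dimensional vectors, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.vec")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("2 3\nthe 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
    toy_vectors, toy_dim = load_vec(toy_path)

print(toy_dim, sorted(toy_vectors))
```

Because the file lists whole words only, a word that is absent from it has no vector at all, which is what the coverage measurements in this section count.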

https://fasttext.cc/docs/en/english-vectors.html

Pre-trained word vectors learned on different sources can be downloaded below:

- Wiki word vectors
- wiki-news-300d-1M.vec.zip: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
- wiki-news-300d-1M-subword.vec.zip: 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
- crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens).
- crawl-300d-2M-subword.zip: 2 million word vectors trained with subword information on Common Crawl (600B tokens).

### Wiki word vectors

https://fasttext.cc/docs/en/pretrained-vectors.html
    
Vectors trained on Wikipedia using fastText. These 300-dimensional vectors were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.

In [57]:
#reload time
start_time = time.time()
fasttext_wiki = FastTextEmbedding()
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 148.07 sec.


In [83]:
words_out_of_model_fasttext_wiki = utils.get_missing_words('Fasttext fasttext_wiki', 
                                                            fasttext_wiki, utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████████| 95465/95465 [10:17<00:00, 154.58it/s]


words out of Fasttext fasttext_wiki : 21531, percentage of successful word embedding 77.44618446551092%, unsuccessful 22.55381553448908%
Done. Time: 617.6 sec.


In [58]:
squad_words_and_freqs_uncased_df['Fasttext fasttext_wiki'] = utils.get_word_existence_for_WE(fasttext_wiki,
                                                         utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████████| 95465/95465 [11:33<00:00, 137.57it/s]


In [85]:
eng_words_out_of_model_fasttext_wiki = utils.get_missing_words('Fasttext fasttext_wiki', 
                                                    fasttext_wiki,  utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|█████████████████████████████████| 90601/90601 [00:07<00:00, 11922.07it/s]


words out of Fasttext fasttext_wiki : 19286, percentage of successful word embedding 78.71325923554927%, unsuccessful 21.286740764450723%
Done. Time: 7.61 sec.


In [59]:
squad_words_and_freqs_eng_uncased_df['Fasttext fasttext_wiki'] = utils.get_word_existence_for_WE(fasttext_wiki,
                                                                 utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|█████████████████████████████████| 90601/90601 [00:07<00:00, 11545.28it/s]


In [87]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs_uncased))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng_uncased))
results_df.loc[len(results_df)] = ['fasttext fasttext_wiki', 10114, 'more than 100 millions', '2 518 927', 
                                    'uncased', 'Wikipedia', 'P. Bojanowski*, E. Grave, A. Joulin, T. Mikolov',
                                    len(words_out_of_model_fasttext_wiki) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_fasttext_wiki) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_fasttext_wiki) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_fasttext_wiki) / squad_words_en_cnt * 100,
                                    '12:10:02']

In [60]:
#save results with word existence
squad_words_and_freqs_uncased_df.to_excel("../data/squad/word_freqs_uncased.xlsx")
squad_words_and_freqs_eng_uncased_df.to_excel("../data/squad/word_freqs_eng_uncased.xlsx")

In [61]:
from word_emb import Word_Emb



### wiki-news-300d-1M.vec

https://fasttext.cc/docs/en/english-vectors.html

1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

In [62]:
#from fasttext_emb import FastTextEmbedding_All

In [63]:
#reload time
start_time = time.time()
#fasttext_wiki_news = FastTextEmbedding_All(name='wiki_news', d_emb=300)
fasttext_wiki_news = Word_Emb(url='https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip', 
                     emb_alg = 'fasttext', short_name = 'wiki_news', vec_name = 'wiki-news-300d-1M.vec')
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 46.59 sec.


In [92]:
words_out_of_model_fasttext_wiki_news = utils.get_missing_words('Fasttext fasttext_wiki_news', 
                                                           fasttext_wiki_news, utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109584/109584 [11:50<00:00, 154.22it/s]


words out of Fasttext fasttext_wiki_news : 21392, percentage of successful word embedding 80.47890202949336%, unsuccessful 19.52109797050664%
Done. Time: 710.57 sec.


In [64]:
squad_words_and_freqs_df['Fasttext fasttext_wiki_news'] = utils.get_word_existence_for_WE(fasttext_wiki_news,
                                                                                     utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109584/109584 [09:42<00:00, 188.03it/s]


In [94]:
eng_words_out_of_model_fasttext_wiki_news = utils.get_missing_words('Fasttext fasttext_wiki_news', 
                                                       fasttext_wiki_news, utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104455/104455 [00:09<00:00, 10636.36it/s]


words out of Fasttext fasttext_wiki_news : 18009, percentage of successful word embedding 82.75908285864726%, unsuccessful 17.240917141352735%
Done. Time: 9.83 sec.


In [65]:
squad_words_and_freqs_eng_df['Fasttext fasttext_wiki_news'] = utils.get_word_existence_for_WE(fasttext_wiki_news,
                                                                         utils.get_word_list(squad_words_and_freqs_eng))

100%|████████████████████████████████| 104455/104455 [00:10<00:00, 9992.25it/s]


In [96]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng))
results_df.loc[len(results_df)] = ['fasttext fasttext_wiki_news', 665, '16 billion', '1 million', 
                                    'cased', 'Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset',
                                    'T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin',
                                    len(words_out_of_model_fasttext_wiki_news) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_fasttext_wiki_news) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_fasttext_wiki_news) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_fasttext_wiki_news) / squad_words_en_cnt * 100,
                                    '00:29:42']

In [66]:
#save results with word existence
squad_words_and_freqs_df.to_excel("../data/squad/word_freqs.xlsx")
squad_words_and_freqs_eng_df.to_excel("../data/squad/word_freqs_eng.xlsx")

### wiki-news-300d-1M-subword.vec

#### Subword information

Standard word vectors ignore the internal structure of words, which contains rich information. This information could be useful for computing representations of rare or misspelled words. A simple yet effective approach is to enrich the word vectors with a bag of character n-gram vectors that is either derived from the singular value decomposition of the co-occurrence matrix (Schütze, 1993) or directly learned from a large corpus of data (Bojanowski et al., 2017).

https://arxiv.org/pdf/1712.09405.pdf
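To illustrate why subword information helps with rare or misspelled words, the sketch below builds a vector for an unseen word by summing the vectors of its known character n-grams. The n-gram table, its values and the function `ngram_vector` are toy assumptions; a real model learns these vectors, and the .vec files used in this notebook contain word vectors only, so this composition cannot be done from them.

```python
def ngram_vector(word, ngram_table, dim, n=3):
    """Sum the vectors of the word's known character n-grams; zero-vector if none are known."""
    wrapped = "<" + word + ">"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    known = [ngram_table[g] for g in grams if g in ngram_table]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) for k in range(dim)]


# Toy 2-dimensional n-gram vectors, as if learned from the word 'cat'.
toy_ngrams = {"<ca": [1.0, 0.0], "cat": [0.0, 1.0], "at>": [1.0, 1.0]}

print(ngram_vector("cat", toy_ngrams, 2))
print(ngram_vector("cab", toy_ngrams, 2))   # shares only '<ca' with 'cat'
print(ngram_vector("xyz", toy_ngrams, 2))   # no known n-grams
```

The unseen word 'cab' still receives a non-zero vector through the shared n-gram '<ca', whereas a lookup in a word-only vocabulary would simply report it as missing.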

https://fasttext.cc/docs/en/english-vectors.html

1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

In [67]:
#reload time
start_time = time.time()
#fasttext_wiki_news_subword = FastTextEmbedding_All(name='wiki_news_subword', d_emb=300)
fasttext_wiki_news_subword = Word_Emb(url='https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.vec.zip', 
                     emb_alg = 'fasttext', short_name = 'wiki_news_subword', vec_name = 'wiki-news-300d-1M-subword.vec')
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 45.24 sec.


In [11]:
words_out_of_model_fasttext_wiki_news_subword = utils.get_missing_words('Fasttext fasttext_wiki_news_subword', 
                                                       fasttext_wiki_news_subword, utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109581/109581 [10:53<00:00, 167.68it/s]


words out of Fasttext fasttext_wiki_news_subword : 21393, percentage of successful word embedding 80.47745503326307%, unsuccessful 19.522544966736934%
Done. Time: 653.52 sec.


In [68]:
squad_words_and_freqs_df['Fasttext fasttext_wiki_news_subword'] = utils.get_word_existence_for_WE(fasttext_wiki_news_subword,
                                                                         utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109584/109584 [10:13<00:00, 178.48it/s]


In [53]:
eng_words_out_of_model_fasttext_wiki_news_subword = utils.get_missing_words('Fasttext fasttext_wiki_news_subword', 
                                                       fasttext_wiki_news_subword, utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104835/104835 [00:09<00:00, 10601.64it/s]


words out of Fasttext fasttext_wiki_news_subword : 17864, percentage of successful word embedding 82.95988934993085%, unsuccessful 17.040110650069156%
Done. Time: 9.89 sec.


In [69]:
squad_words_and_freqs_eng_df['Fasttext fasttext_wiki_news_subword'] = utils.get_word_existence_for_WE(fasttext_wiki_news_subword,
                                                                         utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104455/104455 [00:09<00:00, 10935.93it/s]


In [54]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng))
results_df.loc[len(results_df)] = ['fasttext fasttext_wiki_news_subword', 574, '16 billion', '1 million', 
                                    'cased', 'Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset',
                                    'T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin',
                                    len(words_out_of_model_fasttext_wiki_news_subword) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_fasttext_wiki_news_subword) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_fasttext_wiki_news_subword) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_fasttext_wiki_news_subword) / squad_words_en_cnt * 100,
                                    '00:27:12']

In [70]:
#save results with word existence
squad_words_and_freqs_df.to_excel("../data/squad/word_freqs.xlsx")
squad_words_and_freqs_eng_df.to_excel("../data/squad/word_freqs_eng.xlsx")

### crawl-300d-2M.vec

2 million word vectors trained on Common Crawl (600B tokens).

The source of text data for this model is the Common Crawl. While it provides noisier data than Wikipedia articles, it comes in larger amounts and with a broader coverage (http://www.lrec-conf.org/proceedings/lrec2018/pdf/627.pdf).
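The `.vec` files are plain text: a header line with the number of words and the vector dimension, then one line per word with its space-separated vector components. The reader below is a minimal sketch of that format. It is independent of the `Word_Emb` class used in this notebook, whose internals are not shown here; `read_vec` and the sample data are hypothetical.

```python
import io


def read_vec(handle, limit=None):
    """Parse a fastText-style .vec text stream into (n_words, dim, {word: vector})."""
    header = handle.readline().split()
    n_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:dim + 1]
        vectors[word] = [float(v) for v in values]
        if limit is not None and len(vectors) >= limit:
            break  # useful for peeking at a multi-gigabyte file
    return n_words, dim, vectors


# Tiny in-memory stand-in for a real .vec file (made-up values).
sample = io.StringIO("2 3\nnew 0.1 0.2 0.3\nyork 0.4 0.5 0.6\n")
n_words, dim, vectors = read_vec(sample)
print(n_words, dim, vectors["york"])
```

For a real file, the same function could be called on `open(path, encoding="utf-8")`, ideally with a `limit` while experimenting.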

In [71]:
#reload time
start_time = time.time()
fasttext_crawl = Word_Emb(url='https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip', 
               emb_alg = 'fasttext', short_name = 'crawl', vec_name = 'crawl-300d-2M.vec')
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 136.82 sec.


In [72]:
words_out_of_model_fasttext_crawl = utils.get_missing_words('Fasttext fasttext_crawl', fasttext_crawl, 
                                                            utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109584/109584 [12:59<00:00, 140.59it/s]


words out of Fasttext fasttext_crawl : 20018, percentage of successful word embedding 81.73273470579647%, unsuccessful 18.267265294203533%
Done. Time: 779.47 sec.


In [73]:
squad_words_and_freqs_df['Fasttext fasttext_crawl'] = utils.get_word_existence_for_WE(fasttext_crawl,
                                                                         utils.get_word_list(squad_words_and_freqs))

100%|███████████████████████████████| 109584/109584 [00:09<00:00, 11362.27it/s]


In [57]:
eng_words_out_of_model_fasttext_crawl = utils.get_missing_words('Fasttext fasttext_crawl', fasttext_crawl, 
                                                                utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104835/104835 [00:09<00:00, 10875.51it/s]


words out of Fasttext fasttext_crawl : 16425, percentage of successful word embedding 84.33252253541279%, unsuccessful 15.66747746458721%
Done. Time: 9.65 sec.


In [74]:
squad_words_and_freqs_eng_df['Fasttext fasttext_crawl'] = utils.get_word_existence_for_WE(fasttext_crawl,
                                                                         utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104455/104455 [00:09<00:00, 11240.75it/s]


In [58]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng))
results_df.loc[len(results_df)] = ['fasttext crawl', 1488, '600 billion', '2 million', 
                                    'cased', 'Common Crawl ',
                                    'T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin',
                                    len(words_out_of_model_fasttext_crawl) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_fasttext_crawl) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_fasttext_crawl) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_fasttext_crawl) / squad_words_en_cnt * 100,
                                    '1:43:49']

In [75]:
#save results with word existence
squad_words_and_freqs_df.to_excel("../data/squad/word_freqs.xlsx")
squad_words_and_freqs_eng_df.to_excel("../data/squad/word_freqs_eng.xlsx")

### crawl-300d-2M-subword

2 million word vectors trained with subword information on Common Crawl (600B tokens).

In [76]:
#reload time (initial load time= 3:50:24)
start_time = time.time()
fasttext_crawl_subword = Word_Emb(url='https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M-subword.zip', 
               emb_alg = 'fasttext', short_name = 'crawl-subword', vec_name = 'crawl-300d-2M-subword.vec')
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 145.81 sec.


In [60]:
words_out_of_model_fasttext_crawl_subword = utils.get_missing_words('Fasttext fasttext_crawl_subword', 
                                                      fasttext_crawl_subword, utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109964/109964 [15:23<00:00, 119.04it/s]


words out of Fasttext fasttext_crawl_subword : 19842, percentage of successful word embedding 81.95591284420355%, unsuccessful 18.044087155796444%
Done. Time: 923.75 sec.


In [77]:
squad_words_and_freqs_df['Fasttext fasttext_crawl_subword'] = utils.get_word_existence_for_WE(fasttext_crawl_subword,
                                                                         utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109584/109584 [15:33<00:00, 117.35it/s]


In [61]:
eng_words_out_of_model_fasttext_crawl_subword = utils.get_missing_words('Fasttext fasttext_crawl_subword', 
                                                  fasttext_crawl_subword,  utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104835/104835 [00:09<00:00, 10988.36it/s]


words out of Fasttext fasttext_crawl_subword : 16287, percentage of successful word embedding 84.46415796251252%, unsuccessful 15.53584203748748%
Done. Time: 9.55 sec.


In [78]:
squad_words_and_freqs_eng_df['Fasttext fasttext_crawl_subword'] = utils.get_word_existence_for_WE(fasttext_crawl_subword,
                                                                         utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104455/104455 [00:09<00:00, 11240.75it/s]


In [62]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng))
results_df.loc[len(results_df)] = ['fasttext fasttext_crawl_subword', 5691, '600 billion', '2 million', 
                                    'cased', 'Common Crawl ',
                                    'T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin',
                                    len(words_out_of_model_fasttext_crawl_subword) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_fasttext_crawl_subword) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_fasttext_crawl_subword) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_fasttext_crawl_subword) / squad_words_en_cnt * 100,
                                    '3:50:24']

In [79]:
#save results with word existence
squad_words_and_freqs_df.to_excel("../data/squad/word_freqs.xlsx")
squad_words_and_freqs_eng_df.to_excel("../data/squad/word_freqs_eng.xlsx")

In [63]:
results_df

Unnamed: 0,name,size(Mb),number of words in training,number of unique tokens,cased,source of words,made by,missing words %,embedding coverage %,missing words with latin letters %,embedding coverage with latin letters %,first loading time hours/min/sec
0,GloVe common_crawl_840,2126,840 billion,2 196 016,cased,Common Crawl,Stanford Univ,18.329635,81.670365,15.604521,84.395479,1:46:38
1,GloVe common_crawl_42,1833,42 billion,1 917 495,uncased,Common Crawl,Stanford Univ,14.806664,85.193336,12.110459,87.889541,1:30:10
2,GloVe twitter,1484,27 billion,1 193 515,uncased,Tweets,Stanford Univ,42.868575,57.131425,41.387895,58.612105,00:27:06
3,GloVe wikipedia_gigaword,841,6 billion,400 000,uncased,Wikipedia 2014 + Gigaword 5,Stanford Univ,22.313666,77.686334,19.90175,80.09825,00:03:58
4,fasttext fasttext_wiki,10114,more than 100 millions,2 518 927,uncased,Wikipedia,"P. Bojanowski*, E. Grave, A. Joulin, T. Mikolov",20.498527,79.501473,19.332284,80.667716,12:10:02
5,fasttext fasttext_wiki_news,665,16 billion,1 million,cased,"Wikipedia 2017, UMBC webbase corpus and statmt...","T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsc...",19.321778,80.678222,17.040111,82.959889,00:29:42
6,fasttext fasttext_wiki_news_subword,574,16 billion,1 million,cased,"Wikipedia 2017, UMBC webbase corpus and statmt...","T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsc...",19.321778,80.678222,17.040111,82.959889,00:27:12
7,fasttext crawl,1488,600 billion,2 million,cased,Common Crawl,"T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsc...",18.175948,81.824052,15.667477,84.332523,1:43:49
8,fasttext fasttext_crawl_subword,5691,600 billion,2 million,cased,Common Crawl,"T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsc...",18.044087,81.073115,14.811211,84.464158,3:50:24


In [398]:
test_words = ['123', '3', 'cyclotron', 'Knowles-Carter', 'New', 'York', 'New York']
test_words_lowercased = ['123', '3', 'cyclotron', 'knowles-carter', 'new', 'york', 'new york']
fasttext_model_names = ['fasttext_wiki', 'fasttext_wiki_news', 'fasttext_wiki_news_subword', 'fasttext_crawl', 
                        'fasttext_crawl_subword']
fasttext_model_case = ['uncased','cased','cased','cased', 'cased']
fasttext_models = [fasttext_wiki, fasttext_wiki_news, fasttext_wiki_news_subword, fasttext_crawl, fasttext_crawl_subword]

test_df = utils.process_test_words(fasttext_model_names, fasttext_model_case, test_words, test_words_lowercased, fasttext_models)
test_df

Unnamed: 0,name,123,3,cyclotron,Knowles-Carter,New,York,New York
0,fasttext_wiki,not in dict,not in dict,in dict,not in dict,in dict,in dict,not in dict
1,fasttext_wiki_news,in dict,in dict,in dict,in dict,in dict,in dict,not in dict
2,fasttext_wiki_news_subword,in dict,in dict,in dict,in dict,in dict,in dict,not in dict
3,fasttext_crawl,in dict,in dict,in dict,in dict,in dict,in dict,not in dict
4,fasttext_crawl_subword,in dict,in dict,in dict,in dict,in dict,in dict,not in dict
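The table above depends on how each test word is matched against a cased or uncased vocabulary. The sketch below shows one plausible way to do that lookup. It is an assumption about the logic, not the code of `utils.process_test_words`, and the toy vocabularies are made up.

```python
def lookup_status(word, vocabulary, cased):
    """Report whether `word` is in `vocabulary`, lowercasing first for uncased models."""
    key = word if cased else word.lower()
    return "in dict" if key in vocabulary else "not in dict"


# Toy vocabularies: keys are single tokens, as in the pretrained models.
toy_cased_vocab = {"New", "York", "cyclotron", "123"}
toy_uncased_vocab = {"new", "york", "cyclotron"}

for w in ["New", "New York", "123"]:
    print(w,
          "| cased:", lookup_status(w, toy_cased_vocab, cased=True),
          "| uncased:", lookup_status(w, toy_uncased_vocab, cased=False))
```

With single-token keys, a phrase such as `New York` is reported as missing even though both of its words are present, which matches the last column of the table.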


#### Conclusion

The best pretrained model among the fasttext models is fasttext_crawl_subword, as it has the lowest percentage of uncovered SQuAD words.
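The comparison rests on a simple ratio: the number of SQuAD words missing from a model's vocabulary divided by the number of SQuAD words checked. The sketch below shows that calculation on toy data and then picks the model with the lowest missing percentage. `coverage_report` is a hypothetical helper, not the `utils.get_missing_words` function used in this notebook; the two percentages in `missing_by_model` are copied from the results table.

```python
def coverage_report(words, vocabulary):
    """Return (missing words, missing %, coverage %) for `words` against `vocabulary`."""
    if not words:
        return [], 0.0, 0.0
    missing = [w for w in words if w not in vocabulary]
    missing_pct = len(missing) / len(words) * 100
    return missing, missing_pct, 100 - missing_pct


# Toy data: four words to check, three of them known to the model.
toy_words = ["the", "cyclotron", "zzzq", "york"]
toy_vocabulary = {"the", "york", "cyclotron"}
missing, missing_pct, coverage_pct = coverage_report(toy_words, toy_vocabulary)
print(missing, missing_pct, coverage_pct)

# Choosing the best model means minimising the missing percentage.
missing_by_model = {"fasttext crawl": 18.175948, "fasttext_crawl_subword": 18.044087}
best = min(missing_by_model, key=missing_by_model.get)
print(best)
```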

## Word2vec

Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. 

The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.

It does so in one of two ways, either using context to predict a target word (a method known as continuous bag of words, or CBOW), or using a word to predict a target context, which is called skip-gram. 

https://skymind.ai/wiki/word2vec
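The difference between the two training modes is easiest to see in the training examples they generate. The following sketch only builds those examples and does no training; `training_pairs` and the window size `k` are names chosen for illustration.

```python
def training_pairs(tokens, k=2):
    """Build CBOW examples (context -> target) and skip-gram examples (target -> context word)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Up to k words on each side of the target form its context.
        context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        cbow.append((context, target))                  # whole context predicts the target
        skipgram.extend((target, c) for c in context)   # target predicts each context word
    return cbow, skipgram


tokens = "the quick brown fox jumps".split()
cbow, skipgram = training_pairs(tokens, k=1)
print(cbow[1])
print(skipgram[:3])
```

CBOW yields one example per position, while skip-gram yields one example per (target, context word) pair, so it produces more training examples from the same text.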

pretrained models: full list (https://developer.syn.co.in/tutorial/bot/oscova/pretrained-vectors.html)

In the skip-gram embedding algorithm, the contexts of a word w are the words surrounding it in the text: this is a linear bag-of-words approach to choosing the context. Using a window of size k around the target word w, 2k contexts are produced: the k words before and the k words after w. An alternative to the bag-of-words approach is dependency-based contexts, which are derived from the syntactic relations the word participates in.

https://levyomer.files.wordpress.com/2014/04/dependency-based-word-embeddings-acl-2014.pdf
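To make the dependency-based contexts concrete, the sketch below derives them from a list of dependency arcs: each head receives its modifier with the relation label, and each modifier receives its head with the inverse label (marked `-1`). The arcs are written by hand for the sentence "Australian scientist discovers star with telescope" and stand in for the output of a real parser; `dependency_contexts` is a hypothetical helper.

```python
from collections import defaultdict


def dependency_contexts(arcs):
    """Map each word to its dependency-based contexts, given (head, label, modifier) arcs."""
    contexts = defaultdict(list)
    for head, label, modifier in arcs:
        contexts[head].append(modifier + "/" + label)             # modifier as context of head
        contexts[modifier].append(head + "/" + label + "-1")      # head as context of modifier
    return dict(contexts)


# Hand-written arcs (assumption: a parser would normally produce these).
arcs = [
    ("scientist", "amod", "australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "star"),
    ("discovers", "prep_with", "telescope"),
]

contexts = dependency_contexts(arcs)
for word, ctx in contexts.items():
    print(word, ctx)
```

Unlike a window of size k, the context of `discovers` here includes `telescope`, which is several words away in the sentence, and excludes words that are merely adjacent.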

| Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training algorithm | Context type and window size | Web page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google News | 300 | Google News (100B) | 3M | Google | word2vec | negative sampling | BoW, ~5 | https://code.google.com/archive/p/word2vec/ |
| Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | https://code.google.com/archive/p/word2vec/ |
| Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | https://code.google.com/archive/p/word2vec/ |
| Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified | word2vec | syntactic dependencies | https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/ |
| DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec | word2vec, skip-gram | BoW, 10 | https://github.com/idio/wiki2vec#prebuilt-models |

### Google news

The pretrained model includes word vectors for a vocabulary of 3 million words and phrases that were trained on roughly 100 billion words from a Google News dataset. Each vector has 300 dimensions.

In [80]:
#reload time (initial load time = 6:11:00)
start_time = time.time()
word2vec_google_news = Word_Emb(url='https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', 
               emb_alg = 'word2vec', short_name = 'google_news', binary = True, 
               bin_name = 'GoogleNews-vectors-negative300.bin')
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 275.86 sec.


In [65]:
words_out_of_model_word2vec_google_news = utils.get_missing_words('word2vec word2vec_google_news', 
                                                              word2vec_google_news, utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109964/109964 [11:44<00:00, 156.02it/s]


words out of word2vec word2vec_google_news : 39267, percentage of successful word embedding 64.29104070423048%, unsuccessful 35.708959295769525%
Done. Time: 704.82 sec.


In [81]:
squad_words_and_freqs_df['word2vec word2vec_google_news'] = utils.get_word_existence_for_WE(word2vec_google_news,
                                                                         utils.get_word_list(squad_words_and_freqs))

100%|█████████████████████████████████| 109584/109584 [10:03<00:00, 181.53it/s]


In [66]:
eng_words_out_of_model_word2vec_google_news = utils.get_missing_words('word2vec word2vec_google_news', 
                                                      word2vec_google_news, utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104835/104835 [00:09<00:00, 11129.53it/s]


words out of word2vec word2vec_google_news : 34925, percentage of successful word embedding 66.68574426479707%, unsuccessful 33.314255735202934%
Done. Time: 9.42 sec.


In [82]:
squad_words_and_freqs_eng_df['word2vec word2vec_google_news'] = utils.get_word_existence_for_WE(word2vec_google_news,
                                                                         utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104455/104455 [00:08<00:00, 12148.06it/s]


In [67]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng))
results_df.loc[len(results_df)] = ['word2vec google_news', 1608, '100 billion', '3 million', 
                                    'cased', 'Google News', 'Google',
                                    len(words_out_of_model_word2vec_google_news) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_word2vec_google_news) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_word2vec_google_news) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_word2vec_google_news) / squad_words_en_cnt * 100,
                                    '6:11:00']

In [83]:
#save results with word existence
squad_words_and_freqs_df.to_excel("../data/squad/word_freqs.xlsx")
squad_words_and_freqs_eng_df.to_excel("../data/squad/word_freqs_eng.xlsx")

### Freebase IDs, Freebase names

- freebase-vectors-skipgram1000.bin.gz (Freebase IDs): entity vectors trained on 100B words from various news articles. The key is the Freebase identifier.
- freebase-vectors-skipgram1000-en.bin.gz (Freebase names): entity vectors trained on 100B words from various news articles, using the deprecated /en/ naming (more easily readable); the vectors are sorted by frequency. These are the same vectors as Freebase IDs, but the key is /en/ + word.


The skip-gram model with negative sampling is used to train the vectors in Google Freebase. The vectors in this dataset have 1000 dimensions. To prepare embeddings for phrases, the authors used a statistical approach to find words that appear together more often than separately and then treated them as a single token. In the next step, they replaced these tokens with their corresponding Freebase ID. Freebase is a knowledge base containing millions of entities and concepts, mostly extracted from Wikipedia pages.

https://arxiv.org/pdf/1702.03470.pdf
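The text above does not say which statistic was used to find words that occur together more often than separately. One well-known choice, from the word2vec phrase paper by Mikolov et al. (2013), scores a bigram as (count(a b) - delta) / (count(a) * count(b)) and merges bigrams whose score exceeds a threshold. The sketch below applies that scoring to a toy corpus; `find_phrases`, `delta` and `threshold` values are illustrative assumptions.

```python
from collections import Counter


def find_phrases(sentences, delta=1, threshold=0.15):
    """Return {"a_b": score} for bigrams that co-occur more often than their parts suggest."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    phrases = {}
    for (a, b), n_ab in bigrams.items():
        # delta discounts bigrams that were seen only a few times.
        score = (n_ab - delta) / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases[a + "_" + b] = score
    return phrases


# Toy corpus (made up for illustration).
sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "is", "old"],
    ["a", "new", "car", "is", "fast"],
]
phrases = find_phrases(sentences)
print(phrases)
```

Bigrams that pass the threshold would then be rewritten as single tokens (for example `new_york`) before training.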

In [84]:
#reload time
start_time = time.time()
word2vec_freebase_ids = Word_Emb(url='https://drive.google.com/uc?export=download&confirm=u93n&id=0B7XkCwpI5KDYeFdmcVltWkhtbmM', 
               emb_alg = 'word2vec', short_name = 'freebase_ids', binary = True, bin_name = 'freebase_ids.bin')
print("Done. Time: {} sec.".format(round(time.time()-start_time, 2)))

Done. Time: 80.4 sec.


In [69]:
#squad words to freebase key format 
squad_words_uncased_en = ['/en/'+w for w in squad_words_uncased]
eng_squad_words_uncased_en = ['/en/'+w for w in squad_words_eng_uncased]

In [70]:
words_out_of_model_word2vec_freebase_ids = utils.get_missing_words('word2vec word2vec_freebase_ids', 
                                                                          word2vec_freebase_ids, squad_words_uncased_en)

100%|████████████████████████████████| 109964/109964 [01:35<00:00, 1147.32it/s]


words out of word2vec word2vec_freebase_ids : 75696, percentage of successful word embedding 31.16292604852498%, unsuccessful 68.83707395147502%
Done. Time: 95.85 sec.


In [71]:
eng_words_out_of_model_word2vec_freebase_ids = utils.get_missing_words('word2vec word2vec_freebase_ids', 
                                                                          word2vec_freebase_ids, eng_squad_words_uncased_en)

100%|███████████████████████████████| 104835/104835 [00:07<00:00, 13617.69it/s]


words out of word2vec word2vec_freebase_ids : 70567, percentage of successful word embedding 32.68755663661945%, unsuccessful 67.31244336338055%
Done. Time: 7.7 sec.


In [406]:
#select first 15 rows from database
import sqlite3
 
conn = sqlite3.connect("C:\\MRC\\squad\\.embeddings\\word2vec\\freebase_ids.db")
cursor = conn.cursor()

cursor.execute("SELECT max(rowid) from embeddings")
print('number of rows: %i' % cursor.fetchone()[0])
cursor.execute("select * from embeddings limit 15")
print(cursor.fetchall())

number of rows: 1422903
[('/en/united_states', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/associated_press', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/barack_obama', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/china', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/united_kingdom', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/new_york', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/india', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/europe', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/washington_united_states', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/canada', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/iraq', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/israel', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/california', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/republican_party', b'\x00\x00\x80?\x00\x00\x00@'), ('/en/austraila', b'\x00\x00\x80?\x00\x00\x00@')]
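The second column in the rows above holds raw bytes. Assuming they are little-endian 32-bit floats (an assumption; the storage format is not documented in this notebook), they can be decoded with the standard `struct` module:

```python
import struct


def decode_float32_blob(blob):
    """Decode a bytes object into a list of little-endian 32-bit floats."""
    count = len(blob) // 4  # four bytes per float32
    return list(struct.unpack("<%df" % count, blob[:count * 4]))


# The blob printed for every row in the listing above.
blob = b"\x00\x00\x80?\x00\x00\x00@"
decoded = decode_float32_blob(blob)
print(len(blob), decoded)
```

Under that assumption each listed blob is 8 bytes long and decodes to two values, not a 1000-dimensional vector, and all 15 rows carry the same values. The notebook does not explain why.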


In [72]:
results_df.loc[len(results_df)] = ['word2vec freebase_ids', 2473, '100 billion', '1.4 million', 
                                    'uncased', 'Google News', 'Google',
                                    len(words_out_of_model_word2vec_freebase_ids) / len(squad_words) * 100,
                                    100 -  len(words_out_of_model_word2vec_freebase_ids) / len(squad_words) * 100,
                                    len(eng_words_out_of_model_word2vec_freebase_ids) / len(squad_words_eng) * 100,
                                    100 -  len(eng_words_out_of_model_word2vec_freebase_ids) / len(squad_words_eng) * 100,
                                    '00:23:59']

The missing word percentage is very high for word2vec_freebase_ids because this model uses _ as a word separator and is the only model here that does not split multi-word names into separate words (for example, united_kingdom, new_york).
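A lookup against this model therefore has to build keys in the same shape as the stored ones. The sketch below shows one way to do it, inferred only from the keys visible in the database listing (`/en/` prefix, lowercase, underscores between words); `to_freebase_key` and the toy key set are hypothetical.

```python
def to_freebase_key(text):
    """Turn a word or multi-word name into a '/en/...' key with underscores between words."""
    tokens = text.lower().split()
    return "/en/" + "_".join(tokens)


# A few keys taken from the database listing.
toy_keys = {"/en/united_kingdom", "/en/new_york", "/en/china"}

for name in ["New York", "United Kingdom", "York"]:
    key = to_freebase_key(name)
    print(name, "->", key, "| found:", key in toy_keys)
```

Looking up `New` and `York` separately fails against such keys, while the joined phrase succeeds, which illustrates why word-by-word coverage is low for this model.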

### DBpedia vectors (wiki2vec)

Each Wikipedia page is represented by a DBpedia resource. We use the prebuilt model for the English Wikipedia, without stemming; the vector dimension is 1000.

In [85]:
#initial load time
word2vec_dbpedia = Word_Emb(emb_alg = 'word2vec', short_name = 'dbpedia', shrink_vector_space = True, 
                            binary = False, bin_name = 'en.model', open_as_text = False)

In [74]:
words_out_of_model_word2vec_dbpedia = utils.get_missing_words('word2vec word2vec_dbpedia',word2vec_dbpedia, 
                                                              utils.get_word_list(squad_words_and_freqs))

100%|████████████████████████████████| 109964/109964 [01:20<00:00, 1369.40it/s]


words out of word2vec word2vec_dbpedia : 29111, percentage of successful word embedding 73.52679058600997%, unsuccessful 26.473209413990034%
Done. Time: 80.31 sec.


In [86]:
squad_words_and_freqs_df['word2vec word2vec_dbpedia'] = utils.get_word_existence_for_WE(word2vec_dbpedia,
                                                                         utils.get_word_list(squad_words_and_freqs))

100%|████████████████████████████████| 109584/109584 [01:10<00:00, 1551.06it/s]


In [75]:
eng_words_out_of_model_word2vec_dbpedia = utils.get_missing_words('word2vec word2vec_dbpedia',word2vec_dbpedia, 
                                                                  utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104835/104835 [00:08<00:00, 12709.63it/s]


words out of word2vec word2vec_dbpedia : 25640, percentage of successful word embedding 75.54251919683313%, unsuccessful 24.45748080316688%
Done. Time: 8.25 sec.


In [87]:
squad_words_and_freqs_eng_df['word2vec word2vec_dbpedia'] = utils.get_word_existence_for_WE(word2vec_dbpedia,
                                                                         utils.get_word_list(squad_words_and_freqs_eng))

100%|███████████████████████████████| 104455/104455 [00:07<00:00, 13451.26it/s]


In [76]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng))
results_df.loc[len(results_df)] = ['word2vec dbpedia', 4496, '100 billion', '1 151 090', 
                                    'cased', 'Wikipedia', 'Idio',
                                    len(words_out_of_model_word2vec_dbpedia) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_word2vec_dbpedia) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_word2vec_dbpedia) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_word2vec_dbpedia) / squad_words_en_cnt * 100,
                                    '00:15:49']

In [88]:
#save results with word existence
squad_words_and_freqs_df.to_excel("../data/squad/word_freqs.xlsx")
squad_words_and_freqs_eng_df.to_excel("../data/squad/word_freqs_eng.xlsx")

### Wikipedia dependency

Wikipedia dependency was trained over context extracted from a dependency analysis of Wikipedia articles.

In [89]:
#initial load time
word2vec_wiki_dependency = Word_Emb(emb_alg = 'word2vec', short_name = 'wiki_dependency', shrink_vector_space = True, 
                            binary = False, vec_name = 'deps.words')

In [78]:
words_out_of_model_word2vec_wiki_dependency = utils.get_missing_words('word2vec word2vec_wiki_dependency',
                                              word2vec_wiki_dependency, utils.get_word_list(squad_words_and_freqs_uncased))

100%|███████████████████████████████| 109964/109964 [00:08<00:00, 12857.54it/s]


words out of word2vec word2vec_wiki_dependency : 34102, percentage of successful word embedding 68.98803244698266%, unsuccessful 31.01196755301735%
Done. Time: 8.56 sec.


In [90]:
squad_words_and_freqs_uncased_df['word2vec word2vec_wiki_dependency'] = utils.get_word_existence_for_WE(word2vec_wiki_dependency,
                                                                 utils.get_word_list(squad_words_and_freqs_uncased))

100%|█████████████████████████████████| 95465/95465 [00:06<00:00, 14144.25it/s]


In [79]:
eng_words_out_of_model_word2vec_wiki_dependency = utils.get_missing_words('word2vec word2vec_wiki_dependency',
                                              word2vec_wiki_dependency, utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|███████████████████████████████| 104835/104835 [00:07<00:00, 13467.24it/s]


words out of word2vec word2vec_wiki_dependency : 30049, percentage of successful word embedding 71.33686268898745%, unsuccessful 28.663137311012544%
Done. Time: 7.79 sec.


In [93]:
squad_words_and_freqs_eng_uncased_df['word2vec word2vec_wiki_dependency'] = utils.get_word_existence_for_WE(word2vec_wiki_dependency,
                            utils.get_word_list(squad_words_and_freqs_eng_uncased))

100%|█████████████████████████████████| 90601/90601 [00:06<00:00, 13804.03it/s]


In [80]:
squad_words_cnt = len(utils.get_word_list(squad_words_and_freqs_uncased))
squad_words_en_cnt = len(utils.get_word_list(squad_words_and_freqs_eng_uncased))
results_df.loc[len(results_df)] = ['word2vec wiki_dependency', 839, '100 billion', '174 015', 
                                    'cased', 'Wikipedia', 'Levy & Goldberg',
                                    # denominators match the uncased word lists passed to get_missing_words
                                    len(words_out_of_model_word2vec_wiki_dependency) / squad_words_cnt * 100,
                                    100 -  len(words_out_of_model_word2vec_wiki_dependency) / squad_words_cnt * 100,
                                    len(eng_words_out_of_model_word2vec_wiki_dependency) / squad_words_en_cnt * 100,
                                    100 -  len(eng_words_out_of_model_word2vec_wiki_dependency) / squad_words_en_cnt * 100,
                                    '00:05:03']

In [94]:
#save results with word existence
squad_words_and_freqs_uncased_df.to_excel("../data/squad/word_freqs_uncased.xlsx")
squad_words_and_freqs_eng_uncased_df.to_excel("../data/squad/word_freqs_eng_uncased.xlsx")

In [81]:
results_df

Unnamed: 0,name,size(Mb),number of words in training,number of unique tokens,cased,source of words,made by,missing words %,embedding coverage %,missing words with latin letters %,embedding coverage with latin letters %,first loading time hours/min/sec
0,GloVe common_crawl_840,2126,840 billion,2 196 016,cased,Common Crawl,Stanford Univ,18.329635,81.670365,15.604521,84.395479,1:46:38
1,GloVe common_crawl_42,1833,42 billion,1 917 495,uncased,Common Crawl,Stanford Univ,14.806664,85.193336,12.110459,87.889541,1:30:10
2,GloVe twitter,1484,27 billion,1 193 515,uncased,Tweets,Stanford Univ,42.868575,57.131425,41.387895,58.612105,00:27:06
3,GloVe wikipedia_gigaword,841,6 billion,400 000,uncased,Wikipedia 2014 + Gigaword 5,Stanford Univ,22.313666,77.686334,19.90175,80.09825,00:03:58
4,fasttext fasttext_wiki,10114,more than 100 millions,2 518 927,uncased,Wikipedia,"P. Bojanowski*, E. Grave, A. Joulin, T. Mikolov",20.498527,79.501473,19.332284,80.667716,12:10:02
5,fasttext fasttext_wiki_news,665,16 billion,1 million,cased,"Wikipedia 2017, UMBC webbase corpus and statmt...","T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsc...",19.321778,80.678222,17.040111,82.959889,00:29:42
6,fasttext fasttext_wiki_news_subword,574,16 billion,1 million,cased,"Wikipedia 2017, UMBC webbase corpus and statmt...","T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsc...",19.321778,80.678222,17.040111,82.959889,00:27:12
7,fasttext crawl,1488,600 billion,2 million,cased,Common Crawl,"T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsc...",18.175948,81.824052,15.667477,84.332523,1:43:49
8,fasttext fasttext_crawl_subword,5691,600 billion,2 million,cased,Common Crawl,"T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsc...",18.044087,81.073115,14.811211,84.464158,3:50:24
9,word2vec google_news,1608,100 billion,3 million,cased,Google News,Google,35.708959,64.291041,33.314256,66.685744,6:11:00


### Conclusion

The best pretrained model among the word2vec models is dbpedia, as it has the lowest percentage of uncovered SQuAD words. 
The best pretrained model among all models is GloVe common_crawl_42, as it has the lowest percentage of uncovered SQuAD words.

## ELMO

ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. 

The representations are generated from a function of the entire sentence to create word-level representations. The embeddings are generated at the character level, so, like FastText, they can capitalize on sub-word units and do not suffer from the out-of-vocabulary problem.

In [1]:
from embeddings.elmo import ElmoEmbedding

In [2]:
elmo = ElmoEmbedding()

In [3]:
for w in ['canada', 'ghjffg', 'cyclotron', 'New York']:
    print('embedding for {0} = {1}'.format(w, elmo.emb(w)))

embedding for canada = [-0.12928208708763123, 0.1797446757555008, 0.32043027877807617, 0.09774105250835419, -0.23000513017177582, 0.045169465243816376, 0.3189358711242676, 0.08198004961013794, 0.06236783415079117, -0.1548035591840744, 0.026339039206504822, 0.2822563052177429, 0.3813740015029907, -0.39079368114471436, 0.05570315569639206, -0.24463962018489838, -0.15392930805683136, -0.13432516157627106, 0.10219113528728485, 0.20746605098247528, -0.44593366980552673, -0.3348473012447357, -0.3258061707019806, 0.41351082921028137, 0.49114280939102173, -0.08096278458833694, 0.03477930277585983, -0.01950651779770851, 0.14569328725337982, 0.2253284752368927, -0.03725171834230423, 0.1801299750804901, 0.05353356897830963, -0.016776323318481445, 0.012518258765339851, 0.06441564857959747, -0.0976974368095398, 0.24552100896835327, -0.26171183586120605, -0.1044323593378067, -0.05325465649366379, 0.11762401461601257, 0.048691753298044205, 0.08610888570547104, -0.2276361882686615, 0.39680972695350647

ELMo encodes any combination of characters, even if it does not exist in an English dictionary.