# Language Technology - 3rd Tutorial on Word Embeddings (Word2Vec, FastText) using Gensim


----
## Important Resources

Language Technology Resources: https://eclass.aueb.gr/modules/document/index.php?course=INF210

Python Official Documentation: https://docs.python.org/3.5/

NLTK: https://www.nltk.org

Gensim: https://radimrehurek.com/gensim/models/word2vec.html

Word2vec: https://code.google.com/archive/p/word2vec/

----

## Tutorial Schedule

**In order to familiarize ourselves with word embeddings, we have the following sections:**

* **Train Word Embeddings**
    * **Data Cleansing**
    * **Lazy Load Data**
    * **Train Word2Vec Model**
    * **Train FastText Model**
* **Use Word Embeddings**
* **Train Word Embeddings with extra special tokens**

----
# Background

In natural language processing (NLP), a word embedding is a feature representation that maps words (tokens) or phrases (multi-token units) from the vocabulary to vectors of real numbers. Such distributed representations help learning algorithms perform better on NLP tasks, because similar words end up close together in the vector space. The main benefit is the move from traditional sparse features (e.g. one-hot vectors) to a dense vector space whose dimensions are shared across words.
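
As a minimal sketch of the sparse-versus-dense contrast, the snippet below compares cosine similarities under both representations. The three-word vocabulary and the dense values are invented for illustration and are not learned from any corpus; the helper name `cosine` is also an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocabulary = ['king', 'queen', 'apple']

# Sparse representation: one dimension per vocabulary word.
one_hot = {word: [1.0 if i == j else 0.0 for j in range(len(vocabulary))]
           for i, word in enumerate(vocabulary)}

# Dense representation: hand-picked toy values, NOT learned from data.
dense = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.8, 0.9, 0.2],
    'apple': [0.1, 0.2, 0.9],
}

# Distinct one-hot vectors are orthogonal, so every pair looks equally unrelated.
print(cosine(one_hot['king'], one_hot['queen']))

# Dense vectors can express graded similarity between words.
print(cosine(dense['king'], dense['queen']))
print(cosine(dense['king'], dense['apple']))
```

With one-hot vectors every pair of different words has similarity zero, whereas dense vectors leave room for "king" to be closer to "queen" than to "apple".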


## Word2vec model sampling

_"You shall know a word by the company it keeps" (Firth, J. R. 1957:11)_

<img src="https://i.imgur.com/v34sAaT.png" width="700">
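
The sampling idea can be sketched in a few lines: every token is paired with the neighbours that fall inside a fixed context window. The function name `skipgram_pairs` and the example sentence are assumptions made for this illustration; real implementations add refinements (for example subsampling of frequent words) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = 'the quick brown fox jumps'.split()
for pair in skipgram_pairs(sentence, window=2):
    print(pair)
```

Each printed pair is one training example: given the first word, the model should assign high probability to the second.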




## Word2vec model architecture

<img src="https://lilianweng.github.io/lil-log/assets/images/word2vec-skip-gram.png" width="700">
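
For reference, the skip-gram training objective is usually written as below. This is the textbook full-softmax form; practical implementations approximate the softmax (for example with negative sampling or hierarchical softmax), so treat it as a conceptual sketch. Here $T$ is the number of tokens in the corpus, $c$ the window size, $|V|$ the vocabulary size, and $v_w$, $v'_w$ the input and output vectors of word $w$.

$$
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{|V|} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
$$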



## Publicly available word embeddings

Various pre-trained word embeddings are publicly available. A well-known example is the Google News vectors, trained on a Google News corpus of about 100 billion words, with a vocabulary of 3 million words and phrases.

See https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models for a list of popular pre-trained word embedding sets.


----
# Train Word Embeddings

To train word embeddings with the gensim library, we need to prepare a list of sentences to feed to the word2vec model, and we need to do so in a memory-friendly way. In this tutorial we follow a two-step approach: first we clean the corpus and write it to disk, then we stream it back lazily.


## Data Cleansing - Preprocessing

First, we clean (pre-process) the corpus by producing a new set of files in which every line is one sentence and the sentence's tokens are separated by a single space.

**Python Code Example**

In [22]:
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import gutenberg
import os
import re

PROCESSED_CORPUS_FOLDER = '/Users/kiddo/Desktop/word2vec_corpus'

# Create new folder if it does not exist
if not os.path.exists(PROCESSED_CORPUS_FOLDER):
    os.mkdir(PROCESSED_CORPUS_FOLDER)

# Get filenames for Gutenberg corpus
file_ids = gutenberg.fileids()

print(file_ids)
print('{} BOOKS LOADED!'.format(len(file_ids)))

# Iterate over filenames
for filename in file_ids:
    # Create new file including one sentence per line, tokenized on white-space 
    with open(os.path.join(PROCESSED_CORPUS_FOLDER, filename), 'w', encoding='utf-8') as output_file:
        input_text = gutenberg.raw(filename)
        for sentence in sent_tokenize(input_text):
            sentence = re.sub('\n+', ' ', sentence)
            splitted_sentence = ' '.join([token.lower() for token in word_tokenize(sentence)]) + '\n'
            output_file.write(splitted_sentence)
    print(filename + ' DONE')

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
18 BOOKS LOADED!
austen-emma.txt DONE
austen-persuasion.txt DONE
austen-sense.txt DONE
bible-kjv.txt DONE
blake-poems.txt DONE
bryant-stories.txt DONE
burgess-busterbrown.txt DONE
carroll-alice.txt DONE
chesterton-ball.txt DONE
chesterton-brown.txt DONE
chesterton-thursday.txt DONE
edgeworth-parents.txt DONE
melville-moby_dick.txt DONE
milton-paradise.txt DONE
shakespeare-caesar.txt DONE
shakespeare-hamlet.txt DONE
shakespeare-macbeth.txt DONE
whitman-leaves.txt DONE


----
## Lazy Load Data

Second, we create an iterable class that loads the sentences lazily, one line at a time, so the whole corpus never has to fit in memory.

**Python Code Example**

In [39]:
import os
import glob

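class CorpusLoader(object):
    """Stream tokenized sentences from every .txt file in a folder, one line at a time."""

    def __init__(self, parent_folder=None):
        self.dirname = parent_folder

    def __iter__(self):
        # glob already returns paths that include the folder, so open them directly
        for fname in glob.glob(os.path.join(self.dirname, '*.txt')):
            # Read with the encoding the files were written in; close each file when done
            with open(fname, encoding='utf-8') as input_file:
                for line in input_file:
                    yield line.split()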
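
The loader is a class with an `__iter__` method rather than a plain generator function because training passes over the corpus more than once (once to build the vocabulary, then once per epoch). The sketch below illustrates the difference on an in-memory list; `sentence_generator`, `SentenceIterable` and `toy_lines` are hypothetical names used only for this illustration.

```python
def sentence_generator(lines):
    """A plain generator: it can be consumed only once."""
    for line in lines:
        yield line.split()

class SentenceIterable(object):
    """A restartable iterable: every loop calls __iter__ and starts a fresh pass."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

toy_lines = ['the cat sat', 'the dog barked']  # stand-in for lines read from disk

gen = sentence_generator(toy_lines)
print(len(list(gen)), len(list(gen)))  # 2 0 -> the second pass is empty

it = SentenceIterable(toy_lines)
print(len(list(it)), len(list(it)))    # 2 2 -> both passes see every sentence
```

A generator that is already exhausted would silently give the training step nothing to learn from, which is why a restartable iterable is the safer choice here.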


## Train Word2Vec Model

Initialize a Word2Vec model with specific parameters and train it on the cleansed corpus, streamed through our CorpusLoader.

**Python Code Example**

In [40]:
from gensim.models import Word2Vec
from multiprocessing import cpu_count

# CREATE GENERATOR OBJECT TO STREAM SENTENCES
sentences = CorpusLoader(parent_folder=PROCESSED_CORPUS_FOLDER)

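# CONFIGURE WORD2VEC MODEL (SKIP-GRAM, 50-DIMENSIONAL VECTORS)
# NOTE: GENSIM 3.x API; IN GENSIM 4.x THE `size` ARGUMENT IS NAMED `vector_size`
model = Word2Vec(min_count=10, workers=cpu_count(), size=50, sg=1, window=5)

# BUILD VOCABULARY FROM SENTENCES CONSIDERING MIN_COUNT
print('START BUILDING VOCABULARY...')
model.build_vocab(sentences)
# NOTE: IN GENSIM 4.x `model.wv.index2word` IS NAMED `model.wv.index_to_key`
print('VOCABULARY_SIZE: ',len(model.wv.index2word))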

# TRAIN MODEL
print('START TRAINING...')
model.train(sentences, total_examples=model.corpus_count, epochs=10)

model.wv.save_word2vec_format(os.path.join(PROCESSED_CORPUS_FOLDER, 'WORD2VEC.bin'), binary=True)
print('MODEL SAVED....')

START BUILDING VOCABULARY...
VOCABULARY_SIZE:  11028
START TRAINING...
MODEL SAVED....


## Train FastText Model

**Python Code Example**

In [58]:
from gensim.models import FastText

from multiprocessing import cpu_count

# CREATE GENERATOR OBJECT TO STREAM SENTENCES
sentences = CorpusLoader(parent_folder=PROCESSED_CORPUS_FOLDER)

# CONFIGURE FASTTEXT MODEL
model = FastText(workers=cpu_count(), size=50, sg=1, window=5)

# BUILD VOCABULARY FROM SENTENCES CONSIDERING MIN_COUNT
print('START BUILDING VOCABULARY...')
model.build_vocab(sentences)
print('VOCABULARY_SIZE: ',len(model.wv.index2word))

# TRAIN MODEL
print('START TRAINING...')
model.train(sentences, total_examples=model.corpus_count, epochs=10)

model.save(os.path.join(PROCESSED_CORPUS_FOLDER, 'FASTTEXT.bin'))
print('MODEL SAVED....')

START BUILDING VOCABULARY...
VOCABULARY_SIZE:  17212
START TRAINING...
MODEL SAVED....
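
What sets FastText apart from Word2Vec is that it represents a word through its character n-grams, which lets it build a vector for a token it never saw in training, such as 'dog-walker'. The snippet below is a rough sketch of the n-gram extraction step only; the function name `char_ngrams` is an assumption of this example, and the real model additionally hashes the n-grams into buckets and sums their vectors, which is not shown.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams('dog', min_n=3, max_n=4))
# ['<do', 'dog', 'og>', '<dog', 'dog>']

# An unseen compound shares n-grams with words that were in the training data.
shared = set(char_ngrams('dog')) & set(char_ngrams('dog-walker'))
print(sorted(shared))
```

The shared n-grams are what allow an out-of-vocabulary word to inherit part of the meaning of the known words it overlaps with.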


----
# Use Word Embeddings

**Python Code Example**

In [51]:
from gensim.models import KeyedVectors

# LOAD WORD2VEC LOOKUP TABLE
w2v_model = KeyedVectors.load_word2vec_format(os.path.join(PROCESSED_CORPUS_FOLDER, 'WORD2VEC.bin'), binary=True)

print(w2v_model['good'])

[-0.10449308 -0.43362677  0.19262557 -0.2593011   0.17601196 -0.28308824
 -0.17368805 -0.1415757  -0.63802737  0.1822876   0.44478205  0.03437993
 -0.73681957  0.13764866 -0.18100835  0.14415143  0.45553008  0.11168631
 -0.32792276  0.24922058  0.55671835  0.03754492  0.22137128 -0.23959972
  0.13231476  1.1833037   0.49699035  0.09289576 -0.12476934 -0.14078905
 -0.05570467 -0.409427    0.05100378  0.44504216  0.3083692   0.20310277
  0.3315212  -0.04255746  0.04521555 -0.01845262 -0.26361692 -0.0837402
 -0.38756976  0.44985986 -0.36964557 -0.14428945  0.6458716  -0.13912514
 -0.10920626 -0.08730115]


In [56]:
# SEARCH (LOOKUP) WORD EMBEDDINGS
words = ['night', 'dog', 'umbrella', 'shoe', 'man', 'woman', 'boat', 'love', 'george', '5', '587']

# FIND MOST SIMILAR WORDS
for word in words:
    if word in w2v_model:
        record = '{:8}\n'.format(word)
        record += '--------------------\n'
        for sim_word, sim_score in w2v_model.most_similar(positive=[word]):
            record += '{:15}| {:.2f}  \n'.format(sim_word, sim_score)
        print(record)
    else:
        print('{:8} NOT FOUND'.format(word))


night   
--------------------
day            | 0.80  
afternoon      | 0.76  
morning        | 0.73  
autumn         | 0.72  
waking         | 0.71  
awake          | 0.70  
musing         | 0.70  
dawn           | 0.70  
winter         | 0.69  
noon           | 0.68  

dog     
--------------------
thief          | 0.76  
pet            | 0.75  
todhunter      | 0.72  
lion           | 0.72  
alligator      | 0.71  
boy            | 0.71  
fox            | 0.70  
pig            | 0.69  
bullet         | 0.69  
farmer         | 0.68  

umbrella
--------------------
frock-coat     | 0.91  
jacket         | 0.89  
waistcoat      | 0.87  
collar         | 0.87  
handkerchief   | 0.87  
man's          | 0.87  
moustache      | 0.86  
clerical       | 0.86  
slouched       | 0.86  
tight          | 0.86  

shoe    
--------------------
skirt          | 0.77  
mane           | 0.76  
foot           | 0.76  
neck           | 0.76  
vesture        | 0.75  
heels          | 0.75  
bedchamber   

In [53]:
# SIMILARITY MEASURE
print(w2v_model.similarity('man', 'woman'))
print(w2v_model.similarity('man', 'pig'))
print(w2v_model.similarity('man', 'chair'))

0.7261805
0.52326065
0.43960875
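
The scores above are cosine similarities between word vectors, and a nearest-neighbour query simply ranks the vocabulary by that score. The following is a brute-force sketch of the idea; the two-dimensional `toy_vectors` are invented for illustration (they are not taken from the trained model), and the local function `most_similar` is a simplified stand-in, not the gensim implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 2-dimensional vectors, invented for illustration only.
toy_vectors = {
    'man':   [1.0, 0.2],
    'woman': [0.9, 0.4],
    'pig':   [0.5, 0.9],
    'chair': [-0.2, 1.0],
}

for neighbour, score in most_similar('man', toy_vectors, topn=3):
    print('{:8}| {:.2f}'.format(neighbour, score))
```

Because cosine similarity depends only on the angle between two vectors, words are compared by direction rather than by vector length.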


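Unlike Word2Vec, FastText does not rely on a fixed vocabulary: it composes word vectors from character n-grams, so it can also produce vectors for words that never appeared in the training corpus (e.g. 'dog-walker' below). For this to work we need to load the full trained model rather than just a lookup table of word vectors.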

In [62]:
fasttext_model = FastText.load(os.path.join(PROCESSED_CORPUS_FOLDER, 'FASTTEXT.bin'))

# SEARCH FASTTEXT MODEL
words = ['night', 'dog', 'umbrella', 'shoe', 'man', 'woman', 'boat', 'love', 'george', '5', 'dog-walker']

# FIND MOST SIMILAR WORDS
for word in words:
    record = '{:8}\n'.format(word)
    record += '--------------------\n'
    for sim_word, sim_score in fasttext_model.wv.most_similar(positive=[word]):
        record += '{:15}| {:.2f}  \n'.format(sim_word, sim_score)
    print(record)

night   
--------------------
nightfall      | 0.84  
daylight       | 0.80  
nightmare      | 0.79  
noonday        | 0.79  
midnight       | 0.79  
yesternight    | 0.78  
morning        | 0.77  
day            | 0.76  
noon           | 0.75  
tonight        | 0.75  

dog     
--------------------
cage           | 0.80  
yarman         | 0.80  
todhunter      | 0.78  
cradle         | 0.77  
tomb           | 0.77  
chimney        | 0.76  
pig            | 0.75  
chimney-sweeper| 0.75  
oyster         | 0.75  
banker         | 0.75  

umbrella
--------------------
frock-coat     | 0.85  
dressing-gown  | 0.85  
shawl          | 0.85  
waistcoat      | 0.83  
negro          | 0.83  
jacket         | 0.83  
turkish        | 0.82  
sombre         | 0.82  
turban         | 0.82  
arm-chair      | 0.81  

shoe    
--------------------
foot           | 0.81  
foote          | 0.80  
shoes          | 0.79  
toe            | 0.79  
shod           | 0.78  
cloak          | 0.78  
loafe        

----
# Train Word2Vec Model with extra special tokens

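We extend the preprocessing with three kinds of special tokens: (a) numbers are normalized by replacing every digit with D, (b) rare words (fewer than 10 occurrences) are replaced by the unknown token #UNK#, and (c) newline characters are preserved as the token #NEWLINE#.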
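
To see the effect of steps (a) and (b) on a handful of tokens before running them over the whole corpus, here is a small standalone sketch. The helper `normalize_token`, the vocabulary `toy_vocab` and the example tokens are all hypothetical; newline handling, step (c), is left out.

```python
import re

def normalize_token(token, vocab):
    """Lower-case a token, mask its digits with 'D', and map unknown tokens to '#UNK#'."""
    token = re.sub('[0-9]', 'D', token.lower())
    return token if token in vocab else '#UNK#'

# Hypothetical set of tokens that survived the frequency cut-off.
toy_vocab = {'chapter', 'DD', 'in', 'D,DDD', 'the'}

tokens = ['Chapter', '12', 'in', '1,851', 'the', 'Pequod']
print([normalize_token(t, toy_vocab) for t in tokens])
# ['chapter', 'DD', 'in', 'D,DDD', 'the', '#UNK#']
```

Masking digits collapses all numbers of the same shape into one token, so the model can learn a vector for "a two-digit number" instead of thousands of rare individual numbers.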

**Python Code Example**

In [78]:
from collections import Counter

# Create new folder if it does not exist
if not os.path.exists(PROCESSED_CORPUS_FOLDER):
    os.mkdir(PROCESSED_CORPUS_FOLDER)

# Get filenames for Gutenberg corpus
file_ids = gutenberg.fileids()

print(file_ids)
print('{} BOOKS LOADED!'.format(len(file_ids)))

tokens = []

# Iterate over filenames
for filename in file_ids:
    # First pass: collect all normalized tokens so we can count their frequencies
    input_text = gutenberg.raw(filename)
    for sentence in sent_tokenize(input_text):
        # Preserve newlines by marking them with a snowman placeholder :D
        sentence = re.sub('\n+', ' ☃️ ', sentence)
        tokens.extend([token.lower() if not re.search('[0-9]', token) else re.sub('[0-9]', 'D', token.lower())
                       for token in word_tokenize(sentence)])
print('TOTAL TOKENS IN CORPUS: {}'.format(len(tokens)))
vocab = Counter(tokens)

print('10 MOST COMMON TOKENS: {}'.format(vocab.most_common(10)))

# ELIMINATE RARE WORDS (<10)
vocab = {k:v for k,v in vocab.most_common() if v >= 10}

# Iterate over filenames
for filename in file_ids:
    # Create new file including one sentence per line, tokenized on white-space 
    with open(os.path.join(PROCESSED_CORPUS_FOLDER, filename), 'w', encoding='utf-8') as output_file:
        input_text = gutenberg.raw(filename)
        for sentence in sent_tokenize(input_text):
            sentence = re.sub('\n+', ' ☃️ ', sentence)
            normalized_tokens = []
            for token in word_tokenize(sentence):
                if token == '☃️':
                    normalized_tokens.append('#NEWLINE#')
                elif re.search('[0-9]', token):
                    token_norm = re.sub('[0-9]', 'D', token.lower())
                    if token_norm in vocab:
                        normalized_tokens.append(token_norm)
                    else:
                        normalized_tokens.append('#UNK#')
                else:
                    if token.lower() in vocab:
                        normalized_tokens.append(token.lower())
                    else:
                        normalized_tokens.append('#UNK#')
            splitted_sentence = ' '.join(normalized_tokens) + '\n'
            output_file.write(splitted_sentence)
    print(filename + ' DONE')
    

from gensim.models import Word2Vec
from multiprocessing import cpu_count

# CREATE GENERATOR OBJECT TO STREAM SENTENCES
sentences = CorpusLoader(parent_folder=PROCESSED_CORPUS_FOLDER)

# CONFIGURE WORD2VEC MODEL
model = Word2Vec(workers=cpu_count(), size=50, sg=1, window=5)

# BUILD VOCABULARY FROM SENTENCES CONSIDERING MIN_COUNT
print('START BUILDING VOCABULARY...')
model.build_vocab(sentences)
print('VOCABULARY_SIZE: ',len(model.wv.index2word))

# TRAIN MODEL
print('START TRAINING...')
model.train(sentences, total_examples=model.corpus_count, epochs=10)

model.wv.save_word2vec_format(os.path.join(PROCESSED_CORPUS_FOLDER, 'WORD2VEC_UNK.bin'), binary=True)
print('MODEL SAVED....')

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
18 BOOKS LOADED!
10 MOST COMMON TOKENS: [(',', 192339), ('☃️', 153371), ('the', 133513), ('and', 95296), ('.', 76625), ('of', 71206), ('to', 47692), ('a', 33804), ('in', 33480), ('i', 29973)]
austen-emma.txt DONE
austen-persuasion.txt DONE
austen-sense.txt DONE
bible-kjv.txt DONE
blake-poems.txt DONE
bryant-stories.txt DONE
burgess-busterbrown.txt DONE
carroll-alice.txt DONE
chesterton-ball.txt DONE
chesterton-brown.txt DONE
chesterton-thursday.txt DONE
edgeworth-parents.txt DONE
melville-moby_dick.txt DONE
milton-paradise.txt DONE
shakespeare-caesar.txt DONE
shakespeare-ha

In [82]:
from gensim.models import KeyedVectors

# LOAD WORD2VEC LOOKUP TABLE
w2v_model = KeyedVectors.load_word2vec_format(os.path.join(PROCESSED_CORPUS_FOLDER, 'WORD2VEC_UNK.bin'), binary=True)

special_tokens = ['#UNK#', '#NEWLINE#', 'DD', 'DDDD', 'D,DDD']

# FIND MOST SIMILAR WORDS
for word in special_tokens:
    if word in w2v_model:
        record = '{:8}\n'.format(word)
        record += '--------------------\n'
        for sim_word, sim_score in w2v_model.most_similar(positive=[word]):
            record += '{:15}| {:.2f}  \n'.format(sim_word, sim_score)
        print(record)
    else:
        print('{:8} NOT FOUND'.format(word))

#UNK#   
--------------------
#NEWLINE#      | 0.89  
,              | 0.86  
precision      | 0.81  
ramadan        | 0.79  
motto          | 0.78  
preliminary    | 0.78  
and            | 0.77  
practically    | 0.77  
correctly      | 0.77  
obviously      | 0.77  

#NEWLINE#
--------------------
,              | 0.93  
#UNK#          | 0.89  
and            | 0.88  
;              | 0.86  
that           | 0.83  
to             | 0.81  
--             | 0.80  
the            | 0.80  
of             | 0.78  
curate         | 0.77  

DD      
--------------------
chapter        | 0.86  
DDD            | 0.80  
december       | 0.75  
DDDD           | 0.74  
iii            | 0.74  
a.d.           | 0.74  
january        | 0.73  
v              | 0.72  
august         | 0.71  
vi             | 0.71  

DDDD    
--------------------
december       | 0.85  
a.d.           | 0.84  
june           | 0.81  
august         | 0.80  
]              | 0.79  
iii            | 0.79  
[           