# LAB2.3 Creating word embeddings from a text corpus

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we are going to show how you can build a word embedding model from a text corpus:

<ol>
<li>Obtain a text corpus from the web. We will use the Leipzig Corpora Collection, which contains texts in different languages that have already been preprocessed.
<li>Tokenize the text to get lists of individual words grouped by sentences. We use the NLTK toolkit and a specific tokenization function to do that.
<li>Create an embedding model from the tokenized text using the Gensim package.
<li>Demonstrate how to save the embedding model to disk and load it again for usage.
</ol>


## 1. Obtaining text from the Leipzig Corpora Collection

The Leipzig Corpora Collection has corpora for over 250 languages. These corpora are collected from Wikipedia, news and web crawls and have been curated for research purposes.

For this notebook, you download a corpus in a language of your choice from:

http://wortschatz.uni-leipzig.de/en/download/

We will use the 'eng_news_2005_1M-text' corpus here for demonstration purposes. The files come as compressed ```tar``` files (extension ".tar.gz"). Depending on your decompression software, you may first need to decompress the file and then unpack the resulting tar file; some programs do both in one step.
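
If you prefer to stay in Python, the standard-library `tarfile` module can decompress and unpack in one go. The following is a minimal sketch; the function name `unpack_corpus` and the example file names in the comment are assumptions for illustration, not part of the original notebook.

```python
import tarfile
from pathlib import Path


def unpack_corpus(archive_path, target_dir):
    """Decompress and unpack a .tar.gz archive into target_dir.

    Returns the list of member names found in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" tells tarfile to read a gzip-compressed tar archive
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target)
    return names


# Hypothetical usage (adapt both paths to your own computer):
# unpack_corpus("eng_news_2005_1M-text.tar.gz", "leipzig-corpora")
```

Only unpack archives from sources you trust, since a tar file can contain arbitrary paths.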

Unpack the decompressed ```tar``` file somewhere on your computer. This yields a directory, e.g. "eng_news_2005_1M-text", containing a number of files. For example, the file "...-sources.txt" lists the URLs from which the text was obtained, each preceded by an identifier and followed by the date of crawling:

```
1	http://davesipaq.com/articles/iPAQ_Plustek_portable_scanning_solution.html	2005-06-12
2	http://www.independent.com/cover/Cover959.htm	2005-04-08
3	http://www.insidebayarea.com/ci_2736737?rss	2005-05-15
4	http://www.dailycollegian.com/vnews/display.v/ART/2005/05/13/4282dbfadd830	2005-05-12
5	http://p2pnet.net/story/4856	2005-05-16
6	http://www.imf.org/external/np/tr/2005/tr050324a.htm	2005-04-09
```

The file "...-words.txt" contains the vocabulary of words with their frequency, e.g.:

```
452	law	5521
453	making	5514
454	record	5511
455	whether	5496
456	times	5488
457	St.	5485
458	scored	5484
459	taken	5484
```

We are going to use the file named "...-sentences.txt", which contains one sentence per line, each preceded by an identifier, e.g.:

```
1	I didn't know it was police housing," officers quoted Tsuchida as saying.
2	You would be a great client for Southern Indiana Homeownership's credit counseling but you are saying to yourself "Oh, we can pay that off."
3	He believes the 21st century will be the "century of biology" just as the 20th century was the century of IT.
4	They even call the civil rights organization a bit hypocritical.
```
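
To make the line format concrete, here is a small sketch that separates the identifier from the sentence. It assumes the two fields are separated by a tab character, as the samples above suggest; the function name `split_leipzig_line` is made up for this illustration.

```python
def split_leipzig_line(line):
    """Split one line of a "...-sentences.txt" file into (identifier, sentence).

    Assumption: the identifier and the sentence are separated by a single tab.
    """
    identifier, sentence = line.rstrip("\n").split("\t", 1)
    return int(identifier), sentence


line = "4\tThey even call the civil rights organization a bit hypocritical.\n"
sentence_id, text = split_leipzig_line(line)
print(sentence_id)
print(text)
```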

Our goal is to use these sentences to create word embeddings. To be able to do that, we need to process this file line by line, obtain the tokens from each sentence and separate punctuation from the tokens. We are going to do this with the NLTK toolkit and define a function called 'preprocess_leipzig_sentences' that does all the work for us.

## Preprocessing function

What is a function? 

A function is an ordered sequence of instructions packaged into a group (like a recipe) with a name and possibly parameters between round brackets. So far you have been calling functions for instances of classes such as ```string``` or ```list``` that have been defined by other programmers. You can, however, also define functions yourself. This is especially useful if:

<ol>
<li> the code becomes too long and you want to group many smaller steps into higher-level steps without bothering about what happens inside: e.g. like playing music instead of pushing piano keys.
<li> the code needs to be applied more than once and you want to avoid repeating it while making sure it behaves consistently across the repeated calls.
</ol>
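
As a small illustration of defining and then calling a function, consider the sketch below. The function `count_long_words` and its parameter `min_length` are invented for this example and are not used elsewhere in the notebook.

```python
def count_long_words(words, min_length=5):
    """Return how many words have at least min_length characters."""
    count = 0
    for word in words:
        if len(word) >= min_length:
            count += 1
    return count


sample = ["the", "century", "of", "biology"]

# Defining the function does nothing yet; we have to call it.
print(count_long_words(sample))      # uses the default min_length=5 -> 2
print(count_long_words(sample, 3))   # overrides the default -> 3
```

The same function can be reused on any list of words, which is exactly why we package the preprocessing steps into functions.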
    
The function that we define below calls other functions in turn, including a helper that we define ourselves. It is therefore a higher-level function that groups several smaller steps.

Once defined, we only need to apply this function to a local file on our disk to carry out a whole series of instructions, and we can easily do this many times for all kinds of files in the same format, e.g. downloaded from the Leipzig website. The function guarantees that exactly the same process is applied each time.

The next cell contains the processing function. After you run the cell in your notebook, the function is available to do the work for you. This means it is defined but it has not been used yet. For that we need to apply it to something. We do that right after defining it.

For now, you can try to read and understand the function or just call it when you need it.

In [3]:
# We use the NLTK tokenization functions to process the text
# For this we import the functions word_tokenize and sent_tokenize

from nltk.tokenize import word_tokenize, sent_tokenize
import string

# Function to remove punctuation from word tokens.
# Takes a list of tokens as input.
# Note: this function relies on the string module imported above.
def remove_punct(tokens):
    # punct is a string with all punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    punct = string.punctuation
    # empty list in which we put the clean tokens
    tokens_clean = []

    # Iterate over all tokens and keep a token only if it does not occur in punct.
    # Because punct is a string, this is a substring test: single punctuation
    # characters are removed, but multi-character tokens such as '' are kept.
    for t in tokens:
        if t not in punct:
            tokens_clean.append(t)
    # The result is a list with the cleaned tokens
    return tokens_clean

# The Leipzig corpus is already split into sentences, so we do not need to do that ourselves.
# We can read the file line by line, but we need to skip the first token
# of each line, which is the identifier and not regular text.

# Takes as input parameter the path to a file
def preprocess_leipzig_sentences(file):
    clean_sentences = []

    # We assume the file is UTF-8 encoded; stating this avoids platform-dependent defaults
    with open(file, "r", encoding="utf-8") as infile:
        for sentence in infile:
            # We downcase each sentence and word_tokenize it with NLTK
            tokens = word_tokenize(sentence.lower())
            # We skip the first token (the identifier) and apply our custom remove_punct function
            tokens_clean = remove_punct(tokens[1:])
            # We add the clean tokens as a list to the list of sentences
            clean_sentences.append(tokens_clean)

    # The result is a list of lists, each representing the tokens of a sentence
    return clean_sentences

# If you want to process text other than the Leipzig corpus that is not split into sentences,
# you can call the next function. The differences are:
# - we read the complete file as a text string
# - we apply the NLTK sent_tokenize function to get a list of sentences
# - we do not need to remove the identifier
def preprocess_rawtext(file):
    clean_sentences = []

    with open(file) as infile:
        text = infile.read()
        
    sentences = sent_tokenize(text.strip())

    for sentence in sentences:
        tokens = word_tokenize(sentence.lower())
        tokens_clean = remove_punct(tokens)
        clean_sentences.append(tokens_clean)
    return clean_sentences


We now apply the above custom function to the Leipzig text corpus file with the sentences.

You need to adapt the path_to_the_corpus_file to the correct location of the file on your computer.
If the path is wrong you get an error message!

It takes a while before the whole file is processed.
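
Because processing takes a while, it can help to verify the path before starting. The following is a minimal sketch using the standard-library `pathlib` module; the helper name `check_corpus_path` is an assumption for illustration only.

```python
from pathlib import Path


def check_corpus_path(path):
    """Return the path as a Path object, or raise a readable error if no file is there."""
    corpus_path = Path(path)
    if not corpus_path.is_file():
        raise FileNotFoundError(f"No corpus file found at: {corpus_path}")
    return corpus_path


# Hypothetical usage, before calling the preprocessing function:
# check_corpus_path(path_to_the_corpus_file)
```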

In [4]:
#eng_news_2005_1M-sentences.txt
path_to_the_corpus_file='/Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/eng_news_2005_1M-text/eng_news_2005_1M-sentences.txt'
text_leipzigcorpus_clean = preprocess_leipzig_sentences(path_to_the_corpus_file)

We can inspect text_leipzigcorpus_clean by asking for its length and printing a small sample, in this case the seven sentences at index positions 201 up to and including 207.

In [5]:
print('Number of sentences=',len(text_leipzigcorpus_clean))
# we print a few sentences to see what they look like
print(text_leipzigcorpus_clean[201:208])

Number of sentences= 1000000
[['his', 'forehead', 'is', 'fractured', 'in', 'several', 'places', 'and', 'his', 'brain', 'and', 'one', 'of', 'his', 'lungs', 'are', 'bruised', 'she', 'said'], ["''", 'their', 'reputation', 'is', 'totally', 'vindicated', "''", 'loevy', 'said'], ['he', 'also', 'was', 'administratively', 'charged', 'with', 'breaking', 'state', 'law', 'lying', 'and', 'failing', 'to', 'report', 'information', 'to', 'the', 'department', 'in', 'the', 'jude', 'beating'], ["''", 'the', 'mta', 'were', 'directed', 'to', 'make', 'certain', 'amendments', 'to', 'their', 'constitution', 'to', 'ensure', 'clubs', 'are', 'directly', 'affiliated', 'to', 'the', 'national', 'body', 'with', 'voting', 'rights', "''", 'said', 'elyas'], ['both', 'last', 'raced', 'in', 'the', 'florida', 'derby', 'on', 'april', '2'], ['they', 'were', 'fifth', 'last', 'year', 'in', 'prague', 'fourth', 'in', '2003', 'at', 'helsinki', 'and', 'fifth', 'in', '2002', 'at', 'goteborg', 'sweden'], ['nicklaus', 'said', 'fare
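
The sample still contains tokens such as `''`. This happens because `remove_punct` tests whether a token occurs inside the punctuation string, and a sequence of two quote characters does not. If you want to drop such tokens as well, one possible variant is sketched below. The name `remove_punct_strict` is hypothetical, and the results shown in this notebook were produced with the original `remove_punct`, not with this variant.

```python
import string

# A set makes the per-character membership test explicit and fast
PUNCT_CHARS = set(string.punctuation)


def remove_punct_strict(tokens):
    """Drop every token that consists only of punctuation characters."""
    return [t for t in tokens if not all(ch in PUNCT_CHARS for ch in t)]


tokens = ["''", "their", "reputation", "is", "vindicated", "''", "loevy", "said", "."]
print(remove_punct_strict(tokens))
# ['their', 'reputation', 'is', 'vindicated', 'loevy', 'said']
```

Tokens that mix letters and punctuation, such as `n't`, are kept because not all of their characters are punctuation.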

## Training word embeddings

To train a language model with word embeddings, we will use the **gensim** package. 

In order to build the word embeddings through gensim, we are going to use its **Word2Vec** class.

Gensim allows us to set a number of parameters for training. The most important of these are `min_count`, `window`, `vector_size` and `sg`:

* `min_count` is the minimum frequency of the words in our corpus. For infrequent words, we just don't have enough information to train reliable word embeddings. It therefore makes sense to set this minimum frequency to at least 10. In these experiments, we'll set it to 100 to limit the size of our model even more and to speed things up.
* `window` is the number of words to the left and to the right that make up the context that word2vec will take into account to make predictions.
* `vector_size` (called `size` in Gensim versions before 4.0) is the dimensionality of the word vectors. This is generally between 100 and 500. You often have to make a trade-off: embeddings with a higher dimensionality are able to model more information, but also need more data to train.
* `sg`: there are two algorithms to train word2vec: skip-gram and CBOW. Skip-gram tries to predict the context on the basis of the target word; CBOW tries to find the target on the basis of the context. By default, Gensim uses CBOW (```sg=0```). To use skip-gram, set ```sg=1```.
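
To get a feeling for what `window` means, the sketch below lists the (target, context) pairs that a window of a given size produces for one sentence. This is a simplified illustration in plain Python, not Gensim's actual training code; the function name `context_pairs` is made up, and real implementations add refinements such as subsampling of frequent words.

```python
def context_pairs(tokens, window=2):
    """Return all (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["both", "last", "raced", "in", "the", "florida", "derby"]
for target, context in context_pairs(sentence, window=1)[:4]:
    print(target, "->", context)
# both -> last
# last -> both
# last -> raced
# raced -> last
```

Skip-gram learns to predict the context word from the target word in each pair, whereas CBOW predicts the target from all of its context words together.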

We'll investigate the impact of some of these parameters later.

The next command creates an embedding model from our cleaned corpus. The model is assigned to the variable 'englishleipzig_w2v' (any name will do) and can be used in the rest of this notebook. We also save the embedding model to disk as a text file (txt) and as binary data (bin), so that we can load it later and do not need to build the model over and over again in each notebook.

In [6]:
# You need to run the next commands only once. When you have successfully created and saved the embeddings, you can load them afterwards
from gensim.models import Word2Vec
englishleipzig_w2v = Word2Vec(text_leipzigcorpus_clean, vector_size=100, window=4, min_count=100)

Please note that every time we train a model, even with the same data, the resulting embeddings can be slightly different. This is because training involves randomness: the weights of the neural network are initialized randomly and, when several worker threads are used, the order in which training examples are processed varies between runs. The details go beyond what you will learn in this lab, but keep in mind that when you run this notebook your results may differ in the details, while the general trends should hold. For example, the similarity score between 'king' and 'queen' might not be exactly the same, but the most similar words for 'king' will be mostly the same and in roughly the same ranked order.

After the model is built, you can save it to disk for future usage. This may be handy if for some reason the notebook is killed or gets stuck.

We use the function **wv.save_word2vec_format** to save the model. Make sure the folder in the path you specify exists.

We can save the model as a text file or as a binary file. The binary file loads faster, but you could have problems porting it between machines with different operating systems. The text file can be opened in a text editor and inspected!

In [7]:
englishleipzig_w2v.wv.save_word2vec_format('/Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/models/eng_news_2005_1M-sentences.txt')
englishleipzig_w2v.wv.save_word2vec_format('/Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/models/eng_news_2005_1M-sentences.bin', binary=True)

If you have a powerful plain text editor, you can open the txt file and inspect it. You can also use the command line and type the following command:

In [8]:
%cat /Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/models/eng_news_2005_1M-sentences.txt | more

12336 100
the 1.4171938 0.19886859 0.15757857 1.0682825 -1.2170258 0.47073677 0.5311779 -1.1778394 -0.19688056 0.4274103 -0.038369786 0.35101968 1.1600295 0.058567423 0.15782231 0.09109298 1.3882715 1.55663 -0.058293235 0.5576438 0.2610913 0.2629073 0.694968 0.5450368 0.8548721 -0.8205738 -1.9545425 -0.1743459 -0.7750079 0.28192964 0.9474471 0.2267074 -0.15550677 -0.05284116 1.006297 -0.60006493 0.71066123 1.2553871 -0.57392114 -0.26796442 -1.9194419 0.40045175 -1.148668 -0.4620119 -0.5133186 2.1878824 0.42352217 0.4531833 -0.7956174 -0.5827346 -1.5622813 1.2775507 -1.2018695 -0.20327899 0.64684075 -0.21247993 0.22135949 -0.09313451 0.6530807 0.84720755 0.064873494 -0.2041589 -1.3471535 0.46274233 1.8550614 -0.2309431 -0.37530425 -0.81255424 0.42157054 0.50462186 0.5017427 0.9819089 -0.5339767 -0.25220793 -0.14769663 0.38650087 0.08987247 0.5337634 0.932526 -0.4588827 0.76393574 -1.5314038 -0.7657424 1.5264592 -0.3588032 -0.18507506 0.28125474 -1.6985829 0.10605182 0.7804757 -0.905671 

Note that you need to stop the previous cell manually in this notebook, because the "more" command only shows the beginning of the file and then waits for Enter to continue or Ctrl-C to cancel. You stop the cell by clicking on the square next to the play symbol in the menu of the notebook.

The first line has two numbers: the first is the size of the vocabulary and the second is the number of dimensions or the length of the vectors. Both depend on the parameters you used to build the model. The file contains a line for each word with its embedding representation. Depending on the parameters used, you may see the embeddings for the very frequent words "the" and "to" on the first lines.
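
Since the text format is this simple, you can also read it with a few lines of plain Python. The sketch below parses a tiny made-up example with two words and three dimensions; the function name `read_word2vec_text` and the toy values are assumptions for illustration. In practice you would let Gensim load the file.

```python
import io


def read_word2vec_text(handle):
    """Parse the word2vec text format: a header line, then one word per line."""
    header = handle.readline().split()
    vocab_size, dimensions = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        # the first field is the word, the remaining fields are the vector values
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vocab_size, dimensions, vectors


# A toy file with 2 words and 3 dimensions (values are made up)
toy_file = io.StringIO("2 3\nthe 1.4 0.2 0.1\nto 0.5 -0.3 0.9\n")
vocab_size, dimensions, vectors = read_word2vec_text(toy_file)
print(vocab_size, dimensions)
print(vectors["to"])
```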

## Using word embeddings

Now that we have saved our model to disk, we can load it at any time and use it. The next time you launch this notebook, you do not need to collect and preprocess the corpus and build a model from it. You can load the model directly from the location where you saved it. That's what we are going to do now.

In [7]:
# How to load a stored model:
from gensim.models import KeyedVectors

# You can load it either as text or as binary data.
# The latter is more efficient but you may not be able to port it from machine to machine.
englishleipzig_w2v = KeyedVectors.load_word2vec_format('/Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/models/eng_news_2005_1M-sentences.txt') 

Notice that loading is much faster than building! Let's check some of the properties of the englishleipzig_w2v model: 

In [8]:
# Show some properties of our model. Notice these are also in the text file.
print('Vector size =', englishleipzig_w2v.vector_size)
print('Vocabulary size =', len(englishleipzig_w2v.key_to_index))

Vector size = 100
Vocabulary size = 12336


We limited the vector size to 100 dimensions. The vocabulary (12,336 words) is much smaller than the Wikipedia vocabulary and even smaller than WordNet. We can now use any word from the vocabulary as a key to obtain its vector:

In [9]:
king_vector = englishleipzig_w2v["king"]
print(len(king_vector))
print(king_vector)

100
[ 1.5469513   1.6860337   0.12301197  0.58989465  0.2739735  -0.4105389
  0.41100827  0.37610853  0.03646443 -0.38568762 -0.48197266 -0.88005334
  0.4279704   0.03546716  1.5243983   0.18495554  0.41031778 -0.6879221
  1.2044296  -0.25107986  1.1485189   0.5648805   1.8861083  -0.4446929
  0.1834296  -1.1384511  -0.30667162 -0.006758    0.12008601  0.71690995
  0.18232372 -0.30673692  0.9085713  -0.5810559   1.2307783  -0.60112196
 -0.5376805   1.4571023   0.30190334 -1.5316579   1.1415538  -0.6962501
  0.42538154 -0.62837195  0.5986211  -1.4499973   0.46742278 -1.3125943
 -0.29398197 -0.5775949  -0.6936858  -0.21864018 -1.8093103  -1.8649422
 -0.76891977 -0.62112063  1.9832606  -0.579999   -0.6700343   0.32070187
 -2.226239   -1.0541159   1.1721869   0.02004264  0.705233    0.28031504
  0.01082416  1.5786729  -0.04534668 -0.12843108 -0.7548811   0.12520216
  1.1254116   0.5978487   0.6776862   0.7284394   0.47489783  0.39948747
 -0.87342185 -0.25798327  0.36016706 -0.25010848 -0.8

We see we get a dense vector with values for all 100 dimensions.

We can also easily find the similarity between two words. As expected, the figures below show that *king* is closer to *queen* than to *coffee*.

In [10]:
print(englishleipzig_w2v.similarity("king", "queen"))
print(englishleipzig_w2v.similarity("king", "coffee"))

0.6832746
0.08221461
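
The similarity score that Gensim reports is the cosine similarity between the two vectors: their dot product divided by the product of their lengths. A minimal sketch with made-up three-dimensional vectors (the real vectors have 100 dimensions, and the values below are purely illustrative):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example
king = [1.0, 2.0, 0.5]
queen = [0.9, 1.8, 0.7]
coffee = [-1.0, 0.2, 2.0]

print(round(cosine_similarity(king, queen), 3))
print(round(cosine_similarity(king, coffee), 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.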


In a similar vein, we can find the words that are most similar to a target word. The words with the most similar embedding to *king* are mostly titles (such as *prince* and *queen*) or personal names, many of which are semantically related to royalty.

In [11]:
englishleipzig_w2v.similar_by_word("king", topn=10)

[('prince', 0.7564889788627625),
 ('queen', 0.6832746863365173),
 ('alexander', 0.6774353981018066),
 ('rainier', 0.6738333702087402),
 ('buchanan', 0.6554375886917114),
 ('martin', 0.6489624381065369),
 ('henry', 0.6395936012268066),
 ('elizabeth', 0.6394978165626526),
 ('bishop', 0.6369951367378235),
 ('diana', 0.6360077261924744)]
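
Conceptually, such a ranking compares the target vector with every other vector in the vocabulary and sorts the results by cosine similarity. The sketch below does this for a toy vocabulary of four words; the vectors and the function name `most_similar` are invented for illustration, and Gensim's own implementation is far more efficient.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def most_similar(word, vectors, topn=3):
    """Rank all other words by their cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vocabulary with made-up 3-dimensional vectors
toy_vectors = {
    "king": [1.0, 2.0, 0.5],
    "queen": [0.9, 1.8, 0.7],
    "prince": [1.2, 1.5, 0.1],
    "coffee": [-1.0, 0.2, 2.0],
}

for neighbour, score in most_similar("king", toy_vectors, topn=2):
    print(neighbour, round(score, 3))
```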

Note that this model was trained on the Leipzig news corpus for English, which is not that big! Companies such as Google, Amazon and Facebook train their models on orders of magnitude more data. Much bigger corpora in many languages can be found at: https://commoncrawl.org

Note that training a model on such data sets also requires a powerful computing infrastructure.

# End of notebook