# LAB2.3 Creating embeddings from text in some language


Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we are going to show how you can create word embeddings from a text collection for a specific language:

<ol>
<li>Obtain a text corpus from the web. We will use the Leipzig Corpora Collection that contains texts in many languages and was already preprocessed.
<li>Tokenize the text to get the individual words in sentences as a list. We use the NLTK toolkit and a specific tokenization function to do that.
<li>Create an embedding model from the tokenized text using the Gensim package
<li>Demonstrate how to use the embedding model
<li>Show how the word embedding space can be visualised
</ol>

https://radimrehurek.com/gensim/auto_examples/index.html#core-tutorials-new-users-start-here

## 1. Obtaining text from the Leipzig Corpora collection

The Leipzig corpora collection has corpora for over 250 languages. These corpora are collected from Wikipedia, news and web crawls. Download a corpus in a language from:

http://wortschatz.uni-leipzig.de/en/download/

We will use the 'eng_news_2005_1M-text' corpus in this notebook. 

Unpack the compressed file somewhere on your computer. You will see it contains a number of files in a folder that have been created by the Leipzig NLP group from the sources. For example, the files "...-sources.txt" contain the list of URLs from which the text was obtained preceded by an identifier and followed by the date of crawling:

```
1	http://davesipaq.com/articles/iPAQ_Plustek_portable_scanning_solution.html	2005-06-12
2	http://www.independent.com/cover/Cover959.htm	2005-04-08
3	http://www.insidebayarea.com/ci_2736737?rss	2005-05-15
4	http://www.dailycollegian.com/vnews/display.v/ART/2005/05/13/4282dbfadd830	2005-05-12
5	http://p2pnet.net/story/4856	2005-05-16
6	http://www.imf.org/external/np/tr/2005/tr050324a.htm	2005-04-09
```

The "...-words.txt" file contains the vocabulary of words with their frequency, e.g.:

```
452	law	5521
453	making	5514
454	record	5511
455	whether	5496
456	times	5488
457	St.	5485
458	scored	5484
459	taken	5484
```

We are going to use the file named "...-sentences.txt", which contains a sentence on each line preceded by an identifier, e.g.:

```
1	I didn't know it was police housing," officers quoted Tsuchida as saying.
2	You would be a great client for Southern Indiana Homeownership's credit counseling but you are saying to yourself "Oh, we can pay that off."
3	He believes the 21st century will be the "century of biology" just as the 20th century was the century of IT.
4	They even call the civil rights organization a bit hypocritical.
```

Our goal is to use these sentences to create word embeddings. To be able to do that we need to process this file line by line, obtain the tokens from each sentence and separate punctuation from each token. We are going to do this with the NLTK toolkit and define a specific function called 'preprocess_leipzig_sentences' that does all the work for us.
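To make the line format concrete, here is a minimal sketch in plain Python. It assumes that the identifier and the sentence are separated by a tab, as in the sample above, and it uses a naive whitespace split only to show why a real tokenizer is needed:

```python
# One line from a "...-sentences.txt" file: identifier, tab, sentence
line = "4\tThey even call the civil rights organization a bit hypocritical.\n"

# Assumption: the identifier and the sentence are separated by the first tab
identifier, sentence = line.rstrip("\n").split("\t", 1)

# A naive whitespace split leaves the final full stop glued to the last word
naive_tokens = sentence.lower().split()

print(identifier)
print(naive_tokens)
```

The last token still carries its punctuation mark, which is the kind of problem NLTK's word_tokenize takes care of.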

What is a function? A function is an ordered sequence of commands packaged into a group (like a recipe) with a name and possibly parameters between round brackets. So far you have been calling functions for instances of classes such as string or list that have been defined by other programmers. You can however also define functions yourself. This is especially useful if:

<ol>
<li> the code becomes too long and you want to group smaller steps into higher-level steps without bothering about what happens inside: e.g. like playing music instead of pushing piano keys
<li> the code needs to be applied more than once and you do not want to repeat it, while making sure that it is consistent across the repeated calls.
</ol>
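A small illustration that is unrelated to the corpus; the function name `count_long_words` and its parameters are made up for this example:

```python
def count_long_words(words, min_length=5):
    """Return how many words have at least min_length characters."""
    count = 0
    for word in words:
        if len(word) >= min_length:
            count += 1
    return count

# Defining the function does nothing yet; calling it runs the steps inside
words = ["the", "civil", "rights", "organization"]
print(count_long_words(words))
print(count_long_words(words, min_length=7))
```

The same function is called twice with different settings, without repeating the loop.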
    
The function that we define below calls other functions as well that we also need to define, so you can see it is definitely a higher-level function.

Once defined, we only need to apply this function to a local file on our disk to carry out a whole series of instructions and we can easily do this many times for all kinds of files in the same format, e.g. downloaded from the Leipzig website. The function guarantees that the same process is applied each time.

The next cell contains the processing function. After you run the cell in your notebook, the function is available to do the work for you. This means it is defined but it has not been used yet. For that we need to apply it to something. We do that later.

For now, you can try to read and understand the function or just call it when you need it.

## Preprocessing function

In [1]:
# We use the NLTK tokenization function to process the text
# For this we import the modules word_tokenize and sent_tokenize

from nltk.tokenize import word_tokenize, sent_tokenize
import string

# Function to remove punctuation from word tokens.
# Takes a list of tokens as input.

# Note that these functions only work if you have imported NLTK and string before calling them
def remove_punct(tokens):
    # punct is a string with all punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    punct = string.punctuation
    # empty list in which we put the clean tokens
    tokens_clean = []

    # Iterate over all tokens and keep a token only if it does not occur in punct.
    # Note: this is a substring test, so single punctuation marks are dropped,
    # but multi-character tokens such as '' (two quotes in a row) are kept.
    for t in tokens:
        if t not in punct:
            tokens_clean.append(t)
    # The result is a list with the cleaned tokens
    return tokens_clean

# The Leipzig corpus is already processed into sentences, so we do not need to split the text into sentences
# We can read it line by line but 
# we need to skip the first token in each line which is the identifier and not regular text

# Takes as input parameter the path to a file
def preprocess_leipzig_sentences(file):
    clean_sentences = []

    # Assumption: the corpus file is UTF-8 encoded
    with open(file, "r", encoding="utf-8") as infile:
        for sentence in infile:
            # We downcase each sentence and word_tokenize it with NLTK
            tokens = word_tokenize(sentence.lower())
            # We apply our custom remove_punct function and skip the first token, which is the identifier
            tokens_clean = remove_punct(tokens[1:])
            # We add the clean tokens as a list to the list of sentences
            clean_sentences.append(tokens_clean)

    # The result is a list of lists, each representing the tokens of a sentence as elements
    return clean_sentences

# If you want to process other text than the Leipzig corpus that is not split into sentences,
# you can call the next function. The difference is:
# - we read the complete file as a text string
# - we apply the NLTK sent_tokenize function to get a list of sentences
# - we do not need to remove the identifier
def preprocess_rawtext(file):
    clean_sentences = []

    with open(file, encoding="utf-8") as infile:  # assumption: the file is UTF-8 encoded
        text = infile.read()
        
    sentences = sent_tokenize(text.strip())

    for sentence in sentences:
        tokens = word_tokenize(sentence.lower())
        tokens_clean = remove_punct(tokens)
        clean_sentences.append(tokens_clean)
    return clean_sentences


We now apply the above custom function to the Leipzig text corpus file with the sentences.

You need to adapt the path_to_the_corpus_file to the correct location of the file on your computer.
If the path is wrong you get an error message!

It takes a while before the whole file is processed. Get a coffee or cup of tea!

In [2]:
#eng_news_2005_1M-sentences.txt
path_to_the_corpus_file='/Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/eng_news_2005_1M-text/eng_news_2005_1M-sentences.txt'
text_leipzigcorpus_clean = preprocess_leipzig_sentences(path_to_the_corpus_file)

We can inspect text_leipzigcorpus_clean by asking for its length and printing a small sample, in this case the seven sentences at index positions 201 up to and including 207.

In [3]:
print('Number of sentences=',len(text_leipzigcorpus_clean))
# we print a few sentences to see what they look like
print(text_leipzigcorpus_clean[201:208])

Number of sentences= 1000000
[['his', 'forehead', 'is', 'fractured', 'in', 'several', 'places', 'and', 'his', 'brain', 'and', 'one', 'of', 'his', 'lungs', 'are', 'bruised', 'she', 'said'], ["''", 'their', 'reputation', 'is', 'totally', 'vindicated', "''", 'loevy', 'said'], ['he', 'also', 'was', 'administratively', 'charged', 'with', 'breaking', 'state', 'law', 'lying', 'and', 'failing', 'to', 'report', 'information', 'to', 'the', 'department', 'in', 'the', 'jude', 'beating'], ["''", 'the', 'mta', 'were', 'directed', 'to', 'make', 'certain', 'amendments', 'to', 'their', 'constitution', 'to', 'ensure', 'clubs', 'are', 'directly', 'affiliated', 'to', 'the', 'national', 'body', 'with', 'voting', 'rights', "''", 'said', 'elyas'], ['both', 'last', 'raced', 'in', 'the', 'florida', 'derby', 'on', 'april', '2'], ['they', 'were', 'fifth', 'last', 'year', 'in', 'prague', 'fourth', 'in', '2003', 'at', 'helsinki', 'and', 'fifth', 'in', '2002', 'at', 'goteborg', 'sweden'], ['nicklaus', 'said', 'fare
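The sample above still contains tokens such as `''`. This follows from how `remove_punct` tests membership: `t not in punct` checks whether the token occurs as a substring of the punctuation string, so a token made of two quote characters survives. The sketch below illustrates this and shows one possible stricter test. The name `is_punct_only` is hypothetical, and using such a filter would change the vocabulary and the results shown in the rest of this notebook.

```python
import string

punct = string.punctuation

# Substring test as used in remove_punct
print("," not in punct)    # a single comma occurs in punct, so it is dropped
print("''" not in punct)   # two quotes in a row do not occur in punct, so they are kept

def is_punct_only(token):
    """Hypothetical stricter test: True if every character is punctuation."""
    return len(token) > 0 and all(ch in punct for ch in token)

tokens = ["''", "their", "reputation", "''", "loevy", "said", "."]
filtered = [t for t in tokens if not is_punct_only(t)]
print(filtered)
```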

## Training word embeddings

To train a language model with word embeddings, we will use the **gensim** package again. 

In order to build the word embeddings through gensim, we are going to use the Word2Vec class that is included in gensim. Word2Vec is the method developed at Google that provided a breakthrough in the performance of embeddings (Mikolov et al. 2013). Check its citations in Google Scholar!

    Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 
    "Distributed representations of words and phrases and their compositionality." 
    In Advances in neural information processing systems, pp. 3111-3119. 2013.

When we train our embeddings, gensim allows us to set a number of parameters. The most important of these are `min_count`, `window`, `vector_size` and `sg`:

* `min_count` is the minimum frequency that a word must have in our corpus to be included. For infrequent words, we just don't have enough information to train reliable word embeddings. It therefore makes sense to set this minimum frequency to at least 10. In these experiments, we'll set it to 100 to limit the size of our model even more and to speed things up.
* `window` is the number of words to the left and to the right that make up the context that word2vec will take into account to make predictions.
* `vector_size` (called `size` in Gensim versions before 4.0) is the dimensionality of the word vectors. This is generally between 100 and 500. You often have to make a trade-off: embeddings with a higher dimensionality are able to model more information, but also need more data to train.
* `sg`: there are two algorithms to train word2vec: skip-gram and CBOW. Skip-gram tries to predict the context on the basis of the target word; CBOW tries to find the target on the basis of the context. By default, Gensim uses CBOW (`sg=0`).
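To make `min_count` and `window` more concrete, here is a plain-Python sketch on a toy corpus. It only illustrates the idea of vocabulary filtering and context windows; it is not how Gensim implements training, and the helper names `filter_vocabulary` and `context_pairs` are hypothetical.

```python
from collections import Counter

toy_corpus = [
    ["the", "king", "greets", "the", "queen"],
    ["the", "queen", "greets", "the", "prince"],
]

def filter_vocabulary(sentences, min_count):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, freq in counts.items() if freq >= min_count}

def context_pairs(sentence, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

vocabulary = filter_vocabulary(toy_corpus, min_count=2)
print(sorted(vocabulary))
print(context_pairs(["king", "greets", "queen"], window=1))
```

With `min_count=2` the words that occur only once ("king" and "prince") drop out of the vocabulary, and with `window=1` each target word is paired only with its direct neighbours.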

We'll investigate the impact of some of these parameters later.

The next command creates an embedding model from our cleaned corpus. The model is assigned to the variable 'englishleipzig_w2v' (any name will do) and can be used in the rest of this notebook. Afterwards, we also save the embedding model to disk as a text file (txt) and as binary data (bin) so that we can load it later and do not need to build the model over and over again.

In [4]:
# You need to run the next commands only once. When you have successfully created and saved the embeddings you can load them afterwards
from gensim.models import Word2Vec
englishleipzig_w2v = Word2Vec(text_leipzigcorpus_clean, vector_size=100, window=4, min_count=100)

The resulting model has a lot of different functions, explained in the documentation of gensim. Please note that every time we train a model, even with the same data, the resulting embeddings will be slightly different. This is because training involves randomness, for example in the initialisation of the weights and in the order in which parallel worker threads process the data. The details of this go beyond what you will learn in this lab, but keep in mind that when you run this notebook your results might be different in the details, but the general trends should hold. For example, the similarity scores between 'woman' and 'king' might not be exactly the same, but the most similar words for 'king' will be mostly the same.

After the model is built, you can save it to disk for future usage. This may be handy if for some reason the notebook is killed or gets stuck and you do not like drinking too much tea.

We use the function *wv.save_word2vec_format* to save the model. I am storing it in a subfolder "models". Make sure the folder exists in the path you specify.

We can save the model as a text file or as a binary file. The binary file loads faster but you could have problems porting it between machines with a different OS. The text file can be opened in a text editor and inspected!

In [5]:
englishleipzig_w2v.wv.save_word2vec_format('/Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/models/eng_news_2005_1M-sentences.txt')
englishleipzig_w2v.wv.save_word2vec_format('/Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/models/eng_news_2005_1M-sentences.bin', binary=True)

If you have a powerful plain text editor, you can open the txt file and inspect it. You can also use the command line and type the following command:

In [6]:
%cat /Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/models/eng_news_2005_1M-sentences.txt | more

12339 100
the 1.5022856 0.098328725 0.5024047 1.5309852 -0.09105321 0.25754008 1.0062594 -1.8975921 -0.2943641 1.6874166 -0.5590369 0.066697024 0.65090406 -0.5807375 0.51481295 1.2120371 0.74987215 0.45058954 0.47719938 1.4547876 1.0712636 0.93635213 0.1859579 -0.14195006 1.4503254 -1.1170188 -1.8843431 0.6140086 -0.95290875 0.1610322 0.6690898 0.5791449 -0.6998886 0.377161 -0.0649989 -0.43829533 0.79310054 0.22689182 0.18690485 1.2593929 -2.032061 -0.037656322 -0.5216925 -0.034906916 0.403945 0.82607985 -0.36427703 0.17960487 0.8871468 -0.57342553 -1.0268893 0.0016339723 -1.237817 -0.045479223 -0.5588587 -0.35184065 0.45267746 -0.6730506 0.24810201 1.0806748 0.049607046 0.16131048 -0.70704466 -0.4510508 0.45796368 0.5533735 0.3766744 -1.2691549 0.27146065 0.24675971 -0.4950922 0.9270032 -0.6861244 0.31925648 0.123968504 0.5584019 0.8135406 1.004157 0.38755992 0.4915214 -1.0176872 -1.8164865 0.6402584 1.8562845 -0.8639606 1.0418053 -0.5464921 -0.8213556 -0.6610205 0.9182424 -0.6670324 

Note that you need to stop the previous cell manually in this notebook because the "more" command only shows the beginning of the file and waits for an enter to continue or ctrl-c for cancel. You stop the cell by clicking on the square next to the play symbol in the menu of the notebook.

The first line has two numbers: the first is the size of the vocabulary and the second is the number of dimensions or the length of the vectors. Both depend on the parameters you used to build the model. The file contains a line for each word with its embedding representation. Depending on the parameters used, you may see the embeddings for the very frequent words "the" and "to" as the first lines.
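As a rough sketch of this layout, the snippet below reads a tiny made-up file in the same text format using plain Python. The words and numbers are invented for illustration; in practice you would load the real file with Gensim as shown in the next section.

```python
import io

# Tiny stand-in for the saved text file: a header line, then one word per line
toy_file = io.StringIO(
    "2 3\n"
    "the 1.5 0.1 0.5\n"
    "to -0.2 0.7 0.3\n"
)

# Header: vocabulary size and number of dimensions
vocab_size, dimensions = (int(n) for n in toy_file.readline().split())

# Every following line: the word followed by its vector values
vectors = {}
for line in toy_file:
    word, *values = line.split()
    vectors[word] = [float(v) for v in values]

print(vocab_size, dimensions)
print(vectors["the"])
```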

## Using word embeddings

Now that we have saved our model to disk, we can load it any time and use it. The next time you launch this notebook, you do not need to collect and preprocess the corpus and build a model from it. You can load the model directly from the location where you saved it. That's what we are going to do now.

In [7]:
# How to load a stored model:
from gensim.models import KeyedVectors

# You can load it either as text or as binary data. 
#The latter is more efficient but you may not be able to port it from machine to machine.
englishleipzig_w2v = KeyedVectors.load_word2vec_format('/Users/piek/Desktop/t-ONDERWIJS/data/leipzig-corpora/models/eng_news_2005_1M-sentences.txt') 

Notice that loading is much faster than building! Let's check some of the properties of the englishleipzig_w2v model: 

In [8]:
# Show some properties of our model. Notice these are also in the text file.
print('Vector size =', englishleipzig_w2v.vector_size)
print('Vocabulary size =', len(englishleipzig_w2v.key_to_index))

Vector size = 100
Vocabulary size = 12339


We have limited the number of dimensions to 100, which is the vector size, and the vocabulary is much smaller than the Wikipedia vocabulary and even smaller than WordNet. We can now use any word from the vocabulary as a key to obtain the vector:

In [9]:
king_vector = englishleipzig_w2v["king"]
print(len(king_vector))
print(king_vector)

100
[ 0.6905258   1.5594054   0.5748212   1.5657576  -0.1534916  -0.40360108
  0.3100164   1.0694386   1.2010936   0.00503622  0.77189213 -0.99546665
 -0.3815703  -0.09455451  0.42443958 -0.83974576  0.04821923 -0.44235227
 -0.3570004  -2.0702674   2.3530064   0.53084767  1.750268   -0.9308972
 -0.01893806 -1.2629433  -0.36071002 -0.84374225  1.1177106  -0.11203402
  0.62249374 -0.10188994  0.21202749 -0.21265593  1.1802895  -0.23113143
 -0.24272048  0.3584279   0.3905339  -0.84558505  1.1705464  -1.819815
 -0.14620385 -0.7416933   0.66355807 -0.4930548  -0.30910316 -0.8348972
 -1.3058763   0.40529746 -0.44586256  0.0328793  -1.0537902  -0.6687268
 -0.2753386  -1.774672    0.67011094  0.2345644  -0.6156653  -0.18520372
 -1.2395182   1.0801073   0.40914643  0.33789864 -0.7158897   0.6823043
  0.0744905   0.99532855 -0.62204605  0.32310545 -0.1312581   0.10557448
  1.04885    -0.08423143 -0.05755047  0.6271566  -0.523815    0.6260166
  0.19405621 -1.4696909  -0.9468328  -0.68338794  0.28

We see we get a dense vector with values for all 100 dimensions.

We can also easily find the similarity between two words. Similarity is measured as the cosine between the two word embeddings, and ranges between -1 and +1. The higher the cosine, the more similar two words are. As expected, the figures below show that *king* is closer to *queen* than to *coffee*.

In [10]:
print(englishleipzig_w2v.similarity("king", "queen"))
print(englishleipzig_w2v.similarity("king", "coffee"))

0.69186014
0.09530553
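For reference, the cosine between two vectors is their dot product divided by the product of their lengths. A minimal plain-Python sketch with made-up 2-dimensional vectors (the function name `cosine_similarity` is our own, not a Gensim function):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing in the same direction score +1 regardless of their length, unrelated (orthogonal) vectors score 0, and opposite vectors score -1.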


In a similar vein, we can find the words that are most similar to a target word. The words with the most similar embeddings to *king* include similar titles (such as *prince*, *queen* and *emperor*), but also a number of proper names.

In [11]:
englishleipzig_w2v.similar_by_word("king", topn=10)

[('prince', 0.7474125027656555),
 ('buchanan', 0.7114241719245911),
 ('queen', 0.6918601393699646),
 ('mccartney', 0.6681753993034363),
 ('fraser', 0.6624840497970581),
 ('emperor', 0.6491478681564331),
 ('baldwin', 0.647771954536438),
 ('rainier', 0.6466833353042603),
 ('dawson', 0.6439461708068848),
 ('vincent', 0.6431512832641602)]

Note that this model was trained from the Leipzig news corpus for English, which is not that big! Companies such as Google, Amazon and Facebook train their models on many orders of magnitude more data. Much bigger corpora in many languages can be found at: https://commoncrawl.org

Note that training a model on such data sets also requires a powerful computing infrastructure.

Interestingly, we can look for words that are similar to a set of words and dissimilar to another set of words at the same time. This allows us to look for analogies of the type *king is to man like ... is to woman*.

In [12]:
englishleipzig_w2v.most_similar(positive=['woman', 'king'], negative=["man"], topn=10)

[('queen', 0.726746141910553),
 ('prince', 0.6802133321762085),
 ('anne', 0.6743527054786682),
 ('caroline', 0.6428135633468628),
 ('diana', 0.6412346959114075),
 ('buchanan', 0.6354133486747742),
 ('mary', 0.6349809169769287),
 ('elizabeth', 0.632943868637085),
 ('wong', 0.6183045506477356),
 ('rainier', 0.6174156665802002)]

We see that *queen* now ranks first, scoring slightly higher than before (0.73 versus 0.69), while *prince* scores lower (0.68 versus 0.75). More female names are included, and male names score lower or have disappeared altogether.
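The idea behind such an analogy query can be sketched with vector arithmetic: add the vectors of the positive words, subtract those of the negative words, and rank the remaining words by their cosine with the result. The example below uses made-up 2-dimensional vectors and a hypothetical helper `analogy`; it is a simplification and not Gensim's actual implementation.

```python
import math

# Made-up vectors: first axis roughly "royalty", second axis roughly "gender"
toy_vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.9, 0.8],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The query words themselves are left out of the ranking
    exclude = set(positive) | set(negative)
    scored = [(w, cosine_similarity(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors)
print(ranking[0][0])
```

In this toy space, *king* minus *man* plus *woman* lands exactly on *queen*; with real embeddings the match is only approximate.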

## Words are not word meanings!

What happens with ambiguous words?

If we take an ambiguous word such as **mouse**, we get devices but not similar animal terms such as 'rat'.

In [13]:
englishleipzig_w2v.most_similar(positive=["mouse"], topn=10)

[('keyboard', 0.712760329246521),
 ('printer', 0.6471728682518005),
 ('laptop', 0.6370227932929993),
 ('blade', 0.6339210867881775),
 ('clips', 0.6317967772483826),
 ('monkey', 0.6317648887634277),
 ('robot', 0.631564736366272),
 ('bits', 0.630458414554596),
 ('lens', 0.6261634826660156),
 ('colour', 0.6257264614105225)]

If, however, we subtract the vector for **computer** or **keyboard**, we can try to get more animal-related terms.

In [14]:
englishleipzig_w2v.most_similar(positive=["mouse"], negative=["computer"], topn=10)

[('smell', 0.5303334593772888),
 ('lovely', 0.5213074088096619),
 ('colours', 0.5054668188095093),
 ('dark', 0.5025986433029175),
 ('roses', 0.5024785995483398),
 ('monkey', 0.500421941280365),
 ('fried', 0.4910123646259308),
 ('beats', 0.4807605445384979),
 ('hell', 0.473664790391922),
 ('potter', 0.4730132818222046)]

Not really a good result although *keyboard* is no longer on top. This may be due to the size of the corpus used and the settings that were used to build it.

Finally, we can present the word2vec model with a list of words and ask it to identify the odd one out. It then uses the word embeddings to identify the word that is least similar to the other ones. For example, in the list *car, bike, bus, coffee*, it correctly identifies *coffee* as the odd one out. In the list *coffee, bus, bike, tea, milk*, which contains two vehicles and three drinks, it singles out *bus*.

In [15]:
print(englishleipzig_w2v.doesnt_match("car bike bus coffee".split()))
print(englishleipzig_w2v.doesnt_match("coffee bus bike tea milk".split()))

coffee
bus
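One way to make such a choice is to average the (length-normalised) vectors of all words in the list and pick the word that is least similar to that average. The sketch below does this with made-up 2-dimensional vectors. It rests on that assumption, the helper names `unit` and `odd_one_out` are hypothetical, and it is not Gensim's actual code.

```python
import math

# Made-up vectors: first axis roughly "vehicle", second axis roughly "drink"
toy_vectors = {
    "car":    [1.0, 0.1],
    "bike":   [0.9, 0.2],
    "bus":    [1.0, 0.0],
    "coffee": [0.1, 1.0],
}

def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the mean of all vectors."""
    unit_vectors = [unit(vectors[w]) for w in words]
    dims = len(unit_vectors[0])
    mean = unit([sum(v[d] for v in unit_vectors) / len(unit_vectors) for d in range(dims)])
    # Dot product of unit vectors equals their cosine similarity
    scores = {w: sum(a * b for a, b in zip(v, mean)) for w, v in zip(words, unit_vectors)}
    return min(scores, key=scores.get)

result = odd_one_out("car bike bus coffee".split(), toy_vectors)
print(result)
```

Because the average is pulled towards the majority group, the outcome depends on the composition of the list as well as on the embeddings.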


## Plotting embeddings [Advanced]

Let's now visualize some of our embeddings. To plot embeddings with a dimensionality of 100 or more, we first need to map them to a dimensionality of 2. We do this with the popular [t-SNE](https://lvdmaaten.github.io/tsne/) method. T-SNE, short for t-distributed Stochastic Neighbor Embedding, helps us visualize high-dimensional data by mapping similar data to nearby points and dissimilar data to distant points in the low-dimensional space.

T-SNE is included in [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). If you have not installed scikit-learn you need to do that first. Follow the instructions at: https://scikit-learn.org/stable/install.html to do so or run `conda install scikit-learn`. Note that we will use scikit learn anyway next week.

To run the TSNE package within Scikit-learn, we just have to specify the number of dimensions we'd like to map the data to (`n_components`), and the similarity metric that t-SNE should use to compute the similarity between two data points (`metric`). We're going to map to 2 dimensions and use the cosine as our similarity metric. Additionally, we use PCA as an initialization method to remove some noise and speed up computation. The [Scikit-learn user guide](https://scikit-learn.org/stable/modules/manifold.html#t-sne) contains some additional tips for optimizing performance. 

Plotting all the embeddings in our vector space would result in a very crowded figure where the labels are hardly legible. Therefore we'll focus on a subset of embeddings by selecting the 'n' most similar words to a target word. In the example below, we set the target word to 'moon' and we draw only the graph for the 50 most related words.

In order to visualise a model and its words we need an auxiliary function that is given below. Run the cell below to make this function 'tsne_plot_target_word' available to your notebook. 

You might also need to install matplotlib (a library for plotting) and pandas (a library for data analysis). You can do so in the usual way:

`conda install matplotlib`

`conda install pandas`

In [16]:
# The kernel used here is not a conda environment, so we install with pip instead
%pip install matplotlib pandas

We next define an auxiliary function that, given the embeddings of a selection of similar words, generates the 2-dimensional t-SNE space and saves the plot to a file named after the target_word and the number of neighbours.

In [None]:
#https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne
#https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

def tsne_plot_target_word(embeddings, selected_words, target_word):
    "Creates a TSNE model and plots it"
    labels = []
    tokens = []

    # embeddings[i] is the vector that belongs to selected_words[i]
    for i, word in enumerate(selected_words):
        tokens.append(embeddings[i])
        labels.append(word)

    # Note: recent scikit-learn releases name the n_iter parameter max_iter
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    # Convert the list of vectors to a NumPy array before fitting
    new_values = tsne_model.fit_transform(np.array(tokens))

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.title('Word embedding space:' + target_word)
    plt.savefig(target_word + str(len(selected_words)) + '_englishleipzig_w2v.pdf', dpi=600, transparent=True, bbox_inches='tight')
    plt.show()

In [None]:
%matplotlib inline

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

from sklearn.manifold import TSNE

# We select a set of 50 words most similar to 'moon' and store the result as 'selected_words'
count=50
target_word = "moon"
selected_words = [w[0] for w in englishleipzig_w2v.most_similar(positive=[target_word], topn=count)]
print('The', count, 'most similar words to', target_word, 'in our model:', selected_words)


Next, we obtain the vector representation for the top 50 words and create a list of vector arrays with scores for each dimension.

In [None]:
embeddings = [englishleipzig_w2v[w] for w in selected_words]

Let's inspect the first two elements. It is indeed a list with vector arrays.

In [None]:
print(embeddings[0:2])

Next we call the TSNE function with the parameters mentioned before and we call its fit_transform function on our selected embeddings. This uses the cosine metric to compare the embeddings and maps them to two dimensions.

In [None]:
import numpy as np
# Convert the list of vectors to a NumPy array before fitting
mapped_embeddings = TSNE(n_components=2, metric='cosine', init='pca').fit_transform(np.array(embeddings))

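Now we can apply our custom-made function 'tsne_plot_target_word'. Note that the function runs its own t-SNE mapping internally, so we pass it the original embeddings and not mapped_embeddings: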

In [None]:
tsne_plot_target_word(embeddings, selected_words,target_word)

That looks nice! Note that the target word 'moon' itself is not plotted, because most_similar returns only its neighbours. Inspect the diagram to see which words are close to each other. OK-ish. Perhaps it works better with 300 dimensions!

### Displaying the full graph

To display the whole space we rebuild the word2vec model, limit the vocabulary to words with a frequency of 500 or more and reduce the number of dimensions to 50. This may still take a while to complete. The result may look cluttered, which is why we save it to a PDF file in which you can zoom in.

In [None]:
from gensim.models import Word2Vec
count=500

englishleipzig_w2v = Word2Vec(text_leipzigcorpus_clean, vector_size=50, window=4, min_count=count)

To save the result to disk, we need another customized function:

In [None]:
#https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne
#https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html
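import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

def tsne_plot(model, name):
    "Creates a TSNE model and plots it"
    labels = []
    tokens = []

    # Collect the vector and the label of every word in the vocabulary
    for word in model.wv.index_to_key:
        tokens.append(model.wv[word])
        labels.append(word)

    # Note: recent scikit-learn releases name the n_iter parameter max_iter
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    # Convert the list of vectors to a NumPy array before fitting
    new_values = tsne_model.fit_transform(np.array(tokens))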

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.title('Word embedding space:'+name)
    plt.savefig(name, dpi=600, transparent=True, bbox_inches='tight')
    plt.show()

We can now apply the new customized function 'tsne_plot' to the full model. It took a while on my computer to save it. If you don't manage, you can inspect the PDF that is included in the LAB distribution.

In [None]:
tsne_plot(englishleipzig_w2v, 'moon_englishleipzig_w2v.pdf')

# End of notebook