<a href="https://colab.research.google.com/github/dgromann/cl_intro/blob/main/tutorials/Tutorial3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 3: Introduction to Computational Linguistics

This is the second tutorial with practical exercises for the lecture Introduction to Computational Linguistics in the winter semester 2023. Hands-on exercises are marked with 👋 ⚒ and questions are marked with ❓. Remember to first **store this notebook** in your Drive or GitHub.

---

## **Lesson 3: Word Embeddings**

A vector representation of words trained with a neural network is called a word embedding. The most popular method for training embeddings is called word2vec, which is an unsupervised method of training embeddings from large natural language corpora.


`word2vec` literature:
  - Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). [Efficient estimation of word representations in vector space](https://arxiv.org/abs/1301.3781). *CoRR*, abs/1301.3781.
  - Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). [Distributed representations of words and phrases and their compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). *Advances in Neural Information Processing Systems*.


- Other variants of embedding training:
  - `fasttext` from Facebook
  - `GloVe` from Stanford NLP Group
- There are many ways to train word embeddings:
  - `gensim`: the simplest and most straightforward implementation of `word2vec`
  - Training based on deep learning packages (e.g., `keras`, `tensorflow`)
  - `spacy` (comes with pre-trained embedding models based on GloVe)
- See Sarkar (2019), Chapter 4, for more comprehensive reviews.

Assume our training corpus contains 10,000 unique words. The input vector to the skip-gram model then has 10,000 dimensions, one for each word in the vocabulary.

In the input vector, all dimensions are zero except one, which marks the position of the input word in the vocabulary, e.g. "ant" in the example below. This is why these vectors are called one-hot encodings.
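
The one-hot idea can be sketched in a few lines of `numpy`. The four-word vocabulary and the helper `one_hot` are made up for illustration and are not part of the tutorial data; a real vocabulary would have thousands of entries.

```python
import numpy as np

# Toy vocabulary (hypothetical); a real corpus would have e.g. 10,000 entries
vocab = ["ability", "ant", "apple", "zebra"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("ant"))  # [0. 1. 0. 0.]
```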


Source of the following image: [Chris McCormick Tutorial](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)




![architecture](http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png)

Multiplying the one-hot encoding with the weight matrix of the hidden layer selects exactly one row of that matrix. The matrix therefore serves as a lookup table for embeddings: each row is the embedding of one vocabulary word.

![matrix_mult](http://mccormickml.com/assets/word2vec/matrix_mult_w_one_hot.png)
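
A minimal sketch of this lookup behaviour, assuming a tiny random weight matrix (the sizes and the name `hidden_weights` are chosen for illustration only): multiplying a one-hot vector with the matrix gives the same result as indexing the corresponding row directly.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3  # toy sizes, chosen for illustration

# Hypothetical weight matrix of the hidden layer: one row per vocabulary word
rng = np.random.default_rng(0)
hidden_weights = rng.normal(size=(vocab_size, embedding_dim))

# One-hot vector for the word at vocabulary index 3
one_hot_vector = np.zeros(vocab_size)
one_hot_vector[3] = 1.0

# The product selects exactly row 3 of the matrix
embedding = one_hot_vector @ hidden_weights
print(np.allclose(embedding, hidden_weights[3]))  # True
```

In practice, libraries tend to skip the multiplication and index the row directly, which is cheaper for large vocabularies.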

The output layer is a softmax regression classifier, which maps the raw output scores to values between 0 and 1 that sum to 1. The output layer thus gives, for each word in the vocabulary, the probability that it occurs in the context of the input word, e.g. how likely "ability" is to occur near "ant".
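
As a rough illustration of what softmax does, the sketch below turns a handful of made-up raw scores into probabilities. The function `softmax` and the example scores are assumptions for demonstration, not values from a trained model.

```python
import numpy as np

def softmax(scores):
    """Map raw scores to positive values that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Made-up raw scores for four context words
scores = np.array([2.0, 1.0, 0.1, -1.0])
probabilities = softmax(scores)

print(probabilities)        # every value lies between 0 and 1
print(probabilities.sum())  # the values sum to 1 (up to rounding)
```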

Since learning embeddings based on all context words anywhere near a word in a text is inefficient, a window size is selected to limit the number of context words that are considered during training:

![word2vec_skipgrams](https://tensorflow.org/text/tutorials/images/word2vec_skipgram.png)
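
The windowing step can be sketched as follows. The helper `skipgram_pairs` is a hypothetical illustration of how (target, context) training pairs could be collected and is not the implementation used by `gensim` or `word2vec`.

```python
def skipgram_pairs(tokens, window_size=2):
    """Collect (target, context) pairs within window_size words of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the wide road shimmered in the hot sun".split()
for pair in skipgram_pairs(sentence, window_size=2)[:5]:
    print(pair)
```

A larger window yields more pairs per target word, which tends to capture broader topical similarity at the cost of more training examples.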

### Using Pre-Trained Embeddings

The code below exemplifies how to load a trained embedding model in the gensim library.

In [None]:
# Let's first load a small subset of word2vec embeddings that have been trained on a
# large corpus of news documents
!wget https://github.com/dgromann/SemanticComputing/raw/master/tutorial6/word2vec_embeddings.bin
!wget https://raw.githubusercontent.com/dgromann/cl_intro/master/tutorials/Tutorial3.txt
!pip3 install gensim

In [None]:
import gensim
import numpy as np

from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

# Let's load the model
# The file name must match the file downloaded with wget above
model = gensim.models.KeyedVectors.load_word2vec_format("word2vec_embeddings.bin", binary=True)

In [None]:
# Print the length of the whole vocabulary
print("Length of the vocabulary:", len(model.key_to_index))

# Print the embedding of a specific word
print("Embedding for the word good: ", model["good"])

👋 ⚒ How many dimensions (numbers) does each vector in this trained embedding model have? Try to find this out with code, not by counting.

In [None]:
#Your code goes here

Let's use embeddings to evaluate how similar two words are.
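
Similarity between embeddings is commonly measured with cosine similarity, which is also what `most_similar` in `gensim` reports. A minimal sketch, assuming three invented 3-dimensional toy vectors rather than real embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented toy vectors, not taken from the trained model
good = np.array([0.9, 0.1, 0.3])
great = np.array([0.8, 0.2, 0.4])
table = np.array([-0.2, 0.9, -0.5])

print(cosine_similarity(good, great))  # close to 1: similar direction
print(cosine_similarity(good, table))  # much lower: dissimilar direction
```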

👋 ⚒ How can we get the single most similar word to "good" from the list of the top 5 most similar words?

In [None]:
print(model.most_similar('good'))

# Get the top 5 most similar words of "good"
most_similar = model.most_similar("good", topn=5)
print(most_similar)

# Your code to get the first word from that list of top 5


We can also use these embeddings to obtain similar words in other pairs with the analogy task a is to b as c is to d, e.g. *man is to woman as king is to ?*

In [None]:
# Check whether our embeddings are good at the analogy task
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

[('queen', 0.7118193507194519)]


👋 ⚒ Complete the function `analogy(a, b, c)` so that it returns the word d that best completes the analogy a is to b as c is to d, e.g. *France is to Paris as Austria is to ?*

In [None]:
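def analogy(a, b, c):
  # Your code goes here: return the word d such that a is to b as c is to d
  pass  # placeholder so the cell runs; replace with your solution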


print(analogy("France", "Paris", "Austria"))
print(analogy("good", "best", "bad"))

We can use matplotlib to visualize the proximity of words in vector space.

In [None]:
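def display_pca_scatterplot(model, words):
    # Stack the embeddings of the given words into one matrix (one row per word)
    word_vectors = np.array([model[w] for w in words])

    # Project the embeddings with PCA and keep the first two principal components
    twodim = PCA().fit_transform(word_vectors)[:,:2]

    # Plot each word as a point and label it with the word itself
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)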

In [None]:
display_pca_scatterplot(model,
                        ['coffee', 'tea', 'beer', 'wine', 'water',
                         'hamburger', 'pizza',  'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'lizard',
                         'France', 'Germany', 'Hungary',
                         'school', 'college', 'university', 'institute'])

👋 ⚒ Write a function that runs through the text file Tutorial3.txt and tests your analogy function on each line apart from the header.

In [None]:
# Use a separate name for the file handle so it does not overwrite the analogy() function
analogy_file = open("Tutorial3.txt", "r")

# Your code here