*ANLP 2020/21; Uni Potsdam; D. Schlangen, B. Aktas*

# Work Sheet for Week 04: Word Vectors

## Notes

Please remember that these exercises are only meant to guide your active learning, and handing them in is only for us to see where additional feedback might be needed.

So, a) make use of your peers as much as possible, for example in learning groups, and by asking questions on Piazza; b) use these exercises to have something to do during the practical on Thursdays; c) prioritise the assignments over the worksheets, and only work on the worksheets (outside of the practical) once you are sufficiently happy with your progress on the assignment.

Also, it is highly unlikely that you will have enough time to work on all of these! (Unless you are already somewhat familiar with the material.) This is just to give you a selection of exercises to work on. Or maybe you will want to come back to some of them in the teaching break.


## Exercises

### [E1] From Frequency to Meaning

Consider the following word-context matrix with the three words “orange”, “banana” and “car” and the three context words “juice”, “the” and “drive”.


|        | juice | the | drive |
|:-------|:-----:|:---:|:-----:|
| orange | 10    | 20  | 0     |
| banana | 8     | 20  | 0     |
| car    | 1     | 20  | 10    |


1. Compute the cosine similarity values for “orange” and “banana” and for “orange” and “car”. How do you interpret the results of this computation?
2. Compute the MLEs using frequencies for the probabilities P(w), P(c) and P(w, c) for each word w and each context word c.
3. Based on these, compute the PPMI values for the cells in the matrix.
4. Now compute the cosine similarity values of the PPMI vectors for “orange” and “banana” and for “orange” and “car”. How do they differ from the values for the frequency-based vectors computed in item (1)?
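
To check your hand calculations, the steps above can be sketched in a few lines of NumPy. This is only a minimal sketch: the names `counts`, `cosine` and `ppmi` are made up for this example, the probabilities are plain relative-frequency estimates over the whole table, and PMI is taken with base-2 logarithms (an assumption; another base only rescales the values).

```python
import numpy as np

# Rows: orange, banana, car; columns: juice, the, drive (counts from the table).
counts = np.array([[10, 20, 0],
                   [8, 20, 0],
                   [1, 20, 10]], dtype=float)

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

def ppmi(counts):
    """Positive PMI from MLE estimates of P(w, c), P(w) and P(c)."""
    p_wc = counts / counts.sum()            # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)   # row marginals P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)   # column marginals P(c)
    with np.errstate(divide="ignore"):      # log2(0) = -inf for zero counts
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)             # clip negative values (and -inf) to 0

ppmi_matrix = ppmi(counts)
print("counts: orange~banana", cosine(counts[0], counts[1]))
print("counts: orange~car   ", cosine(counts[0], counts[2]))
print("PPMI:   orange~banana", cosine(ppmi_matrix[0], ppmi_matrix[1]))
print("PPMI:   orange~car   ", cosine(ppmi_matrix[0], ppmi_matrix[2]))
```

Comparing the two pairs of printed values shows how much of the raw-count similarity is due to the shared high count for “the”.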

---

### [E2] Word Meaning Representations

Can you come up with a manually constructed vector representation for concepts / word meanings? For example, let's assume that the concept "woman" is represented by `human: yes, gender: female`; how could that be understood as a vector? How would you represent "man" in this system? How would you add more concepts (such as "fish", or "refrigerator", or "linguistics") to this system?
Is it the case that in your representation system, cosine similarity tracks semantic closeness? What would the advantages of such a system be compared to learned word vectors? What are the disadvantages?
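
One possible way to get started is sketched below: each feature becomes one dimension, and a concept gets a 1 wherever the feature applies. The feature inventory (`FEATURES`) and the feature assignments are purely illustrative assumptions, not a proposed solution; part of the exercise is to see where such a hand-made inventory breaks down.

```python
import numpy as np

# Hypothetical feature inventory; every feature is one vector dimension.
FEATURES = ["human", "female", "male", "animate", "artifact", "abstract"]

def encode(*active):
    """Binary vector with 1.0 for every listed feature, 0.0 elsewhere."""
    return np.array([1.0 if f in active else 0.0 for f in FEATURES])

concepts = {
    "woman": encode("human", "female", "animate"),
    "man": encode("human", "male", "animate"),
    "fish": encode("animate"),
    "refrigerator": encode("artifact"),
    "linguistics": encode("abstract"),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for other in ["man", "fish", "refrigerator", "linguistics"]:
    print("woman ~", other, round(cosine(concepts["woman"], concepts[other]), 3))
```

Note that every new concept that is not covered by the existing features forces you to add a dimension by hand, which is worth keeping in mind when answering the questions about advantages and disadvantages.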

### [E3] Word Embeddings

Please load the word embeddings trained on the Google News corpus using the [gensim](https://radimrehurek.com/gensim/index.html) package. First, make sure that you have the gensim package installed on your computer. You can find the model [here](https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM). The model has a size of 1.5GB, so make sure you have enough free memory before you load it. It is also possible to use one of the pre-trained GloVe models [here](https://nlp.stanford.edu/projects/glove/).

---

### [E4] Word Embeddings

This exercise lets you evaluate word similarities using word embeddings. Compute the similarity between some of the word pairs in the **wordsim_similarity_goldstandard.txt** dataset, which contains human similarity judgements, and check whether the human judgements correlate with the similarities you get from the word vectors.

You can use the Google News embeddings from [E3] (or one of the GloVe models linked there), loaded with the [gensim](https://radimrehurek.com/gensim/index.html) package.

If you don't have enough resources on your computer, you can also use the online interface [here](http://bionlp-www.utu.fi/wv_demo/).

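```python
import gensim
# Load the pre-trained vectors (binary word2vec format); adjust the path to where you saved the file.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
```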


The gensim package has many easy-to-use [functions](https://radimrehurek.com/gensim/models/keyedvectors.html). Experiment with those functions to find interesting relations in the word embeddings (`most_similar`, `doesnt_match`, `similarity`).
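
For the correlation check against the gold standard, here is a minimal sketch. It assumes that each line of the file holds two words and a score separated by whitespace, and that the similarity function raises `KeyError` for unknown words (as gensim's `similarity` does). The helper names (`read_gold`, `spearman`, `evaluate`) and the toy data are made up for this example; with the real model you would pass `model.similarity` and the lines of the data file instead.

```python
def read_gold(lines):
    """Parse lines of the (assumed) form 'word1 word2 score'."""
    gold = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            gold.append((parts[0], parts[1], float(parts[2])))
    return gold

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def evaluate(gold, similarity):
    """Correlate human scores with model similarities; skip unknown words."""
    human, predicted = [], []
    for w1, w2, score in gold:
        try:
            predicted.append(similarity(w1, w2))
        except KeyError:
            continue
        human.append(score)
    return spearman(human, predicted), len(human)

# Toy stand-ins (invented values, NOT taken from the real data set or model).
toy_lines = ["cat\tdog\t8.0", "car\ttruck\t7.0", "cup\tmoon\t1.0", "cat\tzzz\t3.0"]
toy_sims = {("cat", "dog"): 0.8, ("car", "truck"): 0.6, ("cup", "moon"): 0.1}
rho, n_pairs = evaluate(read_gold(toy_lines), lambda a, b: toy_sims[(a, b)])
print(rho, n_pairs)
```

A rank correlation is used here because human scores and cosine similarities live on different scales; only the ordering of the pairs is compared.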


---

### [E5] Paraphraser

Sometimes it might be useful to paraphrase a sentence. Words with similar meanings could serve as the basis for a simple paraphraser system. How can you use word embeddings to implement such a system?

ANSWER: One could replace some fraction of the words in the sentence with their closest neighbours. This would only work for the "meaningful" words (verbs, nouns, adjectives), but not for function words such as articles or prepositions. Replacing numerals would also clearly be a bad idea. However, even replacing nouns with their closest counterpart of the same part of speech would still result in a weird kind of paraphrasing, as fixed expressions and normal collocations would not be preserved. It should also be noted that real paraphrasing relies on grammar more than on simply replacing the words in a sentence with synonyms, and a large cosine similarity does not even guarantee that the words are true synonyms.

If you have time and resources, implement such a paraphraser (hint: you can use the `most_similar()` method of the `KeyedVectors` model loaded above). Does the paraphraser perform well? Are there cases where it does not perform as well as expected?
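
A very small starting point could look like the sketch below. It assumes whitespace tokenisation, a tiny hand-written stop-word list, and a `nearest` function that returns a replacement word or raises `KeyError`; all names here (`STOP_WORDS`, `paraphrase`, `toy_nearest`) are invented for the example. With the real model, `nearest` could be `lambda w: model.most_similar(w, topn=1)[0][0]`.

```python
# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was"}

def paraphrase(sentence, nearest, stop_words=STOP_WORDS):
    """Replace each content word by its nearest neighbour, if one is known."""
    output = []
    for token in sentence.split():
        # Keep function words and numerals untouched.
        if token.lower() in stop_words or token.isdigit():
            output.append(token)
            continue
        try:
            output.append(nearest(token))
        except KeyError:
            # Unknown word: keep the original token.
            output.append(token)
    return " ".join(output)

# Toy neighbour table standing in for the embedding model.
toy_neighbours = {"quick": "fast", "car": "vehicle"}

def toy_nearest(word):
    return toy_neighbours[word]

result = paraphrase("the quick car passed 2 trucks", toy_nearest)
print(result)
```

This sketch ignores part of speech, punctuation and multi-word expressions, which are exactly the weak points to look for when you judge the output.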


### [E6] Evaluate Word Embedding

With the model, Google also released syntactic and semantic test examples, following the "A is to B as C is to D" style. The evaluation file can be found [here](https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt). Please use these examples to evaluate the model. What is the accuracy of the model on these evaluation examples?

Try to find some interesting examples of relations the model did not predict correctly. Do you think this evaluation is helpful? Please comment briefly!
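
To see what such an evaluation does under the hood, here is a minimal sketch with toy vectors. It assumes the file format used by `questions-words.txt` (section headers starting with `:`, otherwise four words per line) and solves each analogy by picking the word closest to vec(B) - vec(A) + vec(C), excluding the three question words. The function names and the toy vectors are made up for illustration; for the real model, gensim's `KeyedVectors.evaluate_word_analogies` offers a ready-made evaluation.

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return the word whose vector is closest (cosine) to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the question words themselves are not allowed as answers
        sim = float(vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def analogy_accuracy(lines, vectors):
    """Count correct answers over all 4-word lines with known vocabulary."""
    correct = total = 0
    for line in lines:
        parts = line.split()
        if line.startswith(":") or len(parts) != 4:
            continue  # section header or malformed line
        if any(word not in vectors for word in parts):
            continue  # skip questions with out-of-vocabulary words
        a, b, c, expected = parts
        total += 1
        correct += solve_analogy(a, b, c, vectors) == expected
    return correct, total

# Toy vectors with hand-picked dimensions (male, female, royal).
toy_vectors = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.0]),
}
toy_lines = [": toy-section", "man woman king queen", "man king woman queen",
             "man woman prince princess"]
correct, total = analogy_accuracy(toy_lines, toy_vectors)
print(correct, "of", total, "correct")
```

Note the design choice of excluding the question words from the candidates: without it, the nearest vector is often one of the inputs, which is worth keeping in mind when you comment on how informative this evaluation is.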

ANSWER: I think this mode of evaluation is only one of many possible ways of almost-intrinsic embedding model evaluation, and possibly not the best one. By almost-intrinsic I mean that it is not tied to any particular NLP task, but only to the semantic relations between words that we might like to preserve in the representation. However, it is questionable whether these grammatical (dog:dogs = cat:cats) and semantic (Russia:ruble = Mexico:peso) analogies really reflect the kind of relations that we would want to preserve. Also, many of the words in this test set might be absent from the training corpus if it is not very large (the capital city of Alabama?).

---


### [E7] Limitations

Using word embeddings, words or phrases can be mapped to a continuous vector space (e.g. Word2Vec, GloVe). In your opinion, is there a limitation to this approach? Please make a brief comment.

ANSWER: This type of representation treats words as separate entities, as if looking up their meaning in a continuous vector space dictionary. That implies a "bag of words" approach for sentences, too. This issue is partially solved by the introduction of context-dependent embeddings, but here the questions of corpus size and representativeness come into play even more. If I need to apply word embeddings to spoken language, is it valid to assume that the meaning of the words would be well represented by a model trained on a corpus of written language? Is it reasonable to assume that this model would ge

---

### [E8] Lots of Banks

Suppose you are given two sentences: 'The man was accused of robbing a bank.' and 'The man went fishing by the bank of the river.' Do you think the word embedding for the word 'bank' will be similar in both sentences? Does this pose a problem?

ANSWER: Yes, context-independent word embeddings cannot distinguish between the meanings of polysemous words and homonyms, and represent each such word with a single vector. This vector is not even a plain average of the senses, as the frequency of each sense influences it as well. Thus, the resulting vector might be very far from any of the true senses of the word. Context-dependent embeddings, such as ELMo or BERT, try to account for the parts of meaning that can be derived from the sentence surrounding the word, and are thus assumed to be better at handling such cases.

---