# 11. Semantics 1: words - Lab exercises

### 11.E1 [Accessing WordNet using NLTK](#11.E1)

### 11.E2 [Using word embeddings](#11.E2)

### 11.E3 [Vector similarity in WordNet](#11.E3)

## 11.E1 Accessing WordNet using NLTK
<a id='11.E1'></a>

NLTK (_Natural Language Toolkit_) is a Python library that provides access to many NLP tools and resources. The NLTK WordNet interface is described here: http://www.nltk.org/howto/wordnet.html

The NLTK Python package can be installed using pip:

In [None]:
!pip install nltk

Import nltk and use its internal download tool to get WordNet:

In [None]:
import nltk
nltk.download('wordnet')

Import the wordnet module:

In [None]:
from nltk.corpus import wordnet as wn

Access synsets of a word using the _synsets_ function:

In [None]:
club_synsets = wn.synsets('club')
print(club_synsets)

Each synset has a _definition_ method that returns its gloss:

In [None]:
for synset in club_synsets:
    print("{0}\t{1}".format(synset.name(), synset.definition()))

In [None]:
dog = wn.synsets('dog')[0]
dog.definition()

List lemmas of a synset:

In [None]:
dog.lemmas()

List hypernyms and hyponyms of a synset:

In [None]:
dog.hypernyms()

In [None]:
dog.hyponyms()

The _closure_ method of synsets allows us to retrieve the transitive closure of the hypernym, hyponym, etc. relations:

In [None]:
list(dog.closure(lambda s: s.hypernyms()))
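
To make the idea of a transitive closure concrete, here is a minimal pure-Python sketch over a made-up, heavily simplified taxonomy. The `TOY_HYPERNYMS` dictionary and the `toy_closure` helper are illustrative assumptions, not NLTK's implementation; the helper follows a relation breadth-first and yields each reachable node once.

```python
from collections import deque

# Hypothetical toy taxonomy: each node maps to its direct hypernyms.
TOY_HYPERNYMS = {
    'dog': ['canine', 'domestic_animal'],
    'canine': ['carnivore'],
    'domestic_animal': ['animal'],
    'carnivore': ['mammal'],
    'mammal': ['animal'],
    'animal': [],
}


def toy_closure(node, rel):
    """Yield every node reachable from `node` by applying `rel` repeatedly."""
    seen = set()
    queue = deque(rel(node))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reported via another path
        seen.add(current)
        yield current
        queue.extend(rel(current))


all_hypernyms = list(toy_closure('dog', lambda s: TOY_HYPERNYMS.get(s, [])))
print(all_hypernyms)
```

Passing a different relation (for example a hyponym lookup) to the same helper gives the closure of that relation instead.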

common_hypernyms and lowest_common_hypernyms take another synset as an argument and return the hypernyms that the two synsets share:

In [None]:
cat = wn.synsets('cat')[0]
dog.lowest_common_hypernyms(cat)

In [None]:
dog.common_hypernyms(cat)

In [None]:
dog.path_similarity(cat)
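
path_similarity scores two synsets as 1 / (1 + d), where d is the length of the shortest path connecting them through the hypernym hierarchy. The sketch below reproduces that formula on a made-up toy taxonomy; `TOY_HYPERNYMS`, `ancestor_distances` and `toy_path_similarity` are hypothetical names used only for illustration, and the code is a simplified stand-in rather than NLTK's own implementation.

```python
from collections import deque

# Hypothetical toy taxonomy: each node maps to its direct hypernyms.
TOY_HYPERNYMS = {
    'dog': ['canine'],
    'cat': ['feline'],
    'canine': ['carnivore'],
    'feline': ['carnivore'],
    'carnivore': ['mammal'],
    'mammal': [],
}


def ancestor_distances(node, hypernyms):
    """Map the node and each of its ancestors to the fewest hypernym steps needed."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in hypernyms.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def toy_path_similarity(a, b, hypernyms):
    """Return 1 / (1 + shortest path length), or None if no ancestor is shared."""
    dist_a = ancestor_distances(a, hypernyms)
    dist_b = ancestor_distances(b, hypernyms)
    shared = set(dist_a) & set(dist_b)
    if not shared:
        return None
    shortest = min(dist_a[n] + dist_b[n] for n in shared)
    return 1.0 / (1 + shortest)


print(toy_path_similarity('dog', 'cat', TOY_HYPERNYMS))
```

In this toy hierarchy the shortest path from 'dog' to 'cat' runs through 'carnivore' and has four edges, so the score is 1 / 5.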

To iterate through all synsets, optionally restricted to one part of speech, use all_synsets, which returns a generator:

In [None]:
wn.all_synsets(pos='n')

In [None]:
for c, noun in enumerate(wn.all_synsets(pos='n')):
    if c > 5:
        break
    print(noun.name())

__Exercise (optional)__: Use WordNet to implement the "Guess the category" game: the program lists lemmas that all share a hypernym, which the user has to guess.

## 11.E2 Using word embeddings
<a id='11.E2'></a>

- Download and extract the word embedding [glove.6B](http://nlp.stanford.edu/data/glove.6B.zip), which was trained on 6 billion tokens of English text using the [GloVe](https://nlp.stanford.edu/projects/glove/) algorithm. The cell below fetches only the 50-dimensional vectors (glove.6B.50d.txt).

In [None]:
!wget http://sandbox.hlt.bme.hu/~recski/stuff/glove.6B.50d.txt.gz
!gunzip -f glove.6B.50d.txt.gz

- Read the embedding into a 2D numpy array. Word forms should be stored in a separate 1D array. Also create a word index, a dictionary that maps each word to its row index in the embedding. Vectors should be normalized to unit length.

In [None]:
import numpy as np

In [None]:
# words, word_index, emb = read_embedding('glove.6B.50d.txt')
# emb = normalize_embedding(emb)
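
One possible way to approach this step is sketched below. It assumes the GloVe text format, where each line holds a word followed by its space-separated vector components; the function names follow the commented-out calls above, but the bodies are only an illustrative sketch, not a reference solution.

```python
import numpy as np


def read_embedding(path):
    """Read a GloVe-style text file into (words, word_index, emb)."""
    words, vectors = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.rstrip().split(' ')
            words.append(fields[0])
            vectors.append([float(x) for x in fields[1:]])
    emb = np.array(vectors, dtype=np.float64)   # shape: (vocab_size, dim)
    word_index = {word: i for i, word in enumerate(words)}
    return np.array(words), word_index, emb


def normalize_embedding(emb):
    """Scale every row to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                     # avoid division by zero
    return emb / norms
```

With unit-length rows, the dot product of two rows equals their cosine similarity, which simplifies the following steps.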

- Write a function that takes two words and the embedding as input and returns their cosine similarity.

In [None]:
# vec_sim('cat', 'dog', word_index, emb)
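
A minimal sketch of such a function follows. It divides by the vector norms explicitly, so it also works if the embedding has not been normalized; the toy index and vectors at the end are made up for illustration.

```python
import numpy as np


def vec_sim(w1, w2, word_index, emb):
    """Cosine similarity of the vectors of two words."""
    v1 = emb[word_index[w1]]
    v2 = emb[word_index[w2]]
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


# Made-up two-dimensional toy embedding, for illustration only.
toy_index = {'cat': 0, 'dog': 1, 'car': 2}
toy_emb = np.array([[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]])
print(vec_sim('cat', 'dog', toy_index, toy_emb))
```

A word missing from `word_index` raises a `KeyError` here; whether to return a default value instead is a design choice left open.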

- Implement a function that takes a word as a parameter and returns the 5 words that are closest to it in the embedding space.

In [None]:
# print(nearest_n('dog', words, word_index, emb))
# print(nearest_n('king', words, word_index, emb))

## 11.E3 Vector similarity in WordNet
<a id='11.E3'></a>

Use the code written in __11.E2__ to analyze word groups in WordNet:

- Create an embedding of WordNet synsets by mapping each of them to the mean of their lemmas' vectors.

In [None]:
# synset_emb = embed_synsets(words, word_index, emb)
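
One way to build such a synset embedding is sketched below. To keep the example runnable without WordNet, the hypothetical helper `embed_synsets_from_lemmas` takes a plain dictionary from synset names to lemma names; this input format, the lowercasing of lemmas, and the choice to skip synsets with no known lemma are all assumptions of the sketch.

```python
import numpy as np


def embed_synsets_from_lemmas(synset_lemmas, word_index, emb):
    """Map each synset name to the unit-length mean of its lemmas' vectors.

    Lemmas missing from the embedding are ignored; synsets with no
    known lemma at all are left out of the result.
    """
    synset_emb = {}
    for name, lemmas in synset_lemmas.items():
        rows = [word_index[l.lower()] for l in lemmas if l.lower() in word_index]
        if not rows:
            continue
        mean = emb[rows].mean(axis=0)
        norm = np.linalg.norm(mean)
        if norm > 0:
            synset_emb[name] = mean / norm
    return synset_emb


# Made-up toy data, for illustration only.
toy_index = {'dog': 0, 'hound': 1, 'cat': 2}
toy_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
toy_lemmas = {
    'dog.n.01': ['dog', 'Hound', 'domestic_dog'],
    'cat.n.01': ['cat'],
    'zzz.n.01': ['zzz'],
}
toy_synset_emb = embed_synsets_from_lemmas(toy_lemmas, toy_index, toy_emb)
print(sorted(toy_synset_emb))
```

With WordNet loaded, the input dictionary could be built as `{s.name(): s.lemma_names() for s in wn.all_synsets()}`. Multi-word lemmas such as `domestic_dog` contain underscores and will usually be missing from the GloVe vocabulary, so they are silently skipped here.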

- Write a function that measures the similarity of two synsets based on the cosine similarity of their vectors.

In [None]:
# synset_sim(dog, cat, synset_emb)

- Write a function that takes a synset as input and retrieves the n most similar synsets, using the above embedding.

In [None]:
# nearest_n_synsets(wn.synsets('penguin')[0], synset_emb, 10)

- Build the list of all words that are both in WordNet and in the GloVe embedding. On a sample of 100 such words, measure the Spearman correlation between synset similarity and vector similarity (use scipy.stats.spearmanr).

In [None]:
# compare_sims(sample, synset_emb, word_index, emb)
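
The correlation step itself can be sketched independently of WordNet and GloVe. The hypothetical helper `correlate_sims` below scores every unordered pair from the sample with two similarity functions and passes the two score lists to `scipy.stats.spearmanr`; comparing all pairs is an assumption of this sketch, and the exercise may intend a different pairing.

```python
from itertools import combinations

from scipy.stats import spearmanr


def correlate_sims(sample, sim_a, sim_b):
    """Spearman correlation of two similarity functions over all pairs in `sample`."""
    pairs = list(combinations(sample, 2))
    scores_a = [sim_a(x, y) for x, y in pairs]
    scores_b = [sim_b(x, y) for x, y in pairs]
    rho, _p_value = spearmanr(scores_a, scores_b)
    return float(rho)


# Made-up stand-ins for the two similarity measures, for illustration only.
toy_sample = [1, 2, 4, 8]
rho_same = correlate_sims(toy_sample,
                          lambda x, y: -abs(x - y),
                          lambda x, y: -2 * abs(x - y))
rho_opposite = correlate_sims(toy_sample,
                              lambda x, y: -abs(x - y),
                              lambda x, y: abs(x - y))
print(rho_same, rho_opposite)
```

In the exercise, `sim_a` and `sim_b` would wrap the synset-based and vector-based similarity functions written earlier, for example by looking up the first synset of each word.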