# 11. Semantics 1: words - Lab exercises

### 11.E1 [Accessing WordNet using NLTK](#11.E1)

### 11.E2 [Using word embeddings](#11.E2)

### 11.E3 [Comparing WordNet and word embeddings](#11.E3)

## 11.E1 Accessing WordNet using NLTK
<a id='11.E1'></a>

NLTK (_Natural Language Toolkit_) is a Python library that provides access to many NLP tools and resources. The NLTK WordNet interface is described here: http://www.nltk.org/howto/wordnet.html

The NLTK Python package can be installed using pip:

In [1]:
!pip install nltk



Import nltk and use its internal download tool to get WordNet:

In [2]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/recski/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Import the wordnet module:

In [3]:
from nltk.corpus import wordnet as wn

Access synsets of a word using the _synsets_ function:

In [4]:
club_synsets = wn.synsets('club')
print(club_synsets)

[Synset('baseball_club.n.01'), Synset('club.n.02'), Synset('club.n.03'), Synset('clubhouse.n.01'), Synset('golf_club.n.02'), Synset('club.n.06'), Synset('cabaret.n.01'), Synset('club.v.01'), Synset('club.v.02'), Synset('club.v.03'), Synset('club.v.04')]


Each synset has a _definition_ method:

In [5]:
for synset in club_synsets:
    print("{0}\t{1}".format(synset.name(), synset.definition()))

baseball_club.n.01	a team of professional baseball players who play and travel together
club.n.02	a formal association of people with similar interests
club.n.03	stout stick that is larger at one end
clubhouse.n.01	a building that is occupied by a social club
golf_club.n.02	golf equipment used by a golfer to hit a golf ball
club.n.06	a playing card in the minor suit that has one or more black trefoils on it
cabaret.n.01	a spot that is open late at night and that provides entertainment (as singers or dancers) as well as dancing and food and drink
club.v.01	unite with a common purpose
club.v.02	gather and spend time together
club.v.03	strike with a club or a bludgeon
club.v.04	gather into a club-like mass


In [6]:
dog = wn.synsets('dog')[0]
dog.definition()

'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

List lemmas of a synset:

In [7]:
dog.lemmas()

[Lemma('dog.n.01.dog'),
 Lemma('dog.n.01.domestic_dog'),
 Lemma('dog.n.01.Canis_familiaris')]

List hypernyms and hyponyms of a synset:

In [8]:
dog.hypernyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

In [9]:
dog.hyponyms()

[Synset('basenji.n.01'),
 Synset('corgi.n.01'),
 Synset('cur.n.01'),
 Synset('dalmatian.n.02'),
 Synset('great_pyrenees.n.01'),
 Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'),
 Synset('lapdog.n.01'),
 Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'),
 Synset('newfoundland.n.01'),
 Synset('pooch.n.01'),
 Synset('poodle.n.01'),
 Synset('pug.n.01'),
 Synset('puppy.n.01'),
 Synset('spitz.n.01'),
 Synset('toy_dog.n.01'),
 Synset('working_dog.n.01')]

The _closure_ method of synsets allows us to retrieve the transitive closure of the hypernym, hyponym, etc. relations:

In [10]:
list(dog.closure(lambda s: s.hypernyms()))

[Synset('canine.n.02'),
 Synset('domestic_animal.n.01'),
 Synset('carnivore.n.01'),
 Synset('animal.n.01'),
 Synset('placental.n.01'),
 Synset('organism.n.01'),
 Synset('mammal.n.01'),
 Synset('living_thing.n.01'),
 Synset('vertebrate.n.01'),
 Synset('whole.n.02'),
 Synset('chordate.n.01'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]
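
To make the idea of a transitive closure concrete, here is a minimal sketch on a hand-written toy hypernym table. The table `TOY_HYPERNYMS` and the function `closure` are illustrative assumptions, not WordNet data and not NLTK's implementation. The function follows a relation breadth-first and yields each reachable node once:

```python
from collections import deque

# Toy hypernym table (made up for illustration; not real WordNet data).
TOY_HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["animal"],
    "animal": [],
}

def closure(start, rel):
    """Yield every node reachable from `start` via `rel`, breadth-first, once each."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in rel(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                yield nxt

reachable = list(closure("dog", lambda n: TOY_HYPERNYMS.get(n, [])))
print(reachable)
```

Direct hypernyms come first and more distant ancestors later, which is the same kind of ordering visible in the WordNet output above.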

The _common_hypernyms_ and _lowest_common_hypernyms_ methods work in relation to another synset:

In [11]:
cat = wn.synsets('cat')[0]
dog.lowest_common_hypernyms(cat)

[Synset('carnivore.n.01')]

In [12]:
dog.common_hypernyms(cat)

[Synset('placental.n.01'),
 Synset('object.n.01'),
 Synset('vertebrate.n.01'),
 Synset('living_thing.n.01'),
 Synset('carnivore.n.01'),
 Synset('animal.n.01'),
 Synset('mammal.n.01'),
 Synset('chordate.n.01'),
 Synset('whole.n.02'),
 Synset('entity.n.01'),
 Synset('organism.n.01'),
 Synset('physical_entity.n.01')]

In [13]:
dog.path_similarity(cat)

0.2
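
The value 0.2 can be reproduced by hand. `path_similarity` is 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets in the hypernym/hyponym taxonomy. The sketch below assumes a hand-written fragment of the taxonomy around _dog_ and _cat_; the table `HYPERNYMS` and the helper names are illustrative, not NLTK code. The shortest path dog, canine, carnivore, feline, cat has 4 edges, giving 1/5:

```python
from collections import deque

# Hand-written fragment of the hypernym hierarchy (illustrative only).
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["placental"],
    "placental": ["mammal"],
    "mammal": ["vertebrate"],
    "vertebrate": ["chordate"],
    "chordate": ["animal"],
    "animal": [],
}

def neighbours(node):
    """Hypernyms and hyponyms of `node`: edges can be walked in both directions."""
    up = HYPERNYMS.get(node, [])
    down = [child for child, parents in HYPERNYMS.items() if node in parents]
    return up + down

def shortest_path_length(a, b):
    """Number of edges on the shortest path from a to b, or None if unconnected."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def path_sim(a, b):
    """1 / (shortest path length + 1), mirroring the definition above."""
    d = shortest_path_length(a, b)
    return None if d is None else 1.0 / (d + 1)

print(path_sim("dog", "cat"))
```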

To iterate through all synsets, optionally restricted to one part of speech, use _all_synsets_, which returns a generator:

In [14]:
wn.all_synsets(pos='n')

<generator object WordNetCorpusReader.all_synsets at 0x7f7060ef3ca8>

In [15]:
for c, noun in enumerate(wn.all_synsets(pos='n')):
    if c > 5:
        break
    print(noun.name())

entity.n.01
physical_entity.n.01
abstraction.n.06
thing.n.12
object.n.01
whole.n.02
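
Since `all_synsets` returns a generator, `itertools.islice` is an alternative to the counter-and-`break` pattern above. The sketch below uses a stand-in generator (`fake_synset_names`, a hypothetical placeholder) so that it runs without WordNet; with NLTK loaded, the same `islice` call should work on `wn.all_synsets(pos='n')`.

```python
from itertools import islice

def fake_synset_names():
    """Stand-in for a large lazy iterator such as wn.all_synsets(pos='n')."""
    n = 0
    while True:  # items are produced on demand, never stored in a list
        n += 1
        yield "synset_{0:02d}".format(n)

# Take only the first six items; the rest of the generator is never consumed.
first_six = list(islice(fake_synset_names(), 6))
print(first_six)
```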


__Exercise (optional)__: use WordNet to implement the "Guess the category" game: the program lists lemmas that all share a hypernym, which the user has to guess.
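
One possible starting point for the game, given as a rough sketch rather than a reference solution. All function names here (`make_round`, `is_correct`, `categories_from_wordnet`) are hypothetical. The game logic works on any mapping from a category name to its members, so the demo uses toy data; `categories_from_wordnet` shows how such a mapping might be built with the calls introduced above and needs NLTK plus the WordNet data.

```python
import random

def make_round(categories, n_clues=4, rng=random):
    """Pick a category with at least n_clues members; return (answer, clues)."""
    eligible = [c for c, members in categories.items() if len(members) >= n_clues]
    answer = rng.choice(sorted(eligible))
    clues = rng.sample(sorted(categories[answer]), n_clues)
    return answer, clues

def is_correct(guess, answer):
    """Compare case-insensitively, treating spaces and underscores alike."""
    def norm(s):
        return s.strip().lower().replace(" ", "_")
    return norm(guess) == norm(answer)

def categories_from_wordnet(min_members=4):
    """Map the first lemma of each noun synset to the lemma names of its hyponyms.

    Requires NLTK and the WordNet data; not called in the demo below.
    Synsets sharing a first lemma overwrite each other in this simple version.
    """
    from nltk.corpus import wordnet as wn
    categories = {}
    for synset in wn.all_synsets(pos='n'):
        members = {l.name() for hypo in synset.hyponyms() for l in hypo.lemmas()}
        if len(members) >= min_members:
            categories[synset.lemmas()[0].name()] = members
    return categories

# Toy data so the sketch runs without WordNet.
toy = {"dog": {"poodle", "pug", "corgi", "basenji", "spitz"},
       "fruit": {"apple", "pear"}}
answer, clues = make_round(toy, rng=random.Random(0))
print(clues, "->", answer)
```

A full game would wrap `make_round` and `is_correct` in a loop that reads guesses with `input()`.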

## 11.E2 Using word embeddings
<a id='11.E2'></a>

- Download and extract the word embedding [glove.6B](http://nlp.stanford.edu/data/glove.6B.zip), which was trained on 6 billion tokens of English text using the [GloVe](https://nlp.stanford.edu/projects/glove/) algorithm. The cell below fetches only the 50-dimensional vectors, glove.6B.50d.txt.

In [16]:
!wget http://sandbox.hlt.bme.hu/~recski/stuff/glove.6B.50d.txt.gz
!gunzip -f glove.6B.50d.txt.gz

--2017-11-30 15:11:51--  http://sandbox.hlt.bme.hu/~recski/stuff/glove.6B.50d.txt.gz
Resolving sandbox.hlt.bme.hu (sandbox.hlt.bme.hu)... 152.66.88.21
Connecting to sandbox.hlt.bme.hu (sandbox.hlt.bme.hu)|152.66.88.21|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69182520 (66M) [application/x-gzip]
Saving to: ‘glove.6B.50d.txt.gz’


2017-11-30 15:11:57 (11,1 MB/s) - ‘glove.6B.50d.txt.gz’ saved [69182520/69182520]



- Read the embedding into a 2D numpy array. Word forms should be stored in a separate 1D array. Also create a word index, a dictionary that returns the index of each word in the embedding. Vectors should be normalized to a length of 1.

In [17]:
import numpy as np

In [18]:
def read_embedding(fn):
    words = []
    emb = []
    word_index = {}
    c = 0
    with open(fn, encoding='utf-8') as f:
        for line in f:
            fields = line.strip().split()
            emb.append(np.array([float(i) for i in fields[1:]], dtype='float32'))
            words.append(fields[0])
            word_index[fields[0]] = c
            c += 1

    print("read {0} lines".format(c))
    return np.array(words), word_index, np.array(emb)    

In [19]:
def normalize_embedding(emb):
    return emb / np.linalg.norm(emb, axis=1)[:,None]

In [20]:
words, word_index, emb = read_embedding('glove.6B.50d.txt')
emb = normalize_embedding(emb)

read 400000 lines
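
A quick way to confirm the normalization step is to check the row norms on a small array. The sketch below uses made-up vectors and an equivalent formulation with `keepdims=True`; the names `toy` and `normalize_rows` are illustrative. An all-zero row would cause a division by zero, which this sketch does not handle.

```python
import numpy as np

def normalize_rows(emb):
    """Scale each row of a 2D array to unit Euclidean length."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)  # shape (n, 1)
    return emb / norms  # broadcasting divides every row by its own norm

toy = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]], dtype="float32")
unit = normalize_rows(toy)
print(np.linalg.norm(unit, axis=1))  # every entry should be close to 1
```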


- Write a function that takes two words and the embedding as input and returns their cosine similarity

In [21]:
def vec_sim(w1, w2, word_index, emb):
    if w1 not in word_index or w2 not in word_index:
        return None
    return np.dot(emb[word_index[w1]], emb[word_index[w2]])

In [22]:
vec_sim('cat', 'dog', word_index, emb)

0.92180049
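
`vec_sim` can use a plain dot product only because the rows of `emb` were normalized beforehand. For reference, the general definition is cos(u, v) = u·v / (‖u‖ ‖v‖). Below is a minimal pure-Python sketch of that formula on made-up vectors; the function name `cosine` is an illustrative choice.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different lengths
```

Cosine similarity ignores vector length, so the second pair scores (up to rounding) 1 even though the vectors differ in magnitude.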

- Implement a function that takes a word as a parameter and returns the 5 words that are closest to it in the embedding space

In [23]:
def nearest_n(word, words, word_index, emb, n=5):
    try:
        w_index = word_index[word]
    except KeyError:
        return None
    w_vec = emb[w_index]

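    # Rows of emb are unit-length, so these dot products are cosine similarities.
    sims = np.dot(emb, w_vec)
    # Highest similarity first; the query word itself normally ranks first.
    indices = np.argsort(sims)[-n:][::-1]
    return [words[i] for i in indices]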

In [24]:
print(nearest_n('dog', words, word_index, emb))
print(nearest_n('king', words, word_index, emb))

['dog', 'cat', 'dogs', 'horse', 'puppy']
['king', 'prince', 'queen', 'ii', 'emperor']
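
Both result lists start with the query word itself, since every unit vector has similarity 1 with itself. Below is a sketch of a variant that leaves the query out and uses `np.argpartition` to avoid sorting the whole vocabulary. The function name `nearest_n_excluding_self` and the toy data are assumptions made for illustration.

```python
import numpy as np

def nearest_n_excluding_self(word, words, word_index, emb, n=5):
    """Return the n words most similar to `word`, leaving out `word` itself.

    Assumes the rows of `emb` are unit-length, so dot product equals cosine.
    """
    if word not in word_index:
        return None
    w_index = word_index[word]
    sims = emb @ emb[w_index]          # new array, safe to modify
    sims[w_index] = -np.inf            # never return the query itself
    n = min(n, len(words) - 1)
    if n <= 0:
        return []
    top = np.argpartition(sims, -n)[-n:]      # indices of the top n, unordered
    top = top[np.argsort(sims[top])[::-1]]    # order those n by similarity
    return [words[i] for i in top]

# Toy embedding: two "animal-like" and two "vehicle-like" directions.
toy_words = np.array(["cat", "dog", "car", "truck"])
toy_index = {w: i for i, w in enumerate(toy_words)}
toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_emb = toy_emb / np.linalg.norm(toy_emb, axis=1, keepdims=True)

print(nearest_n_excluding_self("cat", toy_words, toy_index, toy_emb, n=2))
```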


## 11.E3 Comparing WordNet and word embeddings
<a id='11.E3'></a>

Use the code written in __11.E2__ to analyze word groups in WordNet:

- Create an embedding of WordNet synsets by mapping each of them to the mean of their lemmas' vectors.

In [25]:
def embed_synset(synset, words, word_index, emb):
    word_set = [lemma.name() for lemma in synset.lemmas()]
    # Keep index 0 too: filter(None, ...) would silently drop it as falsy.
    indices = [word_index[w] for w in word_set if w in word_index]
    vecs = np.array([emb[i] for i in indices])
    if len(vecs) == 0:
        return None
    return np.mean(vecs, axis=0)

def embed_synsets(words, word_index, emb):
    return {synset: embed_synset(synset, words, word_index, emb) for synset in wn.all_synsets()}

In [26]:
synset_emb = embed_synsets(words, word_index, emb)

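- Write a function that measures the similarity of two synsets based on the cosine similarity of their vectors. Note that the mean of several unit vectors is generally shorter than unit length, so a plain dot product of synset vectors is guaranteed to equal their cosine similarity only when each synset is represented by a single lemma vector.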

In [27]:
def synset_sim(ss1, ss2, synset_emb):
    vec1 = synset_emb[ss1]
    vec2 = synset_emb[ss2]
    if vec1 is None or vec2 is None:
        return None
    return np.dot(vec1, vec2)

In [28]:
synset_sim(dog, cat, synset_emb)

0.92180049
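
The result above is identical to the output of `vec_sim('cat', 'dog', ...)`, which suggests that each of the two synsets is represented by a single lemma vector here. For synsets with several embedded lemmas, the mean vector is usually shorter than unit length, and a raw dot product then underestimates the cosine. A small illustration on made-up vectors (the `cosine` helper is an assumption for this sketch, not part of the notebook):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mean_ab = (a + b) / 2                    # mean of two unit vectors
print(np.linalg.norm(mean_ab))           # about 0.707, not 1

def cosine(u, v):
    """Cosine similarity with explicit length normalization."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(np.dot(mean_ab, a))                # 0.5: raw dot product
print(cosine(mean_ab, a))                # about 0.707: true cosine
```

Dividing by the two norms inside `synset_sim`, or renormalizing the mean in `embed_synset`, would make the score a true cosine; the numbers reported in this notebook were produced without that step.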

- Write a function that takes a synset as input and retrieves the n most similar synsets, using the above embedding

In [29]:
def nearest_n_synsets(synset, synset_emb, n=5):
    sims = [(synset_sim(synset, other, synset_emb), other) for other in wn.all_synsets() if synset != other]
    # Drop synsets that have no vector (none of their lemmas are in the embedding).
    sims = [(sim, other) for sim, other in sims if sim is not None]
    return sorted(sims, reverse=True)[:n]

In [30]:
%%time
nearest_n_synsets(wn.synsets('penguin')[0], synset_emb, 10)

CPU times: user 3.43 s, sys: 76 ms, total: 3.51 s
Wall time: 3.51 s


[(0.67804796, Synset('puffin.n.01')),
 (0.63854223, Synset('bantam.n.01')),
 (0.63685226, Synset('paperback_book.n.01')),
 (0.63685226, Synset('paperback.s.01')),
 (0.62880427, Synset('imprint.n.05')),
 (0.62880427, Synset('imprint.n.04')),
 (0.62880427, Synset('imprint.n.03')),
 (0.62880427, Synset('imprint.n.01')),
 (0.59954, Synset('signet.n.01')),
 (0.56959021, Synset('python.n.02'))]

- Build the list of all words that are in both WordNet and the GloVe embedding. On a sample of 100 such words, measure the Spearman correlation of synset similarity and vector similarity (use scipy.stats.spearmanr).

In [31]:
words_in_both = [word for word in wn.all_lemma_names() if word in word_index]

In [32]:
len(words_in_both)

55666

In [33]:
import random
sample = random.sample(words_in_both, 100)

In [34]:
from scipy.stats import spearmanr

def compare_sims(sample, synset_emb, word_index, emb):
    vec_sims, ss_sims = [], []
    for w1 in sample:
        for w2 in sample:
            ss_sim = synset_sim(wn.synsets(w1)[0], wn.synsets(w2)[0], synset_emb)
            if ss_sim is None:
                continue
            v_sim = vec_sim(w1, w2, word_index, emb)
            vec_sims.append(v_sim)
            ss_sims.append(ss_sim)
    return spearmanr(vec_sims, ss_sims)

In [35]:
compare_sims(sample, synset_emb, word_index, emb)

SpearmanrResult(correlation=0.85134837070058833, pvalue=0.0)
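
`compare_sims` loops over all ordered pairs, so every pair is counted twice and each word is also paired with itself, which may inflate the correlation. One way to restrict the comparison to unordered pairs of distinct words is `itertools.combinations`. The sketch below only shows the pair generation; the helper name `unique_pairs` and the toy sample are illustrative assumptions.

```python
from itertools import combinations

def unique_pairs(sample):
    """All unordered pairs of distinct items: n * (n - 1) / 2 of them."""
    return list(combinations(sample, 2))

toy_sample = ["cat", "dog", "car", "truck"]
pairs = unique_pairs(toy_sample)
print(pairs)
print(len(unique_pairs(range(100))))  # number of pairs for a sample of 100
```

Replacing the two nested loops with `for w1, w2 in combinations(sample, 2)` would give 4950 pairs for a sample of 100 words instead of 10000.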