This notebook explores WordNet synsets, presenting a simple method for finding in a text all mentions of all hyponyms of a given node in the WordNet hierarchy (e.g., finding all *buildings* in a text).

In [4]:
import nltk, re, spacy
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/danielfurman/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [5]:
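# Load the small English model without the NER and parser components,
# which this notebook does not use. Each component name must be its own list item.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])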

Get the synsets for a given word.  The synsets here are roughly ordered by frequency of use (in a small tagged dataset), so that more frequent senses occur first.

In [6]:
synsets=wn.synsets('blue')
for synset in synsets:
    print (synset, synset.definition())

Synset('blue.n.01') blue color or pigment; resembling the color of the clear sky in the daytime
Synset('blue.n.02') blue clothing
Synset('blue.n.03') any organization or party whose uniforms or badges are blue
Synset('blue_sky.n.01') the sky as viewed during daylight
Synset('bluing.n.01') used to whiten laundry or hair or give it a bluish tinge
Synset('amobarbital_sodium.n.01') the sodium salt of amobarbital that is used as a barbiturate; used as a sedative and a hypnotic
Synset('blue.n.07') any of numerous small butterflies of the family Lycaenidae
Synset('blue.v.01') turn blue
Synset('blue.s.01') of the color intermediate between green and violet; having a color similar to that of a clear unclouded sky; - Helen Hunt Jackson
Synset('blue.s.02') used to signify the Union forces in the American Civil War (who wore blue uniforms)
Synset('gloomy.s.02') filled with melancholy and despondency
Synset('blasphemous.s.02') characterized by profanity or cursing
Synset('blue.s.05') suggestive of 

Get the words/phrases in that synset.

In [7]:
for lemma in wn.synset("blue.n.01").lemmas():
    print (lemma.name())

blue
blueness


In [8]:
# Functions from http://www.nltk.org/howto/wordnet.html to get *all* of a synset's hyponym/hypernyms
hypo = lambda s: s.hyponyms()
hyper = lambda s: s.hypernyms()

Find all of the synsets that are hyponyms of the target synset (*descendants* in the WordNet hierarchy).

In [9]:
list(wn.synset("blue.n.01").closure(hypo))

[Synset('azure.n.01'),
 Synset('dark_blue.n.01'),
 Synset('greenish_blue.n.01'),
 Synset('powder_blue.n.01'),
 Synset('prussian_blue.n.02'),
 Synset('purplish_blue.n.01'),
 Synset('steel_blue.n.01'),
 Synset('ultramarine.n.02')]

Find all of the synsets that are hypernyms (*ancestors* up the tree) of the target synset.

In [10]:
list(wn.synset("blue.n.01").closure(hyper))

[Synset('chromatic_color.n.01'),
 Synset('color.n.01'),
 Synset('visual_property.n.01'),
 Synset('property.n.02'),
 Synset('attribute.n.02'),
 Synset('abstraction.n.06'),
 Synset('entity.n.01')]
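As a rough illustration of what `closure` computes, the sketch below walks a toy parent table breadth-first and collects every ancestor once. The `TOY_HYPERNYMS` dictionary and the `toy_closure` helper are hypothetical names introduced here. The table is hand-copied from the output above on the assumption that each synset in that list is the direct hypernym of the one before it; it is a stand-in for WordNet, not a replacement for it.

```python
from collections import deque

# Hand-built stand-in for WordNet (an assumption for illustration):
# each key maps to the list of its direct hypernyms.
TOY_HYPERNYMS = {
    "blue.n.01": ["chromatic_color.n.01"],
    "chromatic_color.n.01": ["color.n.01"],
    "color.n.01": ["visual_property.n.01"],
    "visual_property.n.01": ["property.n.02"],
    "property.n.02": ["attribute.n.02"],
    "attribute.n.02": ["abstraction.n.06"],
    "abstraction.n.06": ["entity.n.01"],
    "entity.n.01": [],
}

def toy_closure(start, relation):
    """Breadth-first transitive closure of `relation`, excluding `start` itself."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in relation.get(node, []):
            # Visit each node only once, even if it is reachable by several paths.
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order

ancestors = toy_closure("blue.n.01", TOY_HYPERNYMS)
print(ancestors)
```

Inverting the table (mapping each synset to its children) and running the same function would give the hyponym closure instead.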

In [11]:
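def get_words_in_hypo(synset):
    """ Returns the set of words/phrases (lemma names) in a synset and all of its hyponyms.
    """
    words=set()
    # Transitive closure over the hyponym relation, plus the target synset itself
    hyponym_synsets=list(synset.closure(hypo))
    hyponym_synsets.append(synset)
    for hyponym_synset in hyponym_synsets:
        for lemma in hyponym_synset.lemmas():
            # WordNet joins the words of a multi-word lemma with underscores
            word=lemma.name().replace("_", " ")
            words.add(word)

    return words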

In [12]:
get_words_in_hypo(wn.synset("color.n.01"))

{"Davy's gray",
 "Davy's grey",
 'Indian red',
 'Paris green',
 'Prussian blue',
 'Turkey red',
 'Tyrian purple',
 'Vandyke brown',
 'Venetian red',
 'achromasia',
 'achromatic color',
 'achromatic colour',
 'alabaster',
 'alizarine red',
 'amber',
 'apatetic coloration',
 'aposematic coloration',
 'apricot',
 'aqua',
 'aquamarine',
 'ash gray',
 'ash grey',
 'azure',
 'beige',
 'black',
 'blackness',
 'bleach',
 'blond',
 'blonde',
 'blondness',
 'blue',
 'blue green',
 'blueness',
 'bluish green',
 'bone',
 'bottle green',
 'brick red',
 'brown',
 'brownish yellow',
 'brownness',
 'buff',
 'burgundy',
 'burnt sienna',
 'burnt umber',
 'canary',
 'canary yellow',
 'caramel',
 'caramel brown',
 'cardinal',
 'carmine',
 'carnation',
 'cerise',
 'cerulean',
 'chalk',
 'charcoal',
 'charcoal gray',
 'charcoal grey',
 'chartreuse',
 'cherry',
 'cherry red',
 'chestnut',
 'chocolate',
 'chromatic color',
 'chromatic colour',
 'chromatism',
 'chrome green',
 'chrome red',
 'claret',
 'coal b

In [13]:
def find_all_words_in_text(words, spacy_tokens):
    """ For a given set of words, find each instance among a list of tokens already
    processed by spacy.  Returns a list of token indexes that match.  (Note this only
    identifies single words, not multi-word phrases.)
    """
    all_matches=[]
    for idx, token in enumerate(spacy_tokens):
        if token.lemma_ in words:
            all_matches.append(idx)
    return all_matches
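Because `get_words_in_hypo` also returns multi-word entries such as 'burnt sienna', a single-token lookup will never match them. One possible extension, sketched here on plain lists of lemma strings rather than spaCy tokens, is to try the longest candidate phrase first at each position. `find_phrases_in_lemmas` is a hypothetical helper, and lower-casing both sides is an assumption that may not suit every text.

```python
def find_phrases_in_lemmas(phrases, lemmas):
    """Return (start, end) index pairs, end exclusive, for each phrase match.

    `phrases` is an iterable of strings whose words are separated by spaces;
    `lemmas` is a list of token lemmas in document order.
    """
    # Split each phrase into a lower-cased tuple and group the tuples by length.
    by_length = {}
    for phrase in phrases:
        parts = tuple(phrase.lower().split())
        if parts:
            by_length.setdefault(len(parts), set()).add(parts)
    lengths = sorted(by_length, reverse=True)  # longest phrases first

    lowered = [lemma.lower() for lemma in lemmas]
    matches = []
    position = 0
    while position < len(lowered):
        for length in lengths:
            window = tuple(lowered[position:position + length])
            if window in by_length[length]:
                matches.append((position, position + length))
                position += length  # skip past the matched phrase
                break
        else:
            position += 1  # nothing matched at this position
    return matches

toy_lemmas = ["she", "wear", "a", "burnt", "sienna", "gown", "and", "a", "blue", "hat"]
spans = find_phrases_in_lemmas({"burnt sienna", "sienna", "blue"}, toy_lemmas)
print(spans)
```

Preferring the longest phrase means 'burnt sienna' is reported once rather than also reporting 'sienna' inside it. Entries whose tokenization differs from a whitespace split (for example, possessives) would need extra handling.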

In [14]:
def print_concordance(matches, spacy_tokens, window=3):
    """ For a given set of token indexes, prints out a window of words around each match,
    in the style of a concordance.
    """
    
    RED="\x1b[31m"
    BLACK="\x1b[0m"
    
    spacing=window*10
    for match in matches:
        start=match-window
        end=match+window+1
        if start < 0:
            start=0
        if end > len(spacy_tokens):
            end=len(spacy_tokens)
        pre=' '.join([token.text for token in spacy_tokens[start:match]])
        post=' '.join([token.text for token in spacy_tokens[match+1:end]])
        print("%s %s%s%s %s" % (pre.rjust(spacing), RED, spacy_tokens[match].text, BLACK, post))
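Where ANSI colour codes are not rendered (for example, output saved to a plain-text file), a variant that returns strings and marks the match with brackets may be easier to work with. This is a sketch over plain string tokens rather than spaCy tokens; `concordance_lines` and the bracket marker are assumptions made for illustration.

```python
def concordance_lines(matches, tokens, window=3):
    """Build one concordance line per match index, marking the match with [ ]."""
    spacing = window * 10
    lines = []
    for match in matches:
        # Clip the window to the bounds of the token list.
        start = max(match - window, 0)
        end = min(match + window + 1, len(tokens))
        pre = ' '.join(tokens[start:match])
        post = ' '.join(tokens[match + 1:end])
        lines.append("%s [%s] %s" % (pre.rjust(spacing), tokens[match], post))
    return lines

toy_tokens = "he wore a blue coat , and rode a black horse .".split()
toy_lines = concordance_lines([3, 9], toy_tokens)
for line in toy_lines:
    print(line)
```

Returning a list instead of printing also makes the windowing logic straightforward to check in isolation.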

In [15]:
def read_text(filename):
    """ Read a text, replacing all whitespace sequences with a single space.
    """
    with open(filename, encoding="utf-8") as file:
        return re.sub(r"\s+", " ", file.read())

In [16]:
book=read_text("../data/pride_and_prejudice.txt")

In [17]:
spacy_tokens=nlp(book)

In [18]:
def wordnet_search(synset, spacy_tokens):
    """ This function searches through all of the tokens in the spacy_tokens argument to find
    any mention of words in the synset or any of its hyponyms, then prints a concordance of the matches.
    """
    targets=get_words_in_hypo(synset)
    matches=find_all_words_in_text(targets, spacy_tokens)
    print_concordance(matches, spacy_tokens)
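When a query returns many hits, a frequency table can be easier to scan than the full concordance. The sketch below counts how often each target word occurs in a list of lemmas. `count_target_lemmas` is a hypothetical helper, and the toy lemma list stands in for something like `[token.lemma_ for token in spacy_tokens]`.

```python
from collections import Counter

def count_target_lemmas(targets, lemmas):
    """Count occurrences of each lemma that belongs to the target set."""
    return Counter(lemma for lemma in lemmas if lemma in targets)

# Toy data for illustration only.
toy_lemmas = ["the", "carriage", "and", "the", "chaise", "meet", "a", "carriage"]
counts = count_target_lemmas({"carriage", "chaise", "coach"}, toy_lemmas)
print(counts.most_common())
```

Targets that never occur (here, 'coach') are simply absent from the counter.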

Q1. Let's do a very coarse tagging of a document to find all of the mentions of a specific WordNet synset and all of its hyponyms. Using the functions above, find all of the color terms in *Pride and Prejudice*.

In [19]:
wordnet_search(wn.synset("color.n.01"), spacy_tokens)

                     he wore a blue coat , and
                    and rode a black horse . An
                   a bottle of wine a day .
                     I liked a red coat myself very
                  given to her complexion , and doubt
              till summoned to coffee . She was
                 walking , the tone of her voice
                   with a fine complexion and good -
                   , but their colour and shape ,
             Nicholls has made white soup enough ,
                        is _ a shade in a character
            reject the offered olive - branch .
                   idea of the olive - branch perhaps
                     come in a scarlet coat , and
                  in any other colour . As for
                 In a softened tone she declared herself
                . Both changed colour , one

Q2. Find all of the vehicles mentioned in *Pride and Prejudice*.

In [20]:
synsets=wn.synsets('vehicle')
for synset in synsets:
    print (synset, synset.definition())

Synset('vehicle.n.01') a conveyance that transports people or objects
Synset('vehicle.n.02') a medium for the expression or achievement of something
Synset('vehicle.n.03') any substance that facilitates the use of a drug or pigment or other material that is mixed with it
Synset('fomite.n.01') any inanimate object (as a towel or money or clothing or dishes or books or toys etc.) that can transmit infectious agents from one person to another


In [21]:
wordnet_search(wn.synset("vehicle.n.01"), spacy_tokens)

                   Monday in a chaise and four to
                    not keep a carriage , and had
                     ball in a hack chaise . ”
                     in a hack chaise . ” “
                    I have the carriage ? ” said
                Mr. Bingley 's chaise to go to
                     go in the coach . ” “
                could have the carriage . ” Elizabeth
                  , though the carriage was not to
               offered her the carriage , and she
                  offer of the chaise to an invitation
        afterwards ordered her carriage . Upon this
                     to give a flat denial , and
                in general and ordinary cases between friend
                  beg that the carriage might be sent
             possibly have the carriage before Tuesday ;
                Mr. Bingley 's carriag


Q3. Find all of the verbs of speaking in *Pride and Prejudice*.

In [22]:
synsets=wn.synsets('speak')
for synset in synsets:
    print (synset, synset.definition())

Synset('talk.v.02') express in speech
Synset('talk.v.01') exchange thoughts; talk with
Synset('speak.v.03') use language
Synset('address.v.02') give a speech to
Synset('speak.v.05') make a characteristic or natural sound


In [23]:
wordnet_search(wn.synset("speak.v.03"), spacy_tokens)

                   how can you talk so ! But
                       , as he spoke , he left
                   early , and talked of giving one
        amiable qualities must speak for themselves .
                    the room , speaking occasionally to one
               never heard you speak ill of a
                  but I always speak what I think
                 in which they spoke of the Meryton
                should meet to talk over a ball
                 saw Mr. Darcy speaking to her .
                angry at being spoke to . ”
                 that he never speaks much , unless
                 he would have talked to Mrs. Long
                  mind his not talking to Mrs. Long
             sisters not worth speaking to , a
              any intention of speaking , Miss Lucas
                  . They could talk of nothin

Q4. Find all of the people in *Pride and Prejudice*.

In [27]:
synsets=wn.synsets('name')
for synset in synsets:
    print (synset, synset.definition())

Synset('name.n.01') a language unit by which a person or thing is known
Synset('name.n.02') a person's reputation
Synset('name.n.03') family based on male descent
Synset('name.n.04') a well-known or notable person
Synset('name.n.05') by the sanction or authority of
Synset('name.n.06') a defamatory or abusive word or phrase
Synset('name.v.01') assign a specified (usually proper) proper name to
Synset('name.v.02') give the name or identifying characteristics of; refer to by name or some other identifying characteristic property
Synset('name.v.03') charge with a function; charge to be
Synset('appoint.v.01') create and charge with a task or function
Synset('name.v.05') mention and identify by name
Synset('mention.v.01') make reference to
Synset('identify.v.05') identify as in botany or biology, for example
Synset('list.v.01') give or make a list of; name individually; give the names of
Synset('diagnose.v.01') determine or distinguish the nature of a problem or an illness through a diagnost

In [28]:
wordnet_search(wn.synset("name.n.01"), spacy_tokens)

   online at www.gutenberg.org Title : Pride and
                     “ My dear Mr. Bennet , ”
                      last ? ” Mr. Bennet replied that
                       ; “ for Mrs. Long has just
                        it . ” Mr. Bennet made no
                   must know , Mrs. Long says that
                he agreed with Mr. Morris immediately ;
                   What is his name ? ” “
                     “ My dear Mr. Bennet , ”
                     of them , Mr. Bingley may like
                    go and see Mr. Bingley when he
                    I dare say Mr. Bingley will be
                         . ” “ Mr. Bennet , how
                       all . ” Mr. Bennet was so
                   . Chapter 2 Mr. Bennet was among
                 who waited on Mr. Bingley . He
                      “ I hope Mr. Bingley will lik

Q5. The methods above all identify *any* mention of a word belonging to a WordNet synset in a text, regardless of the sense in which it is used. For example, every instance of *bank* would be identified as a hit for the query bank.n.01 ("sloping land ..."), even if its specific word sense in context was the financial institution (or even a verb). How might we improve on this method?

* We could improve this method by keeping only the matches that are used as synonyms of the sense we are interested in. Synonyms can be substituted for one another without altering the truth value of a statement, so restricting hits to true synonyms in context would return only the intended sense. One way to attempt this is to compute the cosine similarity between the contextual BERT embedding of each candidate token and that of the target word used in its intended sense, and to accept a match only when the similarity exceeds some threshold.
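The thresholding idea can be sketched independently of any particular embedding model. In the example below, the vectors are made-up placeholders standing in for contextual embeddings (for instance, from BERT), and `cosine_similarity`, `keep_matching_senses`, and the 0.8 threshold are illustrative assumptions rather than tuned values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def keep_matching_senses(reference, candidates, threshold=0.8):
    """Keep the token indexes whose vector is close enough to the reference sense."""
    return [idx for idx, vector in candidates
            if cosine_similarity(reference, vector) >= threshold]

# Placeholder vectors, NOT real BERT output.
reference_vector = [0.9, 0.1, 0.0]      # target word used in the intended sense
candidate_vectors = [
    (12, [0.8, 0.2, 0.1]),              # token 12: context resembles the reference
    (57, [0.1, 0.9, 0.3]),              # token 57: context points to another sense
]
kept = keep_matching_senses(reference_vector, candidate_vectors)
print(kept)
```

In practice the threshold would need to be chosen on labelled examples, since similarity scores from contextual models are not calibrated across words.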