Dictionary Tools - WordNet
===

<div class="alert alert-info">If you only want to use the tool, jump to section <b>Part II: Script</b>, where you simply pass a seed word to the function and obtain the bag of words.</div>

In [15]:
import jupyter_black

jupyter_black.load()

## Part I: Tutorial

In [16]:
import nltk

# nltk.download('all')
from nltk.corpus import wordnet as wn

### Synsets
Synsets are the sets of possible **meanings** of a word; `synsets()` is a function of WordNet (`wn`). We first retrieve the meanings, restricted by part of speech to nouns, verbs, and adjectives.

#### List of available languages

In [17]:
wn.langs()

dict_keys(['eng', 'als', 'arb', 'bul', 'cmn', 'dan', 'ell', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha'])

In [18]:
word = "anger"
language = "eng"


meanings = wn.synsets(word, pos=wn.NOUN + wn.VERB + wn.ADJ, lang=language)

for meaning in meanings:
    print(word, meaning, meaning.definition())

anger Synset('anger.n.01') a strong emotion; a feeling that is oriented toward some real or supposed grievance
anger Synset('anger.n.02') the state of being angry
anger Synset('wrath.n.02') belligerence aroused by a real or supposed wrong (personified as one of the deadly sins)
anger Synset('anger.v.01') make angry
anger Synset('anger.v.02') become angry


### Synonyms obtained with lemmas()

In [19]:
## here we pick up the first synset of the list.
print(meanings[0].name())

synonyms = wn.synset(meanings[0].name()).lemmas(language)
synonyms

anger.n.01


[Lemma('anger.n.01.anger'),
 Lemma('anger.n.01.choler'),
 Lemma('anger.n.01.ire')]

In [20]:
## here we pick up the third lemma of the list
synonyms[2].name()

'ire'

In [21]:
# or like this
wn.lemma("anger.n.01.ire").name()

'ire'
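
The names printed above follow a dotted pattern: word, part of speech, sense number, and, for lemmas, the lemma name. As a small illustration that only relies on the strings shown in the outputs (it does not call WordNet), such a name can be split like this:

```python
# Lemma key as printed above: <word>.<pos>.<sense number>.<lemma name>
lemma_key = "anger.n.01.ire"

# Split from the right so that the last part is the lemma name
synset_name, lemma_name = lemma_key.rsplit(".", 1)

# The synset name itself splits into word, part of speech and sense number
head_word, pos, sense = synset_name.rsplit(".", 2)

print(synset_name, lemma_name)  # anger.n.01 ire
print(head_word, pos, sense)    # anger n 01
```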

### Hyponyms
Hyponyms are words that sit lower in the semantic hierarchy; for example, dogs and cats are hyponyms of animals. Below are the specific kinds of anger.

In [22]:
wn.synset("anger.n.01").hyponyms()

[Synset('annoyance.n.02'),
 Synset('bad_temper.n.01'),
 Synset('dander.n.02'),
 Synset('fury.n.01'),
 Synset('huffiness.n.01'),
 Synset('indignation.n.01'),
 Synset('infuriation.n.01'),
 Synset('umbrage.n.01')]

#### Looping down the hierarchy with `.closure()`

With `.closure()` we can follow the hyponym relation all the way down to the bottom of the hierarchy.

In [23]:
meaning = wn.synset("anger.n.01")

for item in meaning.closure(lambda s: s.hyponyms()):
    print(item)

Synset('annoyance.n.02')
Synset('bad_temper.n.01')
Synset('dander.n.02')
Synset('fury.n.01')
Synset('huffiness.n.01')
Synset('indignation.n.01')
Synset('infuriation.n.01')
Synset('umbrage.n.01')
Synset('aggravation.n.01')
Synset('displeasure.n.01')
Synset('frustration.n.03')
Synset('harassment.n.01')
Synset('pique.n.02')
Synset('fit.n.01')
Synset('irascibility.n.01')
Synset('lividity.n.01')
Synset('wrath.n.01')
Synset('dudgeon.n.01')
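
The output above lists the direct hyponyms first and their descendants afterwards, which suggests a level-by-level traversal. A minimal sketch of such a traversal on a toy hierarchy follows; `toy_hyponyms` and the helper `closure` are illustrative stand-ins written for this example, not NLTK objects.

```python
from collections import deque

# Toy hierarchy (hypothetical data): each key maps to its direct hyponyms
toy_hyponyms = {
    "animal": ["dog", "cat"],
    "dog": ["poodle", "terrier"],
    "cat": ["siamese"],
}


def closure(start, relation):
    """Yield everything reachable from `start` via `relation`, level by level."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in relation(node):
            if child not in seen:  # report each item only once
                seen.add(child)
                queue.append(child)
                yield child


descendants = list(closure("animal", lambda name: toy_hyponyms.get(name, [])))
print(descendants)  # ['dog', 'cat', 'poodle', 'terrier', 'siamese']
```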


### Hypernyms
Hypernyms are words that sit higher in the semantic hierarchy; for example, animal is a hypernym of dog and cat. Below are the direct hypernym of anger and the root of its hierarchy.

In [24]:
wn.synset("anger.n.01").hypernyms()

[Synset('emotion.n.01')]

In [25]:
wn.synset("anger.n.01").root_hypernyms()

[Synset('entity.n.01')]
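
To illustrate how a direct hypernym relates to the root, here is a small sketch that follows parent links upward in a toy hierarchy. The dictionary `toy_hypernym` and the helper `hypernym_chain` are hypothetical, and the sketch assumes a single parent per word, whereas a WordNet synset can have several hypernyms.

```python
# Toy data (hypothetical): each word maps to its single direct hypernym
toy_hypernym = {"poodle": "dog", "dog": "animal", "animal": "entity"}


def hypernym_chain(word, parent_of):
    """Return the hypernyms of `word`, from the direct one up to the root."""
    chain = []
    while word in parent_of:
        word = parent_of[word]
        chain.append(word)
    return chain


chain = hypernym_chain("poodle", toy_hypernym)
print(chain)      # ['dog', 'animal', 'entity']
print(chain[-1])  # entity (the root of the toy hierarchy)
```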

### Generate all synonyms and hyponyms of a word


In [26]:
word = "neglect"
language = "eng"

print("Generating all synonyms for:", word)

meanings = wn.synsets(word, pos=wn.NOUN + wn.VERB + wn.ADJ, lang=language)
list_of_lemmas = []

for i, meaning in enumerate(meanings):
    print(35 * "-", i + 1, 35 * "-")
    print("-", meaning.name())
    print("   - Definition:", meaning.definition())
    print("   - Lemmas:")

    for lemma in meaning.lemmas(language):
        print("     -", lemma.name())
        ## add lemma to list
        list_of_lemmas += [lemma.name()]

    print("   - Hyponyms:")

    for hyponym in meaning.closure(lambda s: s.hyponyms()):
        print("     -", hyponym.name())
        print("        - Definition:", hyponym.definition())
        print("        - Lemmas:")
        for lemma in hyponym.lemmas(language):
            print("           -", lemma.name())
            ## add lemma to list
            list_of_lemmas += [lemma.name()]

Generating all synonyms for: neglect
----------------------------------- 1 -----------------------------------
- disregard.n.01
   - Definition: lack of attention and due care
   - Lemmas:
     - disregard
     - neglect
   - Hyponyms:
     - omission.n.04
        - Definition: neglecting to do something; leaving out or passing over something
        - Lemmas:
           - omission
     - exception.n.01
        - Definition: a deliberate act of omission
        - Lemmas:
           - exception
           - exclusion
           - elision
     - oversight.n.01
        - Definition: an unintentional omission resulting from failure to notice something
        - Lemmas:
           - oversight
           - inadvertence
     - pretermission.n.01
        - Definition: letting pass without notice
        - Lemmas:
           - pretermission
----------------------------------- 2 -----------------------------------
- neglect.n.02
   - Definition: the state of something that has been unused and ne

In [27]:
# remove duplicates & sort alphabetically
set_of_lemmas = sorted([*set(list_of_lemmas)])

print("The Lemmas we collected in set_of_lemmas:\n\n", set_of_lemmas)
print("\nLength:", len(set_of_lemmas))

The Lemmas we collected in set_of_lemmas:

 ['bunk_off', 'carelessness', 'choke', 'circumvention', 'comparative_negligence', 'concurrent_negligence', 'contributory_negligence', 'criminal_negligence', 'culpable_negligence', 'cut', 'default', 'default_on', 'delinquency', 'dereliction', 'despite', 'disregard', 'disuse', 'dodging', 'drop', 'elision', 'escape', 'escape_mechanism', 'evasion', 'exception', 'exclusion', 'fail', 'fan', 'forget', 'goldbricking', 'goofing_off', 'ignore', 'inadvertence', 'jump', 'laxity', 'laxness', 'leave_out', 'lose_track', 'malingering', 'miss', 'muff', 'neglect', 'neglect_of_duty', 'neglectfulness', 'negligence', 'nonfeasance', 'nonperformance', 'omission', 'omit', 'overleap', 'overlook', 'oversight', 'pass_over', 'play_hooky', 'pretermission', 'pretermit', 'remissness', 'shirking', 'skip', 'skip_over', 'skulking', 'slack', 'slacking', 'slackness', 'soldiering', 'strike_out', 'whiff', 'willful_neglect']

Length: 67
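
The deduplication step can also be written as `sorted(set(...))`. The sketch below uses a hand-made list (`raw_lemmas` is toy data, not WordNet output) and additionally shows one possible way to turn multi-word lemma names, which use underscores as seen above, into readable phrases.

```python
# Toy list with a duplicate and two multi-word lemma names
raw_lemmas = ["neglect", "leave_out", "neglect", "pass_over", "omit"]

# set() drops the duplicates, sorted() returns a new alphabetically ordered list
unique_lemmas = sorted(set(raw_lemmas))
print(unique_lemmas)  # ['leave_out', 'neglect', 'omit', 'pass_over']

# Replace underscores to get the phrases as they would appear in running text
readable = [name.replace("_", " ") for name in unique_lemmas]
print(readable)  # ['leave out', 'neglect', 'omit', 'pass over']
```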


### Generate all synonyms and hyponyms of a word (Re-write)

The same cell, rewritten with more compact loops:

In [28]:
list_of_lemmas = []

add_to_list = lambda list1, item1: list1.append(item1)
hypos = lambda s: s.hyponyms()

meanings = wn.synsets(word, pos=wn.NOUN + wn.VERB + wn.ADJ, lang=language)

for i, meaning in enumerate(meanings):
    print(35 * "-", i + 1, 35 * "-")
    print("-", meaning.name())
    print("   - Definition:", meaning.definition())
    print("   - Lemmas:")
    ## add lemma to list
    [add_to_list(list_of_lemmas, lemma.name()) for lemma in meaning.lemmas(language)]

    print("   - Hyponyms:")

    for hyponym in meaning.closure(hypos):
        print("     -", hyponym.name())
        print("        - Definition:", hyponym.definition())
        [
            add_to_list(list_of_lemmas, lemma.name())
            for lemma in hyponym.lemmas(language)
        ]

# remove duplicates & sort alphabetically
set_of_lemmas = sorted([*set(list_of_lemmas)])

print("\nset_of_lemmas:\n\n", set_of_lemmas)
print("\nLength:", len(set_of_lemmas))

----------------------------------- 1 -----------------------------------
- disregard.n.01
   - Definition: lack of attention and due care
   - Lemmas:
   - Hyponyms:
     - omission.n.04
        - Definition: neglecting to do something; leaving out or passing over something
     - exception.n.01
        - Definition: a deliberate act of omission
     - oversight.n.01
        - Definition: an unintentional omission resulting from failure to notice something
     - pretermission.n.01
        - Definition: letting pass without notice
----------------------------------- 2 -----------------------------------
- neglect.n.02
   - Definition: the state of something that has been unused and neglected
   - Lemmas:
   - Hyponyms:
     - omission.n.02
        - Definition: something that has been omitted
----------------------------------- 3 -----------------------------------
- disregard.n.02
   - Definition: willful lack of care and attention
   - Lemmas:
   - Hyponyms:
     - despite.n.02
    

<div class="alert alert-info">The main difference from the previous approach is that here the lemmas are not printed in between, but only once at the bottom.</div>
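
The cell above appends items through a list comprehension that is used only for its side effect. A possible alternative, sketched here on toy data (`toy_meanings` is made up for the example), is `list.extend`, or a single nested comprehension that builds the flat list directly.

```python
# Toy stand-in for the lemma names of three meanings
toy_meanings = [["disregard", "neglect"], ["omission"], ["oversight", "inadvertence"]]

# Variant 1: extend the result list with all names of each meaning
collected = []
for lemma_names in toy_meanings:
    collected.extend(lemma_names)

# Variant 2: one nested comprehension that yields the same flat list
flat = [name for lemma_names in toy_meanings for name in lemma_names]

print(collected)
print(collected == flat)  # True
```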

### Exercise

Write the previous cell as a function (`def`), and use it to obtain a list of all terms related to joy.

The call should look like `generate_word_list(seed_word, language)`.

<div class="alert alert-warning">What exactly is the exercise? The function is already defined below (in 2 variations)</div>

### Create function `generate_word_list`

In [29]:
def generate_word_list(seed_word, language):

    ## we create an empty list to store the final word list
    list_of_lemmas = []

    ## a function to add a word to a list
    add_to_list = lambda list1, item1: list1.append(item1)

    ## a function to return the hyponyms of a synset
    hypos = lambda s: s.hyponyms()

    ## wn.synsets returns the meanings (synsets) of the word across the selected parts of speech
    meanings = wn.synsets(seed_word, pos=wn.NOUN + wn.VERB + wn.ADJ, lang=language)

    ## loop over set of meanings in synset
    for meaning in meanings:

        ## append all synonyms (lemmas()) of that meaning to the list_of_lemmas
        [
            add_to_list(list_of_lemmas, lemma.name())
            for lemma in meaning.lemmas(language)
        ]

        ## loop over the list of all possible hyponyms
        for hyponym in meaning.closure(hypos):

            ## append all synonyms (lemmas()) of that hyponym to the list_of_lemmas
            [
                add_to_list(list_of_lemmas, lemma.name())
                for lemma in hyponym.lemmas(language)
            ]

    ## eliminate duplicates by converting the list to a set
    set_of_lemmas = [*set(list_of_lemmas)]

    ## sort alphabetically
    set_of_lemmas.sort()

    ##length
    length = len(set_of_lemmas)

    return (list_of_lemmas, length)


list_of_lemmas, length = generate_word_list("anger", "eng")

<div class="alert-warning alert">Why is list_of_lemmas (with its potential duplicates) returned and not set_of_lemmas? In the following function I changed it to set_of_lemmas.</div>

### Re-write function `generate_word_list`

To be able to select among the meanings later, we also collect the list of synsets.

The difference from the previous function is that this one additionally returns `list_of_meanings`.

In [30]:
def generate_word_list(seed_word, language):

    ## we create an empty list to store the final word list
    list_of_lemmas = []
    list_of_meanings = []

    add_to_list = lambda list1, item1: list1.append(item1)

    ## a function to return the hyponyms of a synset
    hypos = lambda s: s.hyponyms()

    ## wn.synsets returns the meanings (synsets) of the word across the selected parts of speech
    meanings = wn.synsets(seed_word, pos=wn.NOUN + wn.VERB + wn.ADJ, lang=language)

    ## loop over set of meanings in synset
    for meaning in meanings:

        list_of_meanings += [
            [
                meaning,
                meaning.definition(),
                [lemma.name() for lemma in meaning.lemmas(language)],
            ]
        ]

        ## append all synonyms (lemmas()) of that meaning to the list_of_lemmas
        [
            add_to_list(list_of_lemmas, lemma.name())
            for lemma in meaning.lemmas(language)
        ]

        ## loop over the list of all possible hyponyms
        for hyponym in meaning.closure(hypos):

            list_of_meanings += [
                [
                    hyponym,
                    hyponym.definition(),
                    [lemma.name() for lemma in hyponym.lemmas(language)],
                ]
            ]

            ## append all synonyms (lemmas()) of that hyponym to the list_of_lemmas
            [
                add_to_list(list_of_lemmas, lemma.name())
                for lemma in hyponym.lemmas(language)
            ]

    # remove duplicates and sort
    set_of_lemmas = sorted([*set(list_of_lemmas)])

    ##length
    length = len(set_of_lemmas)

    return (set_of_lemmas, length, list_of_meanings)


set_of_lemmas, length, list_of_meanings = generate_word_list(word, language)

<div class="alert alert-warning">What was a bit confusing here was that initially the function didn't return the set, but the list of lemmas (with potential duplicates). I changed this here so that the function returns the set.</div>

### Using `input()` 

We can use the Python built-in function `input()` to read a line typed by the user and react to it:

In [31]:
while True:
    x = input()
    if x == "y":
        print("Keep!")
        break
    elif x == "n":
        print("Ditch!")
        break
    else:
        print("Please press y or n!")

 y


Keep!
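
Because `input()` blocks until someone types, a loop like the one above is hard to try out automatically. One possible variation, sketched below, passes the reading function in as a parameter so that scripted answers can stand in for the keyboard; the function name `ask_yes_no` and the parameter `read_line` are choices made for this example.

```python
def ask_yes_no(read_line=input):
    """Keep asking until the answer is 'y' or 'n'; return True for 'y'."""
    while True:
        answer = read_line().strip().lower()
        if answer == "y":
            return True
        if answer == "n":
            return False
        print("Please press y or n!")


# Simulate a user who first types something invalid and then "y"
scripted_answers = iter(["maybe", "y"])
keep = ask_yes_no(read_line=lambda: next(scripted_answers))
print("Keep!" if keep else "Ditch!")
```

In a notebook, calling `ask_yes_no()` without arguments falls back to the real `input()`.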


### Try the same with `ipywidgets` 


In [32]:
import ipywidgets as widgets
from itertools import compress


selection_list = ["Item A", "Item B", "Item C", "Item D", "Item E"]

selection_widget = widgets.VBox(
    [
        widgets.Checkbox(value=True, description=item, disabled=False, indent=False)
        for item in selection_list
    ]
)

selection_widget

VBox(children=(Checkbox(value=True, description='Item A', indent=False), Checkbox(value=True, description='Ite…

In [33]:
checked_list = list(
    compress(selection_list, [widget.value for widget in selection_widget.children])
)

checked_list

['Item A', 'Item B', 'Item C', 'Item D', 'Item E']
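
`itertools.compress` keeps exactly those items whose selector is true. Since all checkboxes above are ticked, the effect is easier to see with a hand-written list of states; the list `checked_states` below is made up and stands in for the checkbox values.

```python
from itertools import compress

items = ["Item A", "Item B", "Item C", "Item D", "Item E"]

# Hypothetical checkbox states: only the first and third box are ticked
checked_states = [True, False, True, False, False]

kept = list(compress(items, checked_states))
print(kept)  # ['Item A', 'Item C']
```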

#### Why we need this

We need such a selection step to let the user decide which meanings to keep and which to remove. The `select_meanings` function defined in Part II iterates over the list of meanings and asks us to type 1 for each meaning we want to keep; any other input discards it. Its input is a list of meanings.

#### Alternative approach with widgets

In [35]:
selection_widget = widgets.VBox(
    [
        widgets.VBox(
            [
                widgets.Checkbox(
                    value=True,
                    description=item[0].name(),
                    indent=False,
                ),
                widgets.Label(
                    item[1],
                    layout=widgets.Layout(
                        margin="0 0 0 25px",
                    ),
                ),
                widgets.Label(
                    str(item[2]),
                    layout=widgets.Layout(
                        margin="0 0 0 25px",
                    ),
                ),
            ]
        )
        for item in list_of_meanings
    ]
)

selection_widget

VBox(children=(VBox(children=(Checkbox(value=True, description='disregard.n.01', indent=False), Label(value='l…

Create final list from checkbox states

In [36]:
filtered_list = list(
    compress(
        list_of_meanings,
        [widget.children[0].value for widget in selection_widget.children],
    )
)

filtered_list

[[Synset('disregard.n.01'),
  'lack of attention and due care',
  ['disregard', 'neglect']],
 [Synset('omission.n.04'),
  'neglecting to do something; leaving out or passing over something',
  ['omission']],
 [Synset('exception.n.01'),
  'a deliberate act of omission',
  ['exception', 'exclusion', 'elision']],
 [Synset('oversight.n.01'),
  'an unintentional omission resulting from failure to notice something',
  ['oversight', 'inadvertence']],
 [Synset('pretermission.n.01'),
  'letting pass without notice',
  ['pretermission']],
 [Synset('neglect.n.02'),
  'the state of something that has been unused and neglected',
  ['neglect', 'disuse']],
 [Synset('omission.n.02'), 'something that has been omitted', ['omission']],
 [Synset('disregard.n.02'),
  'willful lack of care and attention',
  ['disregard', 'neglect']],
 [Synset('despite.n.02'), 'contemptuous disregard', ['despite']],
 [Synset('negligence.n.02'),
  'the trait of neglecting responsibilities and lacking concern',
  ['negligence'

## Part II: Script to generate bag of words

Let's put it all together so that we can input a seed word and have the pipeline run automatically.

<div class="alert alert-info">Define functions and import again</div>

In [39]:
import nltk
import ipywidgets as widgets
from itertools import compress

# nltk.download("all")
from nltk.corpus import wordnet as wn


def generate_word_list(seed_word, language):
    """Returns the lemmas, the number of unique lemmas, and the list of meanings"""

    ## we create an empty list to store the final word list
    list_of_lemmas = []
    list_of_meanings = []

    ## a function to add a word to a list
    add_to_list = lambda list1, item1: list1.append(item1)

    ## a function to return the hyponyms of a synset
    hypos = lambda s: s.hyponyms()

    ## wn.synsets returns the meanings (synsets) of the word across the selected parts of speech
    meanings = wn.synsets(seed_word, pos=wn.NOUN + wn.VERB + wn.ADJ, lang=language)

    ## loop over set of meanings in synset
    for meaning in meanings:

        ## print the definition of that meaning
        # print(meaning, meaning.definition(), [lemma.name() for lemma in meaning.lemmas(language)])
        list_of_meanings += [
            [
                meaning,
                meaning.definition(),
                [lemma.name() for lemma in meaning.lemmas(language)],
            ]
        ]

        ## append all synonyms (lemmas()) of that meaning to the list_of_lemmas
        [
            add_to_list(list_of_lemmas, lemma.name())
            for lemma in meaning.lemmas(language)
        ]

        ## loop over the list of all possible hyponyms
        for hyponym in meaning.closure(hypos):

            ## print the definition of each hyponym
            # print(hyponym, hyponym.definition(), [lemma.name() for lemma in hyponym.lemmas(language)])
            list_of_meanings += [
                [
                    hyponym,
                    hyponym.definition(),
                    [lemma.name() for lemma in hyponym.lemmas(language)],
                ]
            ]

            ## append all synonyms (lemmas()) of that hyponym to the list_of_lemmas
            [
                add_to_list(list_of_lemmas, lemma.name())
                for lemma in hyponym.lemmas(language)
            ]

    ## remove duplicates and sort alphabetically
    set_of_lemmas = sorted(set(list_of_lemmas))

    ## number of unique lemmas
    length = len(set_of_lemmas)

    return (set_of_lemmas, length, list_of_meanings)


def select_meanings(list_meanings):
    """
    Decide which meanings to keep and which to remove.
    Iterates over the list of meanings; for each meaning we are asked
    to type 1 to keep it, and any other input discards it.
    The input is a list of meanings.
    """

    selected_word_list = []

    print("the list has:", len(list_meanings), "items")

    count = 0

    for item in list_meanings:

        count += 1
        print(count, item)
        print("type 1 if the meaning is adequate, 0 otherwise")

        x = input()

        if x == "1":
            try:
                selected_word_list += item[2]
            except TypeError:
                continue

        else:
            continue

    selected_word_list = sorted([*set(selected_word_list)])

    return selected_word_list


def generate_and_filter(seed_word, language):
    """Wrapper function"""

    list_of_synsets = generate_word_list(seed_word, language)[2]
    filtered_list = select_meanings(list_of_synsets)

    return filtered_list

Run it with some seed word

In [40]:
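# build the list of meanings for the seed word (the same word as in Part I)
list_of_meanings = generate_word_list("neglect", "eng")[2]

selection_widget = widgets.VBox(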
    [
        widgets.VBox(
            [
                widgets.Checkbox(
                    value=True,
                    description=item[0].name(),
                    indent=False,
                ),
                widgets.Label(
                    item[1],
                    layout=widgets.Layout(
                        margin="0 0 0 25px",
                    ),
                ),
                widgets.Label(
                    str(item[2]),
                    layout=widgets.Layout(
                        margin="0 0 0 25px",
                    ),
                ),
            ]
        )
        for item in list_of_meanings
    ]
)

selection_widget

VBox(children=(VBox(children=(Checkbox(value=True, description='disregard.n.01', indent=False), Label(value='l…

Create final list from checkbox states

In [44]:
bag_of_words = list(
    compress(
        list_of_meanings,
        [widget.children[0].value for widget in selection_widget.children],
    )
)

bag_of_words

[[Synset('disregard.n.01'),
  'lack of attention and due care',
  ['disregard', 'neglect']],
 [Synset('omission.n.04'),
  'neglecting to do something; leaving out or passing over something',
  ['omission']],
 [Synset('exception.n.01'),
  'a deliberate act of omission',
  ['exception', 'exclusion', 'elision']],
 [Synset('oversight.n.01'),
  'an unintentional omission resulting from failure to notice something',
  ['oversight', 'inadvertence']],
 [Synset('pretermission.n.01'),
  'letting pass without notice',
  ['pretermission']],
 [Synset('neglect.n.02'),
  'the state of something that has been unused and neglected',
  ['neglect', 'disuse']],
 [Synset('omission.n.02'), 'something that has been omitted', ['omission']],
 [Synset('disregard.n.02'),
  'willful lack of care and attention',
  ['disregard', 'neglect']],
 [Synset('despite.n.02'), 'contemptuous disregard', ['despite']],
 [Synset('negligence.n.02'),
  'the trait of neglecting responsibilities and lacking concern',
  ['negligence'

---
## Find and count words in a given text that are also in the bag of words

<div class="alert alert-info">This section is just for illustration; you can jump directly to the next heading "Exercise: Function to calculate the frequency ..."</div>

In [45]:
print("bag of words:\n", bag_of_words)

bag of words:
 [[Synset('disregard.n.01'), 'lack of attention and due care', ['disregard', 'neglect']], [Synset('omission.n.04'), 'neglecting to do something; leaving out or passing over something', ['omission']], [Synset('exception.n.01'), 'a deliberate act of omission', ['exception', 'exclusion', 'elision']], [Synset('oversight.n.01'), 'an unintentional omission resulting from failure to notice something', ['oversight', 'inadvertence']], [Synset('pretermission.n.01'), 'letting pass without notice', ['pretermission']], [Synset('neglect.n.02'), 'the state of something that has been unused and neglected', ['neglect', 'disuse']], [Synset('omission.n.02'), 'something that has been omitted', ['omission']], [Synset('disregard.n.02'), 'willful lack of care and attention', ['disregard', 'neglect']], [Synset('despite.n.02'), 'contemptuous disregard', ['despite']], [Synset('negligence.n.02'), 'the trait of neglecting responsibilities and lacking concern', ['negligence', 'neglect', 'neglectfulne
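
As the printout shows, each entry of `bag_of_words` is a `[synset, definition, lemmas]` list rather than a single word, so testing whether a word is in the bag requires flattening the lemma names first. A minimal sketch with toy data follows; `toy_bag` uses plain strings in place of `Synset` objects.

```python
# Toy stand-in for bag_of_words: [synset name, definition, lemma names]
toy_bag = [
    ["disregard.n.01", "lack of attention and due care", ["disregard", "neglect"]],
    ["omission.n.04", "neglecting to do something", ["omission"]],
]

# A word never equals a whole entry, so this membership test is always False
print("neglect" in toy_bag)  # False

# Collect the lemma names of all entries into one flat set of words
bag_of_lemmas = {lemma for _, _, lemmas in toy_bag for lemma in lemmas}
print("neglect" in bag_of_lemmas)  # True
```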

Text to analyze

In [46]:
text = """
            john threw a tantrum so big that made his mother explode with fury, not only
            towards him but also towards the exasperating gods who angered her every day
       """

tokenized_text = nltk.word_tokenize(text)

# get frequency distributions
freqs = nltk.FreqDist(tokenized_text)

# get the count of each word in the text
word_counts = freqs.most_common()

print("word counts:\n", word_counts)

word counts:
 [('towards', 2), ('john', 1), ('threw', 1), ('a', 1), ('tantrum', 1), ('so', 1), ('big', 1), ('that', 1), ('made', 1), ('his', 1), ('mother', 1), ('explode', 1), ('with', 1), ('fury', 1), (',', 1), ('not', 1), ('only', 1), ('him', 1), ('but', 1), ('also', 1), ('the', 1), ('exasperating', 1), ('gods', 1), ('who', 1), ('angered', 1), ('her', 1), ('every', 1), ('day', 1)]


In [47]:
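bag_of_words_count = 0

## bag_of_words holds [synset, definition, lemmas] entries,
## so collect the lemma names into a flat set of words first
bag_of_lemmas = {lemma for item in bag_of_words for lemma in item[2]}

## loop over the words in the frequency distribution list of the text
for word, count in word_counts:

    ## if the word is in the bag of words, add its count to the counter
    if word in bag_of_lemmas:

        print(word)
        bag_of_words_count += count

print(bag_of_words_count)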

0


## Exercise: Function to calculate the frequency of any bag of words in any text

In [48]:
def get_bag_of_words_count(text, bow):
    """
    Returns how many tokens of a given text are in a given bag of words.
    `bow` must be a flat collection of word strings (e.g. lemma names).
    """

    tokenized_text = nltk.word_tokenize(text)
    freqs = nltk.FreqDist(tokenized_text)
    word_counts = freqs.most_common()

    bow_count = 0

    for wc in word_counts:
        word = wc[0]
        count = wc[1]

        if word in bow:
            bow_count += count

    return bow_count

Run function with some text

In [49]:
text = """
            john threw a tantrum so big that made his mother explode with fury, not only
            towards him but also towards the exasperating gods who angered her every day
       """

## flatten the [synset, definition, lemmas] entries into a set of lemma names
bag_of_lemmas = {lemma for item in bag_of_words for lemma in item[2]}

print(
    "The text contains",
    get_bag_of_words_count(text, bag_of_lemmas),
    "words from the bag of words.",
)

The text contains 0 words from the bag of words.


In [50]:
text = """
            at first, the bright day made him feel alive, and then dead, then alive again,
            ultimately all was an illusion, including the inappropriate love that he now
            felt for his sister
        """

print(
    "The text contains",
    get_bag_of_words_count(text, bag_of_lemmas),
    "words from the bag of words.",
)

The text contains 0 words from the bag of words.
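
For comparison, the same counting idea can be sketched with the standard library only. This version is an assumption-laden simplification: it replaces `nltk.word_tokenize` with a plain regular expression, the helper name `count_bag_words` is made up, and it matches exact word forms only, so an inflected form such as "angered" does not count as "anger".

```python
import re
from collections import Counter


def count_bag_words(text, bag):
    """Count how many tokens of `text` appear in the flat word collection `bag`."""
    # Simplified tokenizer (assumption): lowercase runs of letters, underscores, apostrophes
    tokens = re.findall(r"[a-z_']+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that do not occur in the text
    return sum(counts[word] for word in set(bag))


sample = "John exploded with fury; the fury angered everyone."
total = count_bag_words(sample, ["fury", "anger"])
print(total)  # 2 ("fury" twice; "angered" is not an exact match for "anger")
```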


## Gold-standard tool for the bag-of-words approach

https://www.liwc.app/


<div class="alert alert-warning">Unfortunately, the demo on the page doesn't seem to work.</div>