# Wordnet sandbox

## Introduction

In Real Life you’ll export the words you care about from your XML using XSLT and then read the list into your Python program, and we’ll talk about how to do that below. To start, though, let’s concentrate on learning how Wordnet works. We’re writing this in an interface that allows us to break up the code into pieces, which means that in order to run the statements at the bottom of the page, you need to have run at least some of the ones at the top. For example, we import Wordnet at the beginning with `from nltk.corpus import wordnet as wn`, and later code depends on our having done that. If you copy and try to run something below without having done the import, you’ll throw an error.

## Setup

Create a list of sample words, and examine their synsets. We’ve included four spooky words plus one non-spooky control item.

In [104]:
from nltk.corpus import wordnet as wn # import Wordnet and call it just 'wn' for brevity
words = ['scare', 'ghost', 'fright', 'spook', 'koala'] # create a list of words to examine
synset_list = [wn.synsets(word) for word in words] # get the synsets for each word
synset_list # display it

[[Synset('panic.n.02'),
  Synset('scare.n.02'),
  Synset('frighten.v.01'),
  Synset('daunt.v.01')],
 [Synset('ghost.n.01'),
  Synset('ghostwriter.n.01'),
  Synset('ghost.n.03'),
  Synset('touch.n.03'),
  Synset('ghost.v.01'),
  Synset('haunt.v.02'),
  Synset('ghost.v.03')],
 [Synset('fear.n.01'), Synset('frighten.v.01')],
 [Synset('creep.n.01'), Synset('ghost.n.01'), Synset('spook.v.01')],
 [Synset('koala.n.01')]]

## Choose the correct synset

This part requires human analysis, since although the program can recognize the words, it can’t tell which of the possible meanings a word has at a particular location. If you’re working with a word that isn’t in Wordnet, make a note of that, but there isn’t anything else that you can do, since you can’t add words to Wordnet. Note that the same word may represent different synsets in different locations. For example, ‘scare’ could be a noun in one place and a verb in a different place, and those are different synsets.

### First get the definitions of each synset for each word

We use a counter (`i`) so that we can see easily which synsets go with the same word:

In [105]:
for i, synsets in enumerate(synset_list): # i is the position of the word in our list
    for item in synsets: # print one line per synset for that word
        print(i, item, item.definition())

0 Synset('panic.n.02') sudden mass fear and anxiety over anticipated events
0 Synset('scare.n.02') a sudden attack of fear
0 Synset('frighten.v.01') cause fear in
0 Synset('daunt.v.01') cause to lose courage
1 Synset('ghost.n.01') a mental representation of some haunting experience
1 Synset('ghostwriter.n.01') a writer who gives the credit of authorship to someone else
1 Synset('ghost.n.03') the visible disembodied soul of a dead person
1 Synset('touch.n.03') a suggestion of some quality
1 Synset('ghost.v.01') move like a ghost
1 Synset('haunt.v.02') haunt like a ghost; pursue
1 Synset('ghost.v.03') write for someone else
2 Synset('fear.n.01') an emotion experienced in anticipation of some specific pain or danger (usually accompanied by a desire to flee or fight)
2 Synset('frighten.v.01') cause fear in
3 Synset('creep.n.01') someone unpleasantly strange or eccentric
3 Synset('ghost.n.01') a mental representation of some haunting experience
3 Synset('spook.v.01') frighten or scare, and 

### Choose the appropriate synset for each spooky word _in context_

Look at your XML and choose the appropriate synset for each word _in context_. For example, if ‘scare’ occurs as a verb that means ‘cause fear in’ in one place, the synset you’d choose from above would be `frighten.v.01`. If it occurs as a noun that means ‘a sudden attack of fear’ in another, you’d choose `scare.n.02`.

### Write the synset information back into the XML

You can’t write this back into the XML automatically because the same word form in the XML might belong to different synsets in different locations (like the use of ‘scare’ as a verb or as a noun, described above). For that reason, you’ll want to add the synset value manually to the tagged words in your XML. For example, if you have:

    <p>He <spooky_word>scared</spooky_word> them.</p>

You would expand the markup to:

    <p>He <spooky_word synset="frighten.v.01">scared</spooky_word> them.</p>

The easiest way to add this type of markup is to load the document into &lt;oXygen/&gt; and do a search and replace to search for the string 

    <spooky_word

and replace it with

    <spooky_word synset=""

This will write the `@synset` attribute into the start tag with an empty value, and you can then use the XPath browser box to find all `<spooky_word>` elements and type in the attribute values. You’ll want to modify your schema so that this new attribute will be valid.
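Because the values are typed in by hand, it’s easy to miss one. The following is a minimal sketch of a check you could run afterwards, assuming your markup looks like the examples above; the sample string and the function name `untagged_words` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second word still has the empty attribute
# left behind by the search-and-replace step.
sample = """<root>
    <p>He <spooky_word synset="frighten.v.01">scared</spooky_word> them
    with a <spooky_word synset="">scare</spooky_word>.</p>
</root>"""

def untagged_words(xml_text):
    """Return the text of every <spooky_word> whose @synset is missing or empty."""
    root = ET.fromstring(xml_text)
    return [
        "".join(word.itertext())
        for word in root.iter("spooky_word")
        if not word.get("synset", "").strip()
    ]

print(untagged_words(sample))
```

An empty list means that every tagged word has a synset value. It doesn’t tell you whether the value is the right one; that still takes a human reader.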

## Examine the lemmata for each synset

At the moment this is just for curiosity.

In [107]:
scare_synsets = [wn.synset('scare.n.02'), wn.synset('frighten.v.01')]
for synset in scare_synsets:
    print(str(synset) + ' has the following lemmata: ' + str([lemma.name() for lemma in synset.lemmas()]))

Synset('scare.n.02') has the following lemmata: ['scare', 'panic_attack']
Synset('frighten.v.01') has the following lemmata: ['frighten', 'fright', 'scare', 'affright']


## Tasks

### Explore lexical ambiguity

Word forms in your text will belong to zero or more synsets, although an occurrence of a word form will belong to only one synset in a particular context. You can quantify the degree of ambiguity, and thus the extent to which the meaning of the word depends on context, by retrieving the number of synsets for each word form in your data. Here’s how to do that:

In [95]:
for word in words:
    synset_count = len(wn.synsets(word))
    print('Word "' + word + '" belongs to ' + str(synset_count) + ' synsets')

Word "scare" belongs to 4 synsets
Word "ghost" belongs to 7 synsets
Word "fright" belongs to 2 synsets
Word "spook" belongs to 3 synsets
Word "koala" belongs to 1 synsets


The preceding is fine for humans, but we want to write these counts back into our XML. We can do that automatically in three steps:

1. Use XSLT to export a plain text list of words you’ve tagged (e.g., spooky words) from your XML data files.
1. Use Python to create an XML auxiliary document that maps each of those words to its synset count. The Python script will read the exported plain text list, use Wordnet to count the number of synsets associated with each of them, and write the word plus the count into an XML document.
1. Use an XSLT _identity transformation_ to write the synset count into the XML as new content. Your XSLT transformation will transform each of your XML data files to itself (that is, the output will be identical to the input), except that it will insert an additional attribute that includes the count of synsets associated with the word form.

Here’s how that works:

#### Step 1: Export a plain text list of words you’ve tagged (e.g., spooky words)

Here’s some original sample XML:

    <root>
        <p>The <spooky_word>ghost</spooky_word> <spooky_word>scared</spooky_word> 
        him by giving him a <spooky_word>scare</spooky_word>.</p>
    </root>

We manually add the synset markup:

    <root>
        <p>The <spooky_word synset="ghost.n.03">ghost</spooky_word> 
        <spooky_word synset="frighten.v.01">scared</spooky_word> 
        him by giving him a 
        <spooky_word synset="scare.n.02">scare</spooky_word>.</p>
    </root>

We then run the following XSLT transformation, outputting plain text:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
        <xsl:output method="text" indent="yes"/>
        <xsl:template match="/">
            <xsl:apply-templates select="//spooky_word"/>
        </xsl:template>
        <xsl:template match="spooky_word">
            <xsl:value-of select="concat(.,'&#x0a;')"/>
        </xsl:template>
    </xsl:stylesheet>

Note that the value of the `@method` attribute on the `<xsl:output>` element is "text" because we’re creating plain text. We apply templates to the `<spooky_word>` elements, and in the template that matches those elements, we output the value of concatenating the content of the element (the word itself) with a new line character (spelled `&#x0a;`, which is the _numerical character reference_ for a new line). The output looks like:

    ghost
    scared
    scare

We can save that to a file (let’s call it “spooky_words.txt”), so that we can access it later with Python.
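If you want to check what the XSLT should produce, you can get the same list with Python’s standard library. This is only a sketch for comparison, not a replacement for the transformation above; it reads the sample from a string rather than from one of your files.

```python
import xml.etree.ElementTree as ET

sample = """<root>
    <p>The <spooky_word synset="ghost.n.03">ghost</spooky_word>
    <spooky_word synset="frighten.v.01">scared</spooky_word>
    him by giving him a
    <spooky_word synset="scare.n.02">scare</spooky_word>.</p>
</root>"""

root = ET.fromstring(sample)
# itertext() gathers all of the text inside an element, much like the
# string value that XPath uses when we write concat(., '&#x0a;')
words = ["".join(el.itertext()) for el in root.iter("spooky_word")]
exported = "".join(word + "\n" for word in words)  # one word per line
print(exported, end="")
```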

#### Step 2: Access that file with Python and create a new XML file that maps each word form to its synset count

In [96]:
infile = open('spooky_words.txt','r') # open the list of spooky words that we just exported for reading
wordlist = infile.read().split() # get the words from the file
infile.close() # close the input file, since we’ve read it all
outfile = open('synset_counts.xml','w') # open a file for writing to hold the output
outfile.write('<root>') # create a start tag for the root element in the output XML file
for word in wordlist: # create output for each word
    synset_count = len(wn.synsets(word)) # for each word, count the number of synsets to which it belongs
    outfile.write('<word><form>' + word + '</form><count>' + str(synset_count) + '</count></word>') # write it out
outfile.write('</root>') # create the end tag for the root element
outfile.close() # close the output file, since we’ve written all the output

We saved the output to a file, so we don’t see it here in the notebook, but we can now read it. This is just for human inspection, to make sure that it looks the way we want. It isn’t pretty-printed, but we can still see how it looks:

In [97]:
with open('synset_counts.xml') as infile:
    print(infile.read())

<root><word><form>ghost</form><count>7</count></word><word><form>scared</form><count>4</count></word><word><form>scare</form><count>4</count></word></root>
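One thing to watch for: the script above writes one `<word>` element for every item in the exported list, so a word that occurs twice in your text yields two entries, and the lookup in Step 3 would then find two counts for that word. It also writes the word without escaping characters like `&`. Here is a hedged sketch of one way to handle both. The function name `build_counts_xml` and its `count_synsets` parameter are our own inventions, and the counts come from a small stand-in dictionary so that the example runs without Wordnet.

```python
from xml.sax.saxutils import escape

def build_counts_xml(words, count_synsets):
    """Return the auxiliary XML as a string, with one <word> per distinct form."""
    parts = ['<root>']
    for word in dict.fromkeys(words):  # drop repeats, keep first-seen order
        parts.append(
            '<word><form>' + escape(word) + '</form><count>'
            + str(count_synsets(word)) + '</count></word>'
        )
    parts.append('</root>')
    return ''.join(parts)

# Stand-in counts copied from the output above (an assumption for this demo);
# with NLTK loaded you might pass lambda word: len(wn.synsets(word)) instead.
sample_counts = {'ghost': 7, 'scared': 4, 'scare': 4}
result = build_counts_xml(['ghost', 'scared', 'scare', 'ghost'], sample_counts.get)
print(result)
```

Even though ‘ghost’ appears twice in the input list, it appears only once in the result.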


#### Step 3: To write the counts back into the XML, use an _identity transformation_, reading in the new count file with the XPath `document()` function

Assume that we’ve saved our original XML (with the synsets, but without the counts) as original.xml. It looks like:

    <root>
        <p>The <spooky_word synset="ghost.n.03">ghost</spooky_word> 
        <spooky_word synset="frighten.v.01">scared</spooky_word> 
        him by giving him a 
        <spooky_word synset="scare.n.02">scare</spooky_word>.</p>
    </root>

Transform it with the following XSLT:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="xs"
        version="2.0">
        <xsl:variable name="count_file" as="document-node()" select="document('synset_counts.xml')"/>
        <xsl:template match="node()|@*">
            <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="spooky_word">
            <xsl:copy>
                <xsl:attribute name="synset_count" select="$count_file//word[form eq current()]/count"/>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>

The `document()` function opens synset_counts.xml (which we created with Python in Step #2) so that we can access it (using the variable name `$count_file`) while we’re transforming original.xml. The first template is an _identity transformation_, which you can read about at https://en.wikipedia.org/wiki/Identity_transform. When you perform an identity transformation, the identity template transforms everything to itself (that is, the output is a copy of the input), except that you write separate templates only for the bits that you want to change. In this case, we’re changing `<spooky_word>` elements to add a new `@synset_count` attribute, inserting the value it copies from the auxiliary file that we created with Python in the preceding step.

Here’s the output of that last transformation:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <p>The <spooky_word synset_count="7" synset="ghost.n.03">ghost</spooky_word> 
            <spooky_word synset_count="4" synset="frighten.v.01">scared</spooky_word> 
            him by giving him a 
            <spooky_word synset_count="4" synset="scare.n.02">scare</spooky_word>.</p>
    </root>
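If the lookup logic in the XSLT is hard to follow, it may help to see the same idea spelled out in Python. This is a rough sketch for illustration only: it works on strings instead of files, and it isn’t a true identity transformation, since `ElementTree` may serialize some details differently from your input.

```python
import xml.etree.ElementTree as ET

counts_xml = ('<root><word><form>ghost</form><count>7</count></word>'
              '<word><form>scared</form><count>4</count></word>'
              '<word><form>scare</form><count>4</count></word></root>')
original_xml = """<root>
    <p>The <spooky_word synset="ghost.n.03">ghost</spooky_word>
    <spooky_word synset="frighten.v.01">scared</spooky_word>
    him by giving him a
    <spooky_word synset="scare.n.02">scare</spooky_word>.</p>
</root>"""

# Build a form -> count lookup from the auxiliary document
counts = {w.findtext('form'): w.findtext('count')
          for w in ET.fromstring(counts_xml).iter('word')}

root = ET.fromstring(original_xml)
for el in root.iter('spooky_word'):
    form = ''.join(el.itertext())
    if form in counts:  # leave words without a count untouched
        el.set('synset_count', counts[form])

print(ET.tostring(root, encoding='unicode'))
```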

We can then calculate the extent of ambiguity for the entire document or for each individual paragraph. We might decide that the ambiguity of a paragraph is the average of all of the `@synset_count` values in that paragraph, so that for the sole paragraph here it would be 5, that is, the sum of the three values (15) divided by the number of values (3). We could graph this with SVG to examine whether there’s a pattern to the ambiguity, that is, whether it’s higher in some locations of the story than in others. We could also look for correlations between, say, the number of spooky words and the degree of ambiguity. Or we could compare stories or authors to see whether there is any regularity or other pattern in the ambiguity.
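The per-paragraph average is simple arithmetic, and here’s a small sketch of it in Python, assuming paragraphs are tagged as `<p>` as in our sample. The function name `paragraph_ambiguity` is hypothetical, and returning `None` for a paragraph with no spooky words is just one possible choice.

```python
import xml.etree.ElementTree as ET

sample = """<root>
    <p>The <spooky_word synset_count="7" synset="ghost.n.03">ghost</spooky_word>
    <spooky_word synset_count="4" synset="frighten.v.01">scared</spooky_word>
    him by giving him a
    <spooky_word synset_count="4" synset="scare.n.02">scare</spooky_word>.</p>
</root>"""

def paragraph_ambiguity(xml_text):
    """Return the mean @synset_count for each <p>, or None where there is nothing to average."""
    averages = []
    for p in ET.fromstring(xml_text).iter('p'):
        values = [int(w.get('synset_count'))
                  for w in p.iter('spooky_word')
                  if w.get('synset_count')]  # skip words without a count
        averages.append(sum(values) / len(values) if values else None)
    return averages

print(paragraph_ambiguity(sample))  # [5.0], that is, (7 + 4 + 4) / 3
```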

### Determine the number of representations of each synset in each document

You can use XSLT to determine which synsets are favored in which texts or by which authors or at which periods. Consider the following input document:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <p>The <spooky_word synset="ghost.n.03">ghost</spooky_word>
            <spooky_word synset="frighten.v.01">scared</spooky_word> and <spooky_word
                synset="frighten.v.01">frightened</spooky_word> him by giving him a <spooky_word
                synset="scare.n.02">scare</spooky_word>.</p>
    </root>

This has four spooky words representing three different synsets. We can count the number of occurrences of each synset using XSLT:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
        <xsl:output method="xml" indent="yes"/>
        <xsl:variable name="root" select="/"/>
        <xsl:template match="/">
            <data>
                <xsl:for-each select="distinct-values(//spooky_word/@synset)">
                    <synset_count>
                        <synset>
                            <xsl:value-of select="current()"/>
                        </synset>
                        <count>
                            <xsl:value-of select="count($root//spooky_word[@synset eq current()])"/>
                        </count>

                    </synset_count>
                </xsl:for-each>
            </data>
        </xsl:template>
    </xsl:stylesheet>

We set a variable called `$root` because when we do `<xsl:for-each>` over distinct values we cut ourselves off from the tree, so if we want to get back to it, we need to access it through that variable. Here we get each distinct `@synset` value and count the number of `<spooky_word>` elements that have a `@synset` attribute with that value. In this case the output is:

    <?xml version="1.0" encoding="UTF-8"?>
    <data>
       <synset_count>
          <synset>ghost.n.03</synset>
          <count>1</count>
       </synset_count>
       <synset_count>
          <synset>frighten.v.01</synset>
          <count>2</count>
       </synset_count>
       <synset_count>
          <synset>scare.n.02</synset>
          <count>1</count>
       </synset_count>
    </data>

We could transform that to HTML or SVG for display. The counts let us ask: do some works or authors show a preference for certain synset expressions of spookiness?
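As a cross-check on the XSLT, the same tally can be sketched in a few lines of Python with `collections.Counter`. The input is the four-word sample from above, held in a string for the sake of the example.

```python
from collections import Counter
import xml.etree.ElementTree as ET

sample = """<root>
    <p>The <spooky_word synset="ghost.n.03">ghost</spooky_word>
    <spooky_word synset="frighten.v.01">scared</spooky_word> and
    <spooky_word synset="frighten.v.01">frightened</spooky_word> him by giving him a
    <spooky_word synset="scare.n.02">scare</spooky_word>.</p>
</root>"""

# Count how many <spooky_word> elements carry each @synset value
synset_counts = Counter(
    el.get('synset') for el in ET.fromstring(sample).iter('spooky_word')
)
for synset, count in synset_counts.most_common():  # most frequent first
    print(synset, count)
```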

### Explore the richness of the expression of spookiness

Since we’ve already assigned a synset to each spooky word in our text, we can count the number of different synsets in the text. Do some writers represent spookiness with a greater range of spooky-related meanings, that is, with more synsets, than other writers? Because texts may be of different length, we might want not just to count the number of different synsets, but to express the value as the result of dividing the number of distinct synsets by the number of spooky word instances. We can do that with XSLT and write the result into the document as metadata, performing another identity transformation and this time just adding the ratio in a new element. Assume our input is the output of the identity transformation in Step 3 above, that is:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <p>The <spooky_word synset_count="7" synset="ghost.n.03">ghost</spooky_word> 
            <spooky_word synset_count="4" synset="frighten.v.01">scared</spooky_word> 
            him by giving him a 
            <spooky_word synset_count="4" synset="scare.n.02">scare</spooky_word>.</p>
    </root>

Apply the following XSLT transformation:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
        <xsl:output method="xml" indent="yes"/>
        <xsl:template match="node() | @*">
            <xsl:copy>
                <xsl:apply-templates select="@* | node()"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="root">
            <xsl:copy>
                <meta>
                    <spookiness_ratio>
                        <xsl:value-of
                            select="count(distinct-values(//spooky_word/@synset)) div count(//spooky_word)"
                        />
                    </spookiness_ratio>
                </meta>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>

We start with the identity transformation, but when we match our root element (which we’ve arbitrarily called `<root>`), before we apply templates (that is, process its contents) we create a new `<meta>` child, which contains a `<spookiness_ratio>` element, and we calculate and insert the value there. In this case it turns out to be 1 because there are three `<spooky_word>` elements and three distinct `@synset` values. The fewer distinct synsets there are relative to the number of spooky words, the lower the value will be. If we use our sample input from above:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <p>The <spooky_word synset="ghost.n.03">ghost</spooky_word>
            <spooky_word synset="frighten.v.01">scared</spooky_word> and <spooky_word
                synset="frighten.v.01">frightened</spooky_word> him by giving him a <spooky_word
                synset="scare.n.02">scare</spooky_word>.</p>
    </root>

and run the same transformation, the value is 0.75 because there are four spooky words and three distinct synsets.
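The arithmetic behind both values is easy to verify by hand or with a short Python sketch. The function name `synset_ratio` is ours, and the two lists simply repeat the `@synset` values from the two sample documents.

```python
def synset_ratio(synsets):
    """Distinct synsets divided by the number of spooky-word instances."""
    if not synsets:
        return None  # no spooky words, so the ratio is undefined
    return len(set(synsets)) / len(synsets)

three_words = ['ghost.n.03', 'frighten.v.01', 'scare.n.02']
four_words = ['ghost.n.03', 'frighten.v.01', 'frighten.v.01', 'scare.n.02']

print(synset_ratio(three_words))  # 1.0  (3 distinct / 3 instances)
print(synset_ratio(four_words))   # 0.75 (3 distinct / 4 instances)
```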

This can be analogized to the _type/token ratio_ in corpus linguistics, where types are _distinct_ items (such as _different_ words in a text) and tokens are all of the items (such as the words in the same text, regardless of whether they’re duplicates of other words that we’ve already seen). A high type/token ratio means that the text is lexically varied, with little repetition of words. A low ratio means a less varied vocabulary. In this case the number of spooky words is our token count, and the number of distinct synsets is our type count. A high type/token ratio means that spookiness is expressed in a wider variety of ways; a value of 1 would mean that no synset is repeated. A low ratio would mean that the variety is less; the value cannot be 0 if there’s any spookiness at all, but the least varied possibility is that there are lots of spooky words, but they all represent the same synset.

Type/token ratios are sensitive to text length. This is easiest to see at the extreme: the number of distinct words in a language may be very large, but it isn’t infinite (at least, it isn’t infinite in any real language context), while texts can be arbitrarily long. That means that after your text reaches a certain length, you don’t know any words you haven’t used already, so you have to start repeating. The dependence of type/token ratio on text length means that if you want to compare type/token ratios, you can do that meaningfully only for texts of the same length. For that reason, if you want to compare our spookiness analogy across texts, you should use texts of the same length. 
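To see the length effect with numbers, here’s a sketch that computes the ratio over only the first few items of a sequence. Cutting longer texts down to a shared number of spooky words is just one possible way to put texts on an equal footing, not something this tutorial prescribes; the function name `ratio_at_length` and both synset sequences are made up for the demonstration.

```python
def ratio_at_length(synsets, length):
    """Type/token ratio over the first `length` items, or None if there are too few."""
    if length <= 0 or len(synsets) < length:
        return None
    sample = synsets[:length]
    return len(set(sample)) / length

# Hypothetical synset sequences for two texts of different lengths
text_a = ['ghost.n.03', 'frighten.v.01', 'scare.n.02', 'ghost.n.03',
          'frighten.v.01', 'ghost.n.03']
text_b = ['ghost.n.03', 'fear.n.01', 'frighten.v.01']

shared_length = min(len(text_a), len(text_b))
print(ratio_at_length(text_a, len(text_a)))    # 0.5 over all six items
print(ratio_at_length(text_a, shared_length))  # 1.0 over the first three
print(ratio_at_length(text_b, shared_length))  # 1.0 over the first three
```

Over its full length the first sequence looks only half as varied as the second, but over the same number of items the two are indistinguishable.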

### Explore the richness of the vocabulary (by writer or by text)

Synsets are represented by one or more lemmata, which you can retrieve with the `lemmas()` method, as in:

In [98]:
synsets = wn.synsets('ghost')
for synset in synsets:
    lemmata = synset.lemmas()
    print(str(synset) + ' means "' + synset.definition() + '" and has ' + str(len(lemmata)) + ' lemmata: ' + \
         str([lemma.name() for lemma in lemmata]))

Synset('ghost.n.01') means "a mental representation of some haunting experience" and has 6 lemmata: ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']
Synset('ghostwriter.n.01') means "a writer who gives the credit of authorship to someone else" and has 2 lemmata: ['ghostwriter', 'ghost']
Synset('ghost.n.03') means "the visible disembodied soul of a dead person" and has 1 lemmata: ['ghost']
Synset('touch.n.03') means "a suggestion of some quality" and has 3 lemmata: ['touch', 'trace', 'ghost']
Synset('ghost.v.01') means "move like a ghost" and has 1 lemmata: ['ghost']
Synset('haunt.v.02') means "haunt like a ghost; pursue" and has 3 lemmata: ['haunt', 'obsess', 'ghost']
Synset('ghost.v.03') means "write for someone else" and has 2 lemmata: ['ghost', 'ghostwrite']


We use the `name()` method to get just the lexical part of the lemma.

A writer or text that uses the synset 'ghost.n.01' has six lemmata available to express that meaning. What proportion of the available vocabulary does your writer or text use? 

That would be easy to calculate if the writer always used the exact form provided by the `name()` method of lemmata. You might find that a particular text contains the following mappings of synsets to word forms:

Synset | Word form
--- | ---
ghost.n.01 | ghost
ghost.n.01 | shade
ghost.n.01 | spook

You can count up the number of word forms associated with each synset, and because each word form corresponds to a different one of the 6 lemmata for that synset, you’ll determine correctly that the writer or text uses 50% of the available lemmata.

In [99]:
available = [lemma.name() for lemma in wn.synset('ghost.n.01').lemmas()]
used = ['ghost', 'shade', 'spook']
print(len(used) / len(available))

0.5


But suppose the forms include different inflections of the same lemma, such as singular 'ghost' and plural 'ghosts'. The challenge here is that those two forms represent the same lemma, so you can’t simply count forms and use that as a surrogate for counting lemmata. Wordnet helps resolve these situations with `wn.morphy()`, which lemmatizes (we’ve added the numbers just to make the output easier to read):

In [100]:
print(1, wn.morphy('ghost'))
print(2, wn.morphy('ghosts'))

1 ghost
2 ghost


This means that we can resolve that variation along the following lines. Here we have the same list as above, except that instead of three items in our `used` variable that correspond to three different lemmata, we have three items that correspond to only two lemmata. Here we print the values of `used` and `normalized` to show that they have the same length, but `normalized` has only two distinct values, while `used` has three. We then convert `normalized` from a list (which allows duplicates) to a set (which doesn’t), which is a quick way of removing duplicates:

In [101]:
available = [lemma.name() for lemma in wn.synset('ghost.n.01').lemmas()]
used = ['ghost', 'ghosts', 'spook']
normalized = [wn.morphy(item) for item in used]
print(used)
print(normalized)
print(len(set(normalized)) / len(available))

['ghost', 'ghosts', 'spook']
['ghost', 'ghost', 'spook']
0.3333333333333333


By using `wn.morphy()`, then, we can determine the richness of the vocabulary (the number of available different lemmata that are actually used) without being misled by different inflected forms of the same lexeme. Of course we still have to decide how to use these counts to explore or present information about how much of the available vocabulary variation the writer or text actually uses.
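One detail to plan for is that `wn.morphy()` returns `None` when it can’t find a lemma for a form, so the list comprehension above could end up with `None` values in it. The sketch below shows one way to guard against that, and it counts only forms that resolve to one of the available lemmata. The function name `lemma_coverage` is hypothetical, the lemmatizer is passed in as a parameter, and a toy dictionary stands in for `wn.morphy()` so that the example runs without Wordnet.

```python
def lemma_coverage(used, available, lemmatize):
    """Share of the `available` lemmata represented by the word forms in `used`."""
    normalized = set()
    for form in used:
        lemma = lemmatize(form) or form  # fall back to the form itself on None
        normalized.add(lemma)
    found = normalized & set(available)  # ignore forms outside this synset
    return len(found) / len(available)

# Toy stand-in for wn.morphy(); 'ghostie' is deliberately missing from it
toy_lemmas = {'ghost': 'ghost', 'ghosts': 'ghost', 'spook': 'spook'}
available = ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']
used = ['ghost', 'ghosts', 'spook', 'ghostie']

coverage = lemma_coverage(used, available, toy_lemmas.get)
print(coverage)  # 2 of the 6 available lemmata
```

With NLTK loaded, you could pass `wn.morphy` in place of `toy_lemmas.get`.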

### Words and phrases

Wordnet is primarily about words, and it contains little information about phrases. To the extent that spookiness is expressed in a phrase, if the spooky quality of the phrase depends on a particular word, it may be more useful to tag the word than the entire phrase, since you can look up the word in Wordnet. But in the case of idioms and other phrasal expressions, the spooky quality may not belong to any single word, and in that case you have to tag the entire phrase. You won’t find spooky phrases in Wordnet, but you can use XSLT to determine how much spooky meaning is expressed at a phrasal level that cannot be regarded as obtaining its spookiness from specific individual words. The XSLT for this is so easy that we won’t write out the code; you can retrieve all of the `<spooky_word>` elements and distinguish the ones that are phrases from the ones that aren’t by filtering them with `matches()`, along the lines of:

    //spooky_word[matches(., '\s')]

This retrieves all `<spooky_word>` elements and filters them to retain only the ones that match a regex pattern that includes a white space character (`'\s'`). You can read about the `matches()` function in Michael Kay. We use `matches()` instead of `contains()` because `contains()` looks for a literal string: the words of a phrase are usually separated by a space character, but they could be separated by a new line character, which isn’t the same string as a space character. Because `matches()` uses regex where `contains()` uses strings, `matches()` can ask “does this item contain any white space character, whether it’s a space character or a new line?”
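The same distinction between a literal space and any white space character is easy to try out in Python, whose `re` module also understands `\s`. This is a sketch with invented sample strings; stripping leading and trailing white space before testing is our own precaution, so that a single word followed by a stray line break isn’t counted as a phrase.

```python
import re

# Hypothetical contents of three <spooky_word> elements; the last one is
# split across a line break, as can happen in pretty-printed XML.
items = ['ghost', 'dead of night', 'chill\ndown the spine']

# An item is a phrase if white space remains after trimming the ends
phrases = [item for item in items if re.search(r'\s', item.strip())]
print(len(phrases), 'of', len(items), 'items are phrases')

# A literal-space test misses a phrase whose only separator is a new line
newline_only = 'witching\nhour'
print(' ' in newline_only)                     # False
print(bool(re.search(r'\s', newline_only)))    # True
```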

How you use this information is up to you, but it will tell you how much a writer or text depends on spooky words, and how much the spookiness is conveyed at a phrasal, rather than lexical level.