{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Wordnet sandbox\n",
"\n",
"Maintained by David J. Birnbaum, [djbpitt@gmail.com](mailto:djbpitt@gmail.com), http://www.obdurodon.org\n",
"\n",
"## Preface\n",
"\n",
"This tutorial illustrates the use of Wordnet for the types of exploration to be conducted in the [Dante’s _Inferno_](http://dante.obdurodon.org) and [Victorian ghost stories](http://ghost.obdurodon.org) research projects that were part of a [Computational methods in the humanities](http://dh.obdurodon.org) course in the autumn 2016 academic semester. Thanks to Na-Rae Han for discussion and suggestions.\n",
"\n",
"Students completing [Computational methods in the humanities](http://dh.obdurodon.org) to satisfy the “methods” requirement for the Linguistics major need to perform some linguistic tasks with their data, and Wordnet is one way to do that. Below, after an introduction to how Wordnet works, we describe how to add Wordnet-related markup to your XML and how to use that markup to explore your data. You do not need to add Wordnet-related markup to all of your data (which would not be feasible within the context of a semester-long course because some of the work must be performed manually and your documents may be long), but you should do enough of it to be able to experiment a bit with how it works. You also do not have to perform all of the tasks we describe below (which also would not be feasible in the available time); pick one or two that sound interesting and see what you’re able to learn about your documents by implementing them. Ask your instructors should you have any questions about either the content of this tutorial (that is, about how to use Wordnet) or the scope of the assignment.\n",
"\n",
"**tl;dr:** Use Wordnet as described below to add semantic markup to some (not all) of your data. Then perform some (not all) of the tasks below to explore how meaning is represented in your texts.\n",
"\n",
"## Introduction\n",
"\n",
"In Real Life you’ll export the words you care about from your XML using XSLT and then read the list into your Python program, but to start, let’s concentrate on learning how Wordnet works. We’re writing this tutorial in the **Jupyter notebook** interface, which allows us to break up the code into pieces that are interspersed with discussion. Because the code is fragmented, in order to run the statements at the bottom of the page you need to have run at least some of the ones at the top. For example, we import Wordnet at the beginning with `from nltk.corpus import wordnet as wn`, and later code depends on our having done that. This means that if you copy and try to run something below without having done the import, you’ll throw an error. We also create some variables near the top that we use below without redeclaring them. You don’t need to use Jupyter notebook for your own development; we’ve used it here because the combination of code cells and text cells is convenient for tutorial purposes.\n",
"\n",
"**tl;dr:** Run the code from the top of this notebook to the bottom, and not just in a single cell.\n",
"\n",
"## How Wordnet is organized\n",
"\n",
"Wordnet is a hierarchical organization of units of meaning, called **synsets**. Synsets are represented in texts by **words**, and a combination of a **lexeme** (represented by the dictionary form of a word) with a specific synset is called a **lemma**. Synsets are identified within Wordnet by three dot-separated parts:\n",
"\n",
"1. A representative word, that is, a word that conveys the meaning of the synset. This representative word may not be the only word that conveys that meaning, and it may also be able to convey other meanings. We’ll see below that the lexeme “ghost” can represent several different meanings (that is, is associated with multiple synsets), and that each of those meanings can alternatively be conveyed by lexemes other than “ghost”.\n",
"1. A part of speech (POS) identifier, like “n” for ‘noun’ or “v” for ‘verb’.\n",
"1. A two-digit number that distinguishes different synsets that may have the same head word and the same POS, but that convey different meanings. For example, the synsets 'ghost.n.01' and 'ghost.n.02' are two different nominal meanings that can be expressed by the lexeme “ghost”.\n",
"\n",
"### Exploring synsets\n",
"\n",
"There’s a lot more organization within Wordnet, but for the purpose of this tutorial we’re going to stick to the information conveyed through synsets. Let’s explore that with the synset 'koala.n.01', which is a noun that represents a particular arboreal Australian marsupial. Here’s how it looks when we ask Python about it:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Synset('koala.n.01')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.corpus import wordnet as wn # import Wordnet and call it just “wn” for brevity\n",
"wn.synset('koala.n.01')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above tells us that the synset 'koala.n.01' is a synset that Wordnet calls 'koala.n.01'. That tautology isn’t very useful, so the only point of the code snippet above is to determine whether such a synset exists. If it doesn’t, we’ll get an error. You can test this by running the cell below, which will raise an error because there is no 'koala.n.02' synset in Wordnet (your error message may differ from ours):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "WordNetError",
"evalue": "lemma 'koala' with part of speech 'n' has only 1 sense",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/Users/djb/anaconda/lib/python3.5/site-packages/nltk/corpus/reader/wordnet.py\u001b[0m in \u001b[0;36msynset\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 1233\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1234\u001b[0;31m \u001b[0moffset\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_lemma_pos_offset_map\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mlemma\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mpos\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0msynset_index\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1235\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mIndexError\u001b[0m: list index out of range",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mWordNetError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-3-5158dbbef11c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mwn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msynset\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'koala.n.02'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/djb/anaconda/lib/python3.5/site-packages/nltk/corpus/reader/wordnet.py\u001b[0m in \u001b[0;36msynset\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 1243\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1244\u001b[0m \u001b[0mtup\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlemma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpos\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mn_senses\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"senses\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1245\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mWordNetError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmessage\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mtup\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1246\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1247\u001b[0m \u001b[0;31m# load synset information from the appropriate file\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mWordNetError\u001b[0m: lemma 'koala' with part of speech 'n' has only 1 sense"
]
}
],
"source": [
"wn.synset('koala.n.02')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How can we know that there is a 'koala.n.01' synset but no 'koala.n.02' synset without asking for the latter and raising an error? We can ask Wordnet to tell us about all of the synsets associated with the word ‘koala’ by using the `wn.synsets()` function:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('koala.n.01')]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('koala')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The preceding code tells us that there is exactly one synset associated with the word ‘koala’, and that the synset is called 'koala.n.01'."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting the definition of a synset\n",
"\n",
"Synsets are units of meaning, and we can ask for a definition of a synset by using the `.definition()` method:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'sluggish tailless Australian arboreal marsupial with grey furry ears and coat; feeds on eucalyptus leaves and bark'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('koala.n.01').definition()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting the lexemes associated with a synset\n",
"\n",
"As we write above, synsets, as units of meaning, are represented in a text by lexemes, and the combination of a synset (a meaning) plus a lexeme (a word) is called a **lemma**. We can get the lemmata for a particular synset by asking for them with the `.lemmas()` method:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Lemma('koala.n.01.koala'),\n",
" Lemma('koala.n.01.koala_bear'),\n",
" Lemma('koala.n.01.kangaroo_bear'),\n",
" Lemma('koala.n.01.native_bear'),\n",
" Lemma('koala.n.01.Phascolarctos_cinereus')]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('koala.n.01').lemmas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that a lemma like 'koala.n.01.koala' combines the synset representation (“koala.n.01”) with a lexeme that expresses that meaning (“koala”). You can get just the lexical part, without the synset prefix, by applying the `.name()` method to a lemma. Here we ask for the first (zeroth in Python enumeration) lemma associated with our synset and return just its name:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'koala'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('koala.n.01').lemmas()[0].name()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What about inflected forms?\n",
"\n",
"As noted above, we can identify all of the synsets associated with a word by using the `wn.synsets()` function:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('koala.n.01')]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('koala')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The word that we use as an argument to the `wn.synsets()` function doesn’t have to be the dictionary form, which for nouns is typically the singular. We’ll get the same result if we ask for the synsets associated with the plural:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('koala.n.01')]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('koalas')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see above that the lexeme “koala” (whether represented by its singular or plural form) belongs to only one synset. The word “ghost”, though, belongs to seven, four of which are nouns and three of which are verbs:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('ghost.n.01'),\n",
" Synset('ghostwriter.n.01'),\n",
" Synset('ghost.n.03'),\n",
" Synset('touch.n.03'),\n",
" Synset('ghost.v.01'),\n",
" Synset('haunt.v.02'),\n",
" Synset('ghost.v.03')]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('ghost')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Synset summary\n",
"\n",
"* A word may represent multiple meanings, and we get the meanings with `wn.synsets()`.\n",
"* We can get a definition of a synset with `.definition()`.\n",
"* We can get the lemmata (combination of a lexeme with a meaning) associated with a synset with `.lemmas()`.\n",
"* We can get just the lexical part of a lemma with `.name()`."
]
},
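{
"cell_type": "markdown",
"metadata": {},
"source": [
"The four bullet points above can be combined into a single sketch. Nothing here is new; it just gathers, in one place, the calls demonstrated earlier on the 'koala.n.01' example:\n",
"\n",
"```python\n",
"from nltk.corpus import wordnet as wn\n",
"synsets = wn.synsets('koala')                         # all synsets for the word\n",
"first = synsets[0]                                    # Synset('koala.n.01')\n",
"definition = first.definition()                       # the gloss of the synset\n",
"lexemes = [lemma.name() for lemma in first.lemmas()]  # just the lexical parts\n",
"```"
]
},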
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Wordnet to explore course project data\n",
"\n",
"For this tutorial, assume that we’re interested in words that express scary concepts. This is close to the actual focus of [Victorian ghost stories](http://ghost.obdurodon.org); for the other project this semester, [Dante’s _Inferno_](http://dante.obdurodon.org), assume that we’re interested in painful concepts instead of scary ones. Pitt-Greensburg students on the [Eldritch team](https://github.com/PPH3/Eldritch) are investigating words in H. P. Lovecraft’s writings that convey an impression of the bizarre and arcane. The project teams have already tagged the interesting words using manual methods, but we’re assuming that they are all tagged only in a simple way, along the lines of `<spooky_word>ghost</spooky_word>`. This initial markup makes it possible to find the words we care about easily, but it doesn’t tell us what they mean beyond the fact that they’re associated with scariness.\n",
"\n",
"We can begin our richer exploration of meaning by compiling a list of sample words and examining their synsets. In the example below we’ve included four spooky words plus one non-spooky control item:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[[Synset('panic.n.02'),\n",
" Synset('scare.n.02'),\n",
" Synset('frighten.v.01'),\n",
" Synset('daunt.v.01')],\n",
" [Synset('ghost.n.01'),\n",
" Synset('ghostwriter.n.01'),\n",
" Synset('ghost.n.03'),\n",
" Synset('touch.n.03'),\n",
" Synset('ghost.v.01'),\n",
" Synset('haunt.v.02'),\n",
" Synset('ghost.v.03')],\n",
" [Synset('fear.n.01'), Synset('frighten.v.01')],\n",
" [Synset('creep.n.01'), Synset('ghost.n.01'), Synset('spook.v.01')],\n",
" [Synset('koala.n.01')]]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.corpus import wordnet as wn # import Wordnet and call it just “wn” for brevity\n",
"words = ['scare', 'ghost', 'fright', 'spook', 'koala'] # create a list of words to examine\n",
"synset_list = [wn.synsets(word) for word in words] # get the synsets for each word\n",
"synset_list # display them"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above is a list of lists, where each of the inner lists contains the synsets that pertain to a particular word form. We can see that the first inner list shows the four synsets associated with the word “scare”, the second inner list shows the seven synsets associated with the word “ghost”, etc. Our assumption is that each word taken from a text is associated, _in the context in which it occurs_, with exactly one meaning represented by one of the available synsets. The part about context matters; the same lexeme may occur in different contexts with different meanings within the same text. For example, as noted above, the word “scare” may be a noun in one place and a verb in another.\n",
"\n",
"Occasionally your texts may contain words that are not included in Wordnet, or words that are used with meanings that are not represented in Wordnet. You cannot add anything to Wordnet, so when that happens, make a note of it; you’ll have to exclude those words from your Wordnet processing."
]
},
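{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can spot such out-of-Wordnet words programmatically, because `wn.synsets()` returns an empty list for a word form it cannot match. A minimal sketch (the word list here is hypothetical, and “eldritchness” is an invented form chosen so that it will miss):\n",
"\n",
"```python\n",
"from nltk.corpus import wordnet as wn\n",
"words = ['ghost', 'eldritchness']  # hypothetical sample; 'eldritchness' is made up\n",
"unknown = [word for word in words if not wn.synsets(word)]\n",
"unknown  # note these words and exclude them from your Wordnet processing\n",
"```"
]
},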
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Add synset markup to your documents\n",
"\n",
"So far your data contains nothing more than a tag that identifies spooky words, e.g., `<spooky_word>ghost</spooky_word>`. Your goal here is to identify the synset represented by the word in its context and add an attribute (`@synset`) to the markup, using a value that identifies the synset. This task requires human analysis, since although Wordnet can tell you the possible synsets for a particular lexeme, it can’t tell which of those available meanings the lexeme has at a particular location in the text. Remember that the same word form may represent different synsets in different locations. For example, as noted above, “scare” could be a noun in one place and a verb in a different place, and those are different synsets. You don’t need to do this for your entire corpus, which wouldn’t be realistic given the fifteen-week semester and the size of the corpus, but you’ll want to do enough to get a sense of the relationship between word forms in your corpus and the synsets that Wordnet uses to represent units of meaning.\n",
"\n",
"The procedure for adding synset markup to the document has three steps:\n",
"\n",
"1. Get the definitions of each synset for each scary word in your corpus or selection. You can use Python to do this.\n",
"1. Choose the appropriate synset for each scary word in your corpus or selection. This requires human decisions, since Python doesn’t understand the context.\n",
"1. Write the correct synset into the markup as a new `@synset` attribute. You have to do this manually, as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Get the definitions of each synset for each word\n",
"\n",
"You can get the definition of a synset like `Synset('panic.n.02')` with:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'sudden mass fear and anxiety over anticipated events'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('panic.n.02').definition()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A lexeme like “scare” is associated with four synsets:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('panic.n.02'),\n",
" Synset('scare.n.02'),\n",
" Synset('frighten.v.01'),\n",
" Synset('daunt.v.01')]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synsets('scare')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each occurrence of some form of “scare” in our texts (it might be ‘scare’ or ‘scares’ or some other inflected form), we want to add an attribute to our XML that indicates the appropriate synset. To tell the synsets apart (in case the sample word that’s part of the synset identifier is not sufficiently clear by itself), we can get their definitions. The code below outputs each synset and its definition:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[\"Synset('panic.n.02') means: sudden mass fear and anxiety over anticipated events\",\n",
" \"Synset('scare.n.02') means: a sudden attack of fear\",\n",
" \"Synset('frighten.v.01') means: cause fear in\",\n",
" \"Synset('daunt.v.01') means: cause to lose courage\"]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[str(item) + ' means: ' + item.definition() for item in wn.synsets('scare')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the `str()` function above to stringify the synset (represented by the variable `item`) so that we can concatenate it with the other strings for output."
]
},
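{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer, string formatting can do the stringification for you, since `.format()` calls `str()` on its arguments implicitly. This sketch produces the same kind of output as the cell above:\n",
"\n",
"```python\n",
"from nltk.corpus import wordnet as wn\n",
"messages = ['{} means: {}'.format(synset, synset.definition()) for synset in wn.synsets('scare')]\n",
"messages\n",
"```"
]
},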
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Choose the appropriate synset for each spooky word _in context_\n",
"\n",
"Once you know the synsets that are available for each word in your document, look at your XML and choose the appropriate synset for each word _in context_. For example, if “scare” occurs as a verb that means ‘cause fear in’ in one place, the synset you’d choose from above would be 'frighten.v.01'. If it occurs as a noun that means ‘a sudden attack of fear’ in another, you’d choose 'scare.n.02'."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Write the synset information back into the XML\n",
"\n",
"You can’t write the synset information back into the XML automatically because the same word form in the XML might belong to different synsets in different locations (like the use of ‘scare’ as a verb or as a noun, described above). For that reason, you’ll want to add the synset value manually to the tagged words in your XML. For example, if you have:\n",
"\n",
"```xml\n",
"<p>He <spooky_word>scared</spooky_word> them.</p>\n",
"```\n",
"\n",
"You would expand the markup to:\n",
"\n",
"```xml\n",
"<p>He <spooky_word synset=\"frighten.v.01\">scared</spooky_word> them.</p>\n",
"```\n",
"\n",
"The easiest way to add this type of markup is to load the document into <oXygen/> and do a search and replace, searching for the string\n",
"\n",
"```xml\n",
"<spooky_word\n",
"```\n",
"\n",
"and replace it with\n",
"\n",
"```xml\n",
"<spooky_word synset=\"\"\n",
"```\n",
"\n",
"This will write the `@synset` attribute into the start tag with a null value, and you can then use the XPath browser box to find all `<spooky_word>` elements (using the XPath expression `//spooky_word`) and type in the attribute values. You’ll want to modify your schema so that this new attribute will be valid."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examine the lemmata for each synset\n",
"\n",
"At the moment this is just for curiosity. Below we construct a list of two synsets and for each of them we print the Wordnet synset identifier and a list of the lexemes associated with it. As described above, we use the `.lemmas()` method to get the lemmata associated with the synset and we use the `.name()` method to keep only the lexical part of the lemma:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Synset('scare.n.02') has the following lemmata: ['scare', 'panic_attack']\n",
"Synset('frighten.v.01') has the following lemmata: ['frighten', 'fright', 'scare', 'affright']\n"
]
}
],
"source": [
"scare_synsets = [wn.synset('scare.n.02'), wn.synset('frighten.v.01')]\n",
"for synset in scare_synsets:\n",
" print(str(synset) + ' has the following lemmata: ' + str([lemma.name() for lemma in synset.lemmas()]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tasks\n",
"\n",
"Once you’ve added the new `@synset` attributes to your XML, as described above, here are some tasks you can perform to explore them. Linguistics students who need to meet the Linguistics Department “methods” requirement should choose one or two of the following tasks. You don’t need to process your entire corpus, which wouldn’t be realistic in the context of a one-semester course, and for the same reason you don’t need to implement all of the suggested tasks. But you want to do enough to get a sense of what Wordnet can tell you about the semantics of your documents."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Explore lexical ambiguity\n",
"\n",
"Word forms in your text will belong to zero or more synsets, although an _occurrence_ of a word form will belong to only one synset _in its particular context_. You can quantify the degree of lexical ambiguity, and thus the extent to which the meaning of the word depends on context, by retrieving the number of synsets for each word form in your data. Note that the focus here is on lexical ambiguity, that is, the meanings that a word could have in isolation. This is different from the contextual ambiguity that might interest scholars of literature, where ambiguity _in a specific context_ (that is, ambiguity that persists even in a particular context) might be used to express irony or for other rhetorical purposes.\n",
"\n",
"One way to think about lexical ambiguity from a Wordnet perspective is to count the synsets that a word can represent. Here’s how to do that (using the `words` variable we created above, which is equal to a list of five specific words):"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The word \"scare\" belongs to 4 synsets\n",
"The word \"ghost\" belongs to 7 synsets\n",
"The word \"fright\" belongs to 2 synsets\n",
"The word \"spook\" belongs to 3 synsets\n",
"The word \"koala\" belongs to 1 synsets\n"
]
}
],
"source": [
"for word in words:\n",
" synset_count = len(wn.synsets(word))\n",
" print('The word \"' + word + '\" belongs to ' + str(synset_count) + ' synsets')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The preceding output is fine for humans, but we want to write these counts back into our XML. We can do that automatically in three steps:\n",
"\n",
"1. Use XSLT to export a plain text list of words you’ve tagged (e.g., spooky words) from your XML data files.\n",
"1. Use Python to create an XML auxiliary document that maps each of those words to its synset count. The Python script will read the exported plain text list, use Wordnet to count the number of synsets associated with each of them, and write the word plus the count into the new XML document.\n",
"1. Use an XSLT _identity transformation_ to write the synset count into the XML as new content. Your XSLT transformation will transform each of your XML data files to itself (that is, the output will be identical to the input), except that it will insert an additional `@synset_count` attribute that includes the count of synsets associated with the word form.\n",
"\n",
"Here’s how that works:\n",
"\n",
"#### Step 1: Export a plain text list of words you’ve tagged (e.g., spooky words)\n",
"\n",
"Here’s some original sample XML:\n",
"\n",
"```xml\n",
"<root>\n",
" <p>The <spooky_word>ghost</spooky_word> <spooky_word>scared</spooky_word> \n",
" him by giving him a <spooky_word>scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"In this XML we have tagged the spooky words. We then manually add the synset markup, as described above:\n",
"\n",
"```xml\n",
"<root>\n",
" <p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word> \n",
" <spooky_word synset=\"frighten.v.01\">scared</spooky_word> \n",
" him by giving him a \n",
" <spooky_word synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"We then run the following XSLT transformation, outputting plain text:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n",
" xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" exclude-result-prefixes=\"xs\" version=\"2.0\">\n",
" <xsl:output method=\"text\" indent=\"yes\"/>\n",
" <xsl:template match=\"/\">\n",
" <xsl:apply-templates select=\"//spooky_word\"/>\n",
" </xsl:template>\n",
" <xsl:template match=\"spooky_word\">\n",
"        <xsl:value-of select=\"concat(.,'&#x0A;')\"/>\n",
" </xsl:template>\n",
"</xsl:stylesheet>\n",
"```\n",
"\n",
"Note that the value of the `@method` attribute on the `<xsl:output>` element is \"text\" because we’re creating plain text. We apply templates to the `<spooky_word>` elements, and in the template that matches those elements, we output the result of concatenating the content of the element (the word itself) with a newline character (spelled `&#x0A;`, which is the _numeric character reference_ for a newline). The output looks like:\n",
"\n",
" ghost\n",
" scared\n",
" scare\n",
"\n",
"We can save that to a file (let’s call it “spooky_words.txt”), so that we can access it later with Python.\n",
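"\n",
"If you would rather stay in Python for this step, the standard-library `xml.etree.ElementTree` module can extract the same list. Here is a minimal sketch, with the sample XML inlined as a string for illustration (with real data you would parse your file instead):\n",
"\n",
"```python\n",
"import xml.etree.ElementTree as ET  # standard library, no installation needed\n",
"\n",
"xml_data = '''<root>\n",
"  <p>The <spooky_word>ghost</spooky_word> <spooky_word>scared</spooky_word>\n",
"  him by giving him a <spooky_word>scare</spooky_word>.</p>\n",
"</root>'''\n",
"root = ET.fromstring(xml_data)  # parse the XML from the string\n",
"words = [word.text for word in root.iter('spooky_word')]  # tagged words in document order\n",
"print(words)  # ['ghost', 'scared', 'scare']\n",
"```\n",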
"\n",
"#### Step 2: Access that file with Python and create a new XML file that maps each word form to its synset count"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with open('spooky_words.txt', 'r') as infile: # open the plain text file that contains the list of words\n",
"    wordlist = infile.read().split() # read the words into a list, splitting on whitespace (here, the newlines)\n",
"with open('synset_counts.xml', 'w') as outfile: # open a file to hold the XML output\n",
" outfile.write('<root>') # create a start tag for the root element in the output XML file\n",
" for word in wordlist: # create output for each word\n",
" synset_count = len(wn.synsets(word)) # for each word, count the number of synsets to which it belongs\n",
" outfile.write('<word><form>' + word + '</form><count>' + str(synset_count) + '</count></word>') # write it out\n",
" outfile.write('</root>') # create the end tag for the root element"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We saved the output to a file called synset\\_counts.xml, so we don’t see it here in the notebook, but we can now use Python to read it. This is just for human inspection, to make sure that it looks the way we want. It isn’t pretty-printed, but we can still see how it looks:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<root><word><form>ghost</form><count>7</count></word><word><form>scared</form><count>4</count></word><word><form>scare</form><count>4</count></word></root>\n"
]
}
],
"source": [
"with open('synset_counts.xml') as infile:\n",
" print(infile.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 3: To write the counts back into the XML, use an _identity transformation_, reading in the new count file with the XPath `document()` function\n",
"\n",
"Assume that we’ve saved our original XML (with the synsets, but without the counts) as original.xml. It looks like:\n",
"\n",
"```xml\n",
"<root>\n",
" <p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word> \n",
" <spooky_word synset=\"frighten.v.01\">scared</spooky_word> \n",
" him by giving him a \n",
" <spooky_word synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"Transform it with the following XSLT:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n",
" xmlns:xs=\"http://www.w3.org/2001/XMLSchema\"\n",
" exclude-result-prefixes=\"xs\"\n",
" version=\"2.0\">\n",
" <xsl:variable name=\"count_file\" as=\"document-node()\" select=\"document('synset_counts.xml')\"/>\n",
" <xsl:template match=\"node()|@*\">\n",
" <xsl:copy>\n",
" <xsl:apply-templates select=\"@*|node()\"/>\n",
" </xsl:copy>\n",
" </xsl:template>\n",
" <xsl:template match=\"spooky_word\">\n",
" <xsl:copy>\n",
" <xsl:attribute name=\"synset_count\" select=\"$count_file//word[form eq current()]/count\"/>\n",
" <xsl:apply-templates select=\"@*|node()\"/>\n",
" </xsl:copy>\n",
" </xsl:template>\n",
"</xsl:stylesheet>\n",
"```\n",
"\n",
"The `document()` function opens synset\\_counts.xml (which we created with Python in Step #2) so that we can access it (through the variable `$count_file`) while we’re transforming original.xml. The first template is an _identity transformation_, which you can read about at https://en.wikipedia.org/wiki/Identity_transform. In an identity transformation, the identity template copies everything unchanged (that is, the output is a copy of the input), and you write separate templates only for the bits that you want to change. In this case, we’re adding a new `@synset_count` attribute to each `<spooky_word>` element, copying its value from the auxiliary file that we created with Python in the preceding step.\n",
"\n",
"Here’s the output of that last transformation:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<root>\n",
" <p>The <spooky_word synset_count=\"7\" synset=\"ghost.n.03\">ghost</spooky_word> \n",
" <spooky_word synset_count=\"4\" synset=\"frighten.v.01\">scared</spooky_word> \n",
" him by giving him a \n",
" <spooky_word synset_count=\"4\" synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
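"The same augmentation can also be sketched in plain Python with `xml.etree.ElementTree`, if XSLT isn’t available. This is an illustrative stand-in, not the pipeline above: both synset\\_counts.xml and a reduced original.xml are inlined as strings:\n",
"\n",
"```python\n",
"import xml.etree.ElementTree as ET  # standard library\n",
"\n",
"counts_xml = ('<root><word><form>ghost</form><count>7</count></word>'\n",
"              '<word><form>scared</form><count>4</count></word>'\n",
"              '<word><form>scare</form><count>4</count></word></root>')\n",
"# map each word form to its count, mirroring the $count_file lookup in the XSLT\n",
"counts = {w.find('form').text: w.find('count').text\n",
"          for w in ET.fromstring(counts_xml).iter('word')}\n",
"\n",
"original = ET.fromstring(\n",
"    '<root><p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word> scared him.</p></root>')\n",
"for word in original.iter('spooky_word'):\n",
"    word.set('synset_count', counts[word.text])  # add the attribute in place\n",
"print(ET.tostring(original, encoding='unicode'))\n",
"```\n",
"\n",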
"We can then calculate the extent of ambiguity for the entire document or for each individual paragraph. Note that we do not need to have identified a particular synset for each spooky word in order to determine the extent of ambiguity; that is, the `@synset` attributes shown in the example are not actually needed. All that is required is a count of all the available synsets for each word. We might decide that the ambiguity of a paragraph is the average of all of the `@synset_count` values in that paragraph, so that for the sole paragraph here it would be 5, that is, the sum of the three values (15) divided by the number of values (3). We could graph this with SVG to examine whether there’s a pattern to the ambiguity, that is, whether it’s higher in some locations of the story than in others. We could also look for correlations between, say, the number of spooky words and the degree of ambiguity. Or we could compare stories or authors to see whether there is any regularity or other pattern in the ambiguity."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Determine the number of representations of each synset in each document\n",
"\n",
"You can use XSLT to determine which synsets are favored in which texts or by which authors or at which periods. Consider the following input document:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<root>\n",
" <p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word>\n",
" <spooky_word synset=\"frighten.v.01\">scared</spooky_word> and <spooky_word\n",
" synset=\"frighten.v.01\">frightened</spooky_word> him by giving him a <spooky_word\n",
" synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"This has four spooky words representing three different synsets. We can count the number of occurrences of each synset using XSLT:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n",
" xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" exclude-result-prefixes=\"xs\" version=\"2.0\">\n",
" <xsl:output method=\"xml\" indent=\"yes\"/>\n",
" <xsl:variable name=\"root\" select=\"/\"/>\n",
" <xsl:template match=\"/\">\n",
" <data>\n",
" <xsl:for-each select=\"distinct-values(//spooky_word/@synset)\">\n",
" <synset_count>\n",
" <synset>\n",
" <xsl:value-of select=\"current()\"/>\n",
" </synset>\n",
" <count>\n",
" <xsl:value-of select=\"count($root//spooky_word[@synset eq current()])\"/>\n",
" </count>\n",
" </synset_count>\n",
" </xsl:for-each>\n",
" </data>\n",
" </xsl:template>\n",
"</xsl:stylesheet>\n",
"```\n",
"\n",
"We set a variable called `$root` because when we do `<xsl:for-each>` over distinct values, we cut ourselves off from the tree, so if we want to get back to it, we need to access it through that variable. Here we get each distinct `@synset` value and count the number of `<spooky_word>` elements that have a `@synset` attribute with that value. In this case the output is:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<data>\n",
" <synset_count>\n",
" <synset>ghost.n.03</synset>\n",
" <count>1</count>\n",
" </synset_count>\n",
" <synset_count>\n",
" <synset>frighten.v.01</synset>\n",
" <count>2</count>\n",
" </synset_count>\n",
" <synset_count>\n",
" <synset>scare.n.02</synset>\n",
" <count>1</count>\n",
" </synset_count>\n",
"</data>\n",
"```\n",
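"\n",
"For comparison, the same tally takes a few lines of Python with `collections.Counter`; a sketch with the four-word sample document inlined as a string:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"import xml.etree.ElementTree as ET\n",
"\n",
"doc = ET.fromstring(\n",
"    '<root><p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word> '\n",
"    '<spooky_word synset=\"frighten.v.01\">scared</spooky_word> and '\n",
"    '<spooky_word synset=\"frighten.v.01\">frightened</spooky_word> him by giving him a '\n",
"    '<spooky_word synset=\"scare.n.02\">scare</spooky_word>.</p></root>')\n",
"tallies = Counter(w.get('synset') for w in doc.iter('spooky_word'))\n",
"for synset, count in tallies.items():  # insertion order: first occurrence first\n",
"    print(synset, count)\n",
"```\n",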
"\n",
"We could transform that to HTML or SVG for display. The counts let us ask: do some works or authors show a preference for certain synset expressions of spookiness?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Explore the richness of the expression of spookiness\n",
"\n",
"Since we’ve already assigned a synset to each spooky word in our text, we can count the number of different synsets in the text. Do some writers represent spookiness with a greater range of spooky-related meanings, that is, with more synsets, than other writers? Because texts may be of different lengths, we might want not just to count the number of different synsets, but to express the value as the result of dividing the number of distinct synsets by the number of spooky word instances. We can do that with XSLT and write the result into the document as metadata, performing another identity transformation and this time just adding the count in a new element. Assume our input is the output of the last operation, that is:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<root>\n",
" <p>The <spooky_word synset_count=\"7\" synset=\"ghost.n.03\">ghost</spooky_word> \n",
" <spooky_word synset_count=\"4\" synset=\"frighten.v.01\">scared</spooky_word> \n",
" him by giving him a \n",
" <spooky_word synset_count=\"4\" synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"Apply the following XSLT transformation:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n",
" xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" exclude-result-prefixes=\"xs\" version=\"2.0\">\n",
" <xsl:output method=\"xml\" indent=\"yes\"/>\n",
" <xsl:template match=\"node() | @*\">\n",
" <xsl:copy>\n",
" <xsl:apply-templates select=\"@* | node()\"/>\n",
" </xsl:copy>\n",
" </xsl:template>\n",
" <xsl:template match=\"root\">\n",
" <xsl:copy>\n",
" <meta>\n",
" <spookiness_ratio>\n",
" <xsl:value-of\n",
" select=\"count(distinct-values(//spooky_word/@synset)) div count(//spooky_word)\"\n",
" />\n",
" </spookiness_ratio>\n",
" </meta>\n",
" <xsl:apply-templates/>\n",
" </xsl:copy>\n",
" </xsl:template>\n",
"</xsl:stylesheet>\n",
"```\n",
"\n",
"We start with the identity transformation, but when we match our root element (which we’ve arbitrarily called `<root>`), before we apply templates (that is, process its contents) we create a new `<meta>` child, which contains a `<spookiness_ratio>` element, and we calculate and insert the value there. In this case it turns out to be 1 because there are three `<spooky_word>` elements and three distinct `@synset` values. The fewer distinct synsets there are, the lower the value will be. If we use our sample input from above:\n",
"\n",
"```xml\n",
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
"<root>\n",
" <p>The <spooky_word synset=\"ghost.n.03\">ghost</spooky_word>\n",
" <spooky_word synset=\"frighten.v.01\">scared</spooky_word> and <spooky_word\n",
" synset=\"frighten.v.01\">frightened</spooky_word> him by giving him a <spooky_word\n",
" synset=\"scare.n.02\">scare</spooky_word>.</p>\n",
"</root>\n",
"```\n",
"\n",
"and run the same transformation, the value is 0.75 because there are four spooky words and three distinct synsets. Note that if we apply this transformation to a document with no spooky words, it will throw an error because we would be dividing by zero. If we know that we won’t apply our transformation to any such documents, we can ignore the risk, but a less brittle strategy might trap the error, report it gracefully, and terminate cleanly, instead of falling back on the XSLT processor’s default error handling.\n",
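"\n",
"The same ratio, guard included, can be sketched in a few lines of Python over a list of `@synset` values (hypothetical sample data matching the four-word example):\n",
"\n",
"```python\n",
"synsets = ['ghost.n.03', 'frighten.v.01', 'frighten.v.01', 'scare.n.02']\n",
"if synsets:  # guard against dividing by zero when there are no spooky words\n",
"    ratio = len(set(synsets)) / len(synsets)  # distinct synsets / spooky words\n",
"    print(ratio)  # 0.75\n",
"else:\n",
"    print('No spooky words; the ratio is undefined')\n",
"```\n",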
"\n",
"Dividing the number of distinct synsets by the count of spooky words can be analogized to the _type/token ratio_ in corpus linguistics. In a type/token ratio, types are _distinct_ items (such as _different_ words in a text) and tokens are all of the items (such as every word in the same text, regardless of whether it duplicates a word we’ve already seen). A high type/token ratio means that the text is lexically varied, with little repetition of words; a low ratio means a less varied vocabulary. In this case the number of distinct synsets is our type count and the number of spooky words is our token count. A high ratio means that spookiness is expressed in a wider variety of ways; a value of 1 would mean that no synset is repeated. A low ratio means less variety; the value cannot be 0 if there’s any spookiness at all, but the least varied possibility is that there are many spooky words and they all represent the same synset.\n",
"\n",
"Type/token ratios are sensitive to text length. This is easiest to see at the extreme: the number of distinct words in a language may be very large, but it isn’t infinite (at least, not in any real language context), while texts can be arbitrarily long. That means that once a text reaches a certain length, there are no words left that haven’t been used already, so making the text any longer requires repeating words. Because the type/token ratio depends on text length, you can meaningfully compare type/token ratios only for texts of the same length. For that reason, if you want to compare our spookiness analogy across texts, you should use texts of the same length."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Explore the richness of the vocabulary (by writer or by text)\n",
"\n",
"Synsets are represented by one or more lemmata, which you can retrieve with the `lemmas()` method, as in:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Synset('ghost.n.01') means \"a mental representation of some haunting experience\" and has 6 lemmata: ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']\n",
"Synset('ghostwriter.n.01') means \"a writer who gives the credit of authorship to someone else\" and has 2 lemmata: ['ghostwriter', 'ghost']\n",
"Synset('ghost.n.03') means \"the visible disembodied soul of a dead person\" and has 1 lemmata: ['ghost']\n",
"Synset('touch.n.03') means \"a suggestion of some quality\" and has 3 lemmata: ['touch', 'trace', 'ghost']\n",
"Synset('ghost.v.01') means \"move like a ghost\" and has 1 lemmata: ['ghost']\n",
"Synset('haunt.v.02') means \"haunt like a ghost; pursue\" and has 3 lemmata: ['haunt', 'obsess', 'ghost']\n",
"Synset('ghost.v.03') means \"write for someone else\" and has 2 lemmata: ['ghost', 'ghostwrite']\n"
]
}
],
"source": [
"synsets = wn.synsets('ghost')\n",
"for synset in synsets:\n",
" lemmata = synset.lemmas()\n",
" print(str(synset) + ' means \"' + synset.definition() + '\" and has ' + str(len(lemmata)) + ' lemmata: ' + \\\n",
" str([lemma.name() for lemma in lemmata]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the `name()` method to get just the lexical part of the lemma. (We took a lazy way out and used the plural “lemmata” even after the value 1, although “has 1 lemmata” should really read “has 1 lemma”. If we intended to use this code to produce final output for end-users, we’d include additional code to control for that difference.)\n",
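"\n",
"Controlling for that difference takes only a conditional expression; a minimal sketch:\n",
"\n",
"```python\n",
"def lemma_phrase(n):\n",
"    # pick the grammatically correct noun for the count\n",
"    return str(n) + ' ' + ('lemma' if n == 1 else 'lemmata')\n",
"\n",
"print(lemma_phrase(1))  # 1 lemma\n",
"print(lemma_phrase(3))  # 3 lemmata\n",
"```\n",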
"\n",
"A writer or text that uses the synset 'ghost.n.01' has six lemmata available to express that meaning. What proportion of the available vocabulary does your writer or text use? \n",
"\n",
"That would be easy to calculate if the writer always used the exact form provided by the `name()` method of lemmata. For example, you might find that a particular text contains the following mappings of lemmata and word forms:\n",
"\n",
"Synset | Word form\n",
"--- | ---\n",
"ghost.n.01 | ghost\n",
"ghost.n.01 | shade\n",
"ghost.n.01 | spook\n",
"\n",
"You can count up the number of word forms associated with each synset, and because each word form corresponds to a different one of the 6 lemmata for that synset, you’ll determine correctly that the writer or text uses 50% of the available lemmata."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 6 lemmata for ghost.n.01 and they are: ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']\n",
"The 3 lemmata for ghost.n.01 used in the document are ['ghost', 'shade', 'spook']\n",
"The ratio of used (3) divided by available (6) = 0.5\n"
]
}
],
"source": [
"available = [lemma.name() for lemma in wn.synset('ghost.n.01').lemmas()]\n",
"print('There are ' + str(len(available)) + ' lemmata for ghost.n.01 and they are: ' + str(available))\n",
"used = ['ghost', 'shade', 'spook']\n",
"print('The 3 lemmata for ghost.n.01 used in the document are ' + str(used))\n",
"print('The ratio of used (' + str(len(used)) + ') divided by available (' + \\\n",
" str(len(available)) + ') = ' + str(len(used) / len(available)))"