### The following code is from Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 978-0-596-51649-9 

## 1   Accessing Text Corpora

### 1.1 Gutenberg Corpus

#### NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. 

In [1]:
import nltk

In [2]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

#### Let's pick out the first of these texts — Emma by Jane Austen — and give it a short name, emma, then find out how many words it contains:

In [31]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

In [4]:
len(emma)

192427

#### When we defined emma, we invoked the words() function of the gutenberg object in NLTK's corpus package. But since it is cumbersome to type such long names all the time, Python provides another version of the import statement, as follows

In [5]:
from nltk.corpus import gutenberg

In [6]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

#### Display the corpus reader object to see where the corpus is stored on disk

In [7]:
gutenberg

<PlaintextCorpusReader in 'C:\\Users\\geoff\\AppData\\Roaming\\nltk_data\\corpora\\gutenberg'>

#### The path will differ on your machine; use it to locate the gutenberg corpus on your local computer

In [8]:
emma=gutenberg.words('austen-emma.txt')

In [9]:
emma[0:20]

['[',
 'Emma',
 'by',
 'Jane',
 'Austen',
 '1816',
 ']',
 'VOLUME',
 'I',
 'CHAPTER',
 'I',
 'Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich']

#### Let's write a short program to display other information about each text, by looping over all the values of fileid corresponding to the gutenberg file identifiers listed earlier and then computing statistics for each text. For a compact output display, we will round each number to the nearest integer, using round().

In [10]:
for fileid in gutenberg.fileids():
    num_chars=len(gutenberg.raw(fileid))
    num_words=len(gutenberg.words(fileid))
    num_sents=len(gutenberg.sents(fileid))
    num_vocab=len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words),round(num_words/num_sents),round(num_words/num_vocab),fileid)

5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt


#### This program displays three statistics for each text:

#### 1) average word length: num_chars/num_words

#### 2) average sentence length: num_words/num_sents

#### 3) the number of times each vocabulary item appears in the text on average (our lexical diversity score): num_words/num_vocab

#### Observe that average word length appears to be a general property of English, since it takes a value of 4 or 5 for every text. (In fact, the true average word length is about one character lower, since the num_chars variable counts space characters.)

#### By contrast average sentence length and lexical diversity appear to be characteristics of particular authors.

#### The previous example also showed how we can access the "raw" text of the book, not split up into tokens. 

#### The raw() function gives us the contents of the file without any linguistic processing.  It tells us how many letters occur in the text, including the spaces between words. 

#### The sents() function divides the text up into its sentences, where each sentence is a list of words:

In [11]:
macbeth_sentences=gutenberg.sents('shakespeare-macbeth.txt')

In [12]:
macbeth_sentences

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]

In [13]:
macbeth_sentences[1116]

['Double',
 ',',
 'double',
 ',',
 'toile',
 'and',
 'trouble',
 ';',
 'Fire',
 'burne',
 ',',
 'and',
 'Cauldron',
 'bubble']

In [14]:
longest_len=max(len(s) for s in macbeth_sentences)
longest_len

158

In [15]:
[s for s in macbeth_sentences if len(s)==longest_len]

[['Doubtfull',
  'it',
  'stood',
  ',',
  'As',
  'two',
  'spent',
  'Swimmers',
  ',',
  'that',
  'doe',
  'cling',
  'together',
  ',',
  'And',
  'choake',
  'their',
  'Art',
  ':',
  'The',
  'mercilesse',
  'Macdonwald',
  '(',
  'Worthie',
  'to',
  'be',
  'a',
  'Rebell',
  ',',
  'for',
  'to',
  'that',
  'The',
  'multiplying',
  'Villanies',
  'of',
  'Nature',
  'Doe',
  'swarme',
  'vpon',
  'him',
  ')',
  'from',
  'the',
  'Westerne',
  'Isles',
  'Of',
  'Kernes',
  'and',
  'Gallowgrosses',
  'is',
  'supply',
  "'",
  'd',
  ',',
  'And',
  'Fortune',
  'on',
  'his',
  'damned',
  'Quarry',
  'smiling',
  ',',
  'Shew',
  "'",
  'd',
  'like',
  'a',
  'Rebells',
  'Whore',
  ':',
  'but',
  'all',
  "'",
  's',
  'too',
  'weake',
  ':',
  'For',
  'braue',
  'Macbeth',
  '(',
  'well',
  'hee',
  'deserues',
  'that',
  'Name',
  ')',
  'Disdayning',
  'Fortune',
  ',',
  'with',
  'his',
  'brandisht',
  'Steele',
  ',',
  'Which',
  'smoak',
  "'",
  'd',
 

#### Exercise 0. Another way to calculate average word length and sentence length for each text in the gutenberg corpus: average word length = sum of the word lengths / number of words; average sentence length = sum of the sentence lengths / number of sentences.
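
#### One possible sketch of the formulas in Exercise 0, using a small made-up token list in place of gutenberg.words(fileid) and gutenberg.sents(fileid). The function name average_lengths and the toy data are assumptions for illustration only. Note that punctuation tokens count as words here, just as they do in the corpus reader output.

```python
def average_lengths(words, sents):
    """Return (mean word length, mean sentence length) for tokenized text."""
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = sum(len(s) for s in sents) / len(sents)
    return avg_word_len, avg_sent_len


# Toy data (hypothetical) standing in for the corpus reader output.
toy_sents = [['Emma', 'was', 'clever', '.'], ['She', 'was', 'rich', '.']]
toy_words = [w for s in toy_sents for w in s]

print(average_lengths(toy_words, toy_sents))
```

#### Because this version sums token lengths directly, it does not count the spaces between words, unlike num_chars/num_words.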

### 1.2 Web and Chat Text

#### Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

In [16]:
from nltk.corpus import webtext

In [17]:
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65],'...')

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...


#### There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators.

#### The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information

#### The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). 

#### The filename contains the date, chatroom, and number of posts; e.g., 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.
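
#### The filename convention described above can be unpacked with a regular expression. This is a minimal sketch; the helper name parse_chat_fileid and the pattern are assumptions based only on the single example filename given here. The year is not part of the filename, so it is not returned.

```python
import re

# Pattern for names like "10-19-20s_706posts.xml":
# month, day, chatroom label, number of posts.
FILENAME_RE = re.compile(r'^(\d{2})-(\d{2})-(\w+?)_(\d+)posts\.xml$')


def parse_chat_fileid(fileid):
    """Split a chat file name into its month, day, room and post count."""
    match = FILENAME_RE.match(fileid)
    if match is None:
        raise ValueError(f'unexpected file name: {fileid!r}')
    month, day, room, posts = match.groups()
    return {'month': int(month), 'day': int(day),
            'room': room, 'posts': int(posts)}


info = parse_chat_fileid('10-19-20s_706posts.xml')
print(info)
```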

In [18]:
from nltk.corpus import nps_chat

In [19]:
chatroom=nps_chat.posts('10-19-20s_706posts.xml')

In [20]:
chatroom[123]

['i',
 'do',
 "n't",
 'want',
 'hot',
 'pics',
 'of',
 'a',
 'female',
 ',',
 'I',
 'can',
 'look',
 'in',
 'a',
 'mirror',
 '.']

### 1.3 Brown Corpus

#### The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on

#### Table 1.1 Example Document for Each Section of the Brown Corpus 

In [21]:
#ID 	File	Genre	Description
#A16	ca16	news	Chicago Tribune: Society Reportage
#B02	cb02	editorial	Christian Science Monitor: Editorials
#C17	cc17	reviews 	Time Magazine: Reviews
#D12	cd12	religion	Underwood: Probing the Ethics of Realtors
#E36	ce36	hobbies 	Norling: Renting a Car in Europe
#F25	cf25	lore	Boroff: Jewish Teenage Culture
#G22	cg22	belles_lettres	Reiner: Coping with Runaway Technology
#H15	ch15	government	US Office of Civil and Defence Mobilization: The Family Fallout Shelter
#J17	cj19	learned 	Mosteller: Probability with Statistical Applications
#K04	ck04	fiction 	W.E.B. Du Bois: Worlds of Color
#L13	cl13	mystery 	Hitchens: Footsteps in the Night
#M01	cm01	science_fiction	Heinlein: Stranger in a Strange Land
#N14	cn15	adventure	Field: Rattlesnake Ridge
#P12	cp12	romance 	Callaghan: A Passion in Rome
#R06	cr06	humor	Thurber: The Future, If Any, of Comedy

#### We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

In [22]:
from nltk.corpus import brown

In [23]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [24]:
brown.words(categories='news')

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [25]:
brown.words(fileids=['cg22'])

['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]

In [26]:
brown.sents(categories=['news','editorial','reviews'])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

#### The Brown Corpus is a convenient resource for studying systematic differences between genres. Let's compare genres in their usage of modal verbs.

#### The first step is to produce the counts for a particular genre.  Remember to import nltk before doing the following:

In [27]:
from nltk.corpus import brown

In [28]:
news_text=brown.words(categories='news')

In [29]:
fdist=nltk.FreqDist(w.lower() for w in news_text)

In [None]:
modals=['can','could','may','might','must','will']

In [None]:
for m in modals:
    print(m+':',fdist[m],end=' ')

#### Exercise 1: Choose a different section of the Brown Corpus (e.g. reviews), and adapt the previous example to count a selection of wh words, such as what, when, where, who, and why.

#### Next, we need to obtain counts for each genre of interest. We'll use NLTK's support for conditional frequency distributions. 

#### These are presented systematically in Section 2, where we also unpick the following code line by line. 
#### For the moment, you can ignore the details and just concentrate on the output.

In [None]:
cfd=nltk.ConditionalFreqDist(
        (genre,word)
         for genre in brown.categories()
         for word in brown.words(categories=genre))

In [None]:
genres=['news','religion','hobbies','science_fiction','romance','humor']

In [None]:
modals=['can','could','may','might','must','will']

In [None]:
cfd.tabulate(conditions=genres,samples=modals)

#### Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could.

### 1.4 Reuters Corpus

#### The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test"; thus, the text with fileid 'test/14826' is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document, as we will see in Chapter 6, Learning to Classify Text.

In [None]:
from nltk.corpus import reuters

In [None]:
reuters.fileids()

In [None]:
reuters.categories()

#### Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. 

In [None]:
reuters.categories('training/9865')

In [None]:
reuters.categories(['training/9865', 'training/9880'])

In [None]:
reuters.fileids('barley')

In [None]:
reuters.fileids(['barley', 'corn'])
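
#### The two-way lookup between documents and overlapping categories can be pictured as a pair of mappings. This is a sketch with made-up document names and topics, not real Reuters data, and the helper functions only imitate the style of the reader's categories() and fileids() methods.

```python
from collections import defaultdict

# Hypothetical document-to-topic assignments (not real Reuters data).
doc_topics = {
    'training/1': ['barley', 'corn'],
    'training/2': ['corn'],
    'test/3': ['barley', 'wheat'],
}

# Invert the mapping so we can also go from a topic to its documents.
topic_docs = defaultdict(list)
for doc, topics in doc_topics.items():
    for topic in topics:
        topic_docs[topic].append(doc)


def categories(fileids):
    """Topics covered by any of the given documents."""
    return sorted({t for f in fileids for t in doc_topics[f]})


def fileids(topics):
    """Documents that belong to any of the given topics."""
    return sorted({d for t in topics for d in topic_docs[t]})


print(categories(['training/1', 'test/3']))
print(fileids(['barley', 'corn']))
```

#### Because one document can carry several topics, the same document can appear under more than one category.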

#### Similarly, we can specify the words or sentences we want in terms of files or categories. The first handful of words in each of these texts are the titles, which by convention are stored as upper case.

In [None]:
reuters.words('training/9865')[:14]

In [None]:
reuters.words(['training/9865', 'training/9880'])

In [None]:
reuters.words(categories='barley')

In [None]:
reuters.words(categories=['barley', 'corn'])

### 1.5   Inaugural Address Corpus

#### In Chapter 1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in that chapter used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:

In [None]:
from nltk.corpus import inaugural

In [None]:
inaugural.fileids()

#### Notice that the year of each text appears in its filename. To get the year out of the filename, we extracted the first four characters, using fileid[:4].

In [None]:
[fileid[:4] for fileid in inaugural.fileids()]

#### Let's look at how the words America and citizen are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower(), then checks if they start with either of the "targets" america or citizen using startswith(). Thus it will count words like American's and Citizens.

#### The condition is either of the words america or citizen, and the counts being plotted are the number of times the word occurred in a particular speech. (See Slides)

In [None]:
import nltk

In [None]:
cfd = nltk.ConditionalFreqDist(
              (target,fileid[:4])
              for fileid in inaugural.fileids()
              for w in inaugural.words(fileid)
              for target in ['america', 'citizen']
              if w.lower().startswith(target)) 

In [None]:
a123=[(target,fileid[:4]) for fileid in inaugural.fileids() for w in inaugural.words(fileid) for target in ['america', 'citizen'] if w.lower().startswith(target)]

In [None]:
print(a123[0:9])

In [None]:
cfd1=nltk.ConditionalFreqDist(a123[0:9])

In [None]:
cfd1.tabulate()

In [None]:
cfd.tabulate()

In [None]:
from matplotlib import pyplot as plt

In [None]:
cfd.plot()

#### As we can see from the chart, the word citizen was used most frequently in 1841, and the word America was used most frequently in 2017.

### 1.6   Annotated Text Corpora (See Slides)

### 1.7   Text Corpus Structure (See Slides)

#### NLTK's corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora. The table below lists the functionality provided by the corpus readers. After the table, we illustrate the difference between some of the corpus access methods:

In [None]:
#Example	Description
#fileids()	the files of the corpus
#fileids([categories])	the files of the corpus corresponding to these categories
#categories()	the categories of the corpus
#categories([fileids])	the categories of the corpus corresponding to these files
#raw()	the raw content of the corpus
#raw(fileids=[f1,f2,f3])	the raw content of the specified files
#raw(categories=[c1,c2])	the raw content of the specified categories
#words()	the words of the whole corpus
#words(fileids=[f1,f2,f3])	the words of the specified fileids
#words(categories=[c1,c2])	the words of the specified categories
#sents()	the sentences of the whole corpus
#sents(fileids=[f1,f2,f3])	the sentences of the specified fileids
#sents(categories=[c1,c2])	the sentences of the specified categories
#abspath(fileid)	the location of the given file on disk
#encoding(fileid)	the encoding of the file (if known)
#open(fileid)	open a stream for reading the given corpus file
#root	the path to the root of the locally installed corpus
#readme()	the contents of the README file of the corpus

In [None]:
raw = gutenberg.raw("burgess-busterbrown.txt")

In [None]:
raw[1:20]

In [None]:
words = gutenberg.words("burgess-busterbrown.txt")

In [None]:
words[1:20]

In [None]:
sents = gutenberg.sents("burgess-busterbrown.txt")

In [None]:
sents[1:20]

### 1.8 Loading your own Corpus

#### If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader. 

In [None]:
from nltk.corpus import PlaintextCorpusReader

#### change to your own path

In [None]:
corpus_root= r"C:\Users\dengc\Dropbox\0.Jupyter\my corpus"

In [None]:
wordlists = PlaintextCorpusReader(corpus_root, '.*')

In [None]:
wordlists.fileids()

In [None]:
wordlists.words('text2.txt')

In [None]:
text2=wordlists.words('text2.txt')

In [None]:
len(text2)

In [None]:
text2[2]
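
#### A portable way to try this without a hard-coded Windows path is to build a throwaway corpus directory. The sketch below uses only the standard library; the file names, file contents, and the simple regular-expression tokenizer are assumptions for illustration and are not NLTK's own tokenizer.

```python
import re
import tempfile
from pathlib import Path

# Build a temporary corpus directory; names and contents are made up.
corpus_root = Path(tempfile.mkdtemp())
(corpus_root / 'text1.txt').write_text(
    'The quick brown fox jumps.', encoding='utf-8')
(corpus_root / 'text2.txt').write_text(
    'Corpora are collections of text.', encoding='utf-8')

# List the files in the corpus, like fileids().
fileids = sorted(p.name for p in corpus_root.glob('*.txt'))


def words(fileid):
    """Rough stand-in for a word tokenizer: word runs or single punctuation marks."""
    text = (corpus_root / fileid).read_text(encoding='utf-8')
    return re.findall(r'\w+|[^\w\s]', text)


third_word = words('text2.txt')[2]
print(fileids, third_word)
```

#### Passing str(corpus_root) and a file pattern such as r'.*\.txt' to PlaintextCorpusReader should expose the same two files through fileids().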

#### Exercise 2: Could you please make your own corpus and retrieve the third word from any one of the files in your corpus? 

#### Q & A

### 2 Conditional Frequency Distributions

#### When the texts of a corpus are divided into several categories, by genre, topic, author, etc, we can maintain separate frequency distributions for each category. 

#### This will allow us to study systematic differences between the categories

#### A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition". The condition will often be the category of the text. 

### 2.1 Conditions and Events

#### A conditional frequency distribution needs to pair each event with a condition. So instead of processing a sequence of words, we have to process a sequence of pairs

In [None]:
text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [None]:
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]

#### Each pair has the form (condition, event). If we were processing the entire Brown Corpus by genre there would be 15 conditions (one per genre), and 1,161,192 events (one per word).
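
#### To make the (condition, event) idea concrete, here is a minimal plain-Python sketch of the counting that a conditional frequency distribution performs. It is not NLTK's implementation; the romance pairs are made up, and only the first news pairs echo the example above.

```python
from collections import Counter, defaultdict

# A short list of (condition, event) pairs; the romance pairs are hypothetical.
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'The'),
         ('romance', 'They'), ('romance', 'could'), ('romance', 'could')]

# One Counter (frequency distribution) per condition.
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

print(sorted(cfd))             # the conditions
print(cfd['romance']['could'])  # count of one event under one condition
```

#### Each condition maps to its own frequency distribution, so the number of conditions is small while the number of events equals the number of pairs.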

### 2.2 Counting Words by Genre

#### In Section 1 we saw a conditional frequency distribution where the condition was the section of the Brown Corpus, and for each condition we counted words. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.

In [None]:
from nltk.corpus import brown

In [None]:
cfd = nltk.ConditionalFreqDist(
         (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))

#### Let's break this down, and look at just two genres, news and romance. For each genre, we loop over every word in the genre, producing pairs consisting of the genre and the word

In [None]:
genre_word = [(genre, word) 
            for genre in ['news', 'romance'] 
            for word in brown.words(categories=genre)] 

In [None]:
len(genre_word)

#### So, as we can see below, pairs at the beginning of the list genre_word will be of the form ('news', word), while those at the end will be of the form ('romance', word)

In [None]:
genre_word[:4]

In [None]:
genre_word[-4:]

#### We can now use this list of pairs to create a ConditionalFreqDist, and save it in a variable cfd. As usual, we can type the name of the variable to inspect it, and verify it has two conditions

In [None]:
cfd=nltk.ConditionalFreqDist(genre_word)

In [None]:
cfd

In [None]:
cfd.conditions()

#### Let's access the two conditions, and satisfy ourselves that each is just a frequency distribution:

In [None]:
print(cfd['news'])

In [None]:
print(cfd['romance'])

In [None]:
cfd['romance'].most_common(20)

In [None]:
cfd['romance']['could']

#### Exercise 3: Working with the news and romance genres from the Brown Corpus, find out which days of the week are most newsworthy, and which are most romantic. Define a variable called days containing a list of days of the week, i.e. ['Monday', ...]. Now tabulate the counts for these words using cfd.tabulate(samples=days). 

### 2.3 Generating Random Text with Bigrams

#### We can use a conditional frequency distribution to create a table of bigrams (word pairs). (We introduced bigrams in Chapter 1.) The bigrams() function takes a list of words and builds a list of consecutive word pairs.

In [None]:
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
   'and', 'the', 'earth', '.']

In [None]:
list_sent=list(nltk.bigrams(sent))

In [None]:
list_sent

In [None]:
cfd = nltk.ConditionalFreqDist(list_sent)

In [None]:
cfd.conditions()

In [None]:
cfd.tabulate()

In [None]:
cfd["the"]
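
#### For intuition, consecutive word pairs can also be built in plain Python by zipping a list with itself shifted by one position. This is a sketch of the idea rather than NLTK's own code; the variable name following is an assumption.

```python
from collections import Counter, defaultdict

sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
        'and', 'the', 'earth', '.']

# Pair each word with the word that follows it.
pairs = list(zip(sent, sent[1:]))

# Treat the first word of each pair as the condition, the second as the event.
following = defaultdict(Counter)
for first, second in pairs:
    following[first][second] += 1

print(pairs[:3])
print(following['the'])
```

#### A list of n words yields n-1 bigrams, and each distinct first word becomes one condition.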

#### In the code below, we treat each word as a condition, and for each one we effectively create a frequency distribution over the following words. The function generate_model() contains a simple loop to generate text. When we call the function, we choose a word (such as 'living') as our initial context, then once inside the loop, we print the current value of the variable word, and reset word to be the most likely token in that context (using max()); next time through the loop, we use that word as our new context. 

#### As you can see by inspecting the output, this simple approach to text generation tends to get stuck in loops

In [None]:
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

In [None]:
cfd['living']

In [None]:
cfd['creature']

In [None]:
cfd['that']

In [None]:
cfd['he']

In [None]:
cfd['said']

In [None]:
cfd[',']

In [None]:
cfd['and']

In [None]:
cfd['the']

In [None]:
cfd['land']

In [None]:
cfd['of']

In [None]:
cfd['the']

In [None]:
cfd['land']

In [None]:
generate_model(cfd, 'living')

#### Generating Random Text: this program obtains all bigrams from the text of the book of Genesis, then constructs a conditional frequency distribution to record which words are most likely to follow a given word; e.g., after the word living, the most likely word is creature; the generate_model() function uses this data, and a seed word, to generate random text.
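
#### Since always taking the most likely next word tends to get stuck in loops, one common alternative is to sample the next word in proportion to its bigram count. The sketch below is a swapped-in variant, not the book's generate_model(); the function names, the toy text, and the seed parameter are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def build_model(words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        model[first][second] += 1
    return model


def generate_sampled(model, word, num=15, seed=0):
    """Generate up to num words, sampling each successor by its count."""
    rng = random.Random(seed)
    output = []
    for _ in range(num):
        output.append(word)
        followers = model.get(word)
        if not followers:  # dead end: no observed successor
            break
        candidates = list(followers)
        weights = [followers[c] for c in candidates]
        word = rng.choices(candidates, weights=weights)[0]
    return output


# Toy text (hypothetical); a real run would use the Genesis word list.
toy_text = 'the cat sat on the mat and the dog sat on the cat'.split()
model = build_model(toy_text)
sample = generate_sampled(model, 'the', num=8)
print(' '.join(sample))
```

#### Fixing the seed makes a run repeatable, while changing it gives a different sequence from the same model.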

#### Conditional frequency distributions are a useful data structure for many NLP tasks. Their commonly-used methods are summarized in the table below.

In [None]:
#Example	Description
#cfdist = ConditionalFreqDist(pairs)	create a conditional frequency distribution from a list of pairs
#cfdist.conditions()	the conditions
#cfdist[condition]	the frequency distribution for this condition
#cfdist[condition][sample]	frequency for the given sample for this condition
#cfdist.tabulate()	tabulate the conditional frequency distribution
#cfdist.tabulate(samples, conditions)	tabulation limited to the specified samples and conditions
#cfdist.plot()	graphical plot of the conditional frequency distribution
#cfdist.plot(samples, conditions)	graphical plot limited to the specified samples and conditions
#cfdist1 < cfdist2	test if samples in cfdist1 occur less frequently than in cfdist2

#### Q & A

### 3. More Python: Reusing Code

### 3.1   Creating Programs with Python

In [None]:
# Requires a module file named monty.py in the working directory (or on the Python path)
from monty import *

### 3.2   Functions (slides)

In [None]:
def lexical_diversity(my_text_data):
    word_count = len(my_text_data)
    vocab_size = len(set(my_text_data))
    diversity_score = vocab_size / word_count
    return diversity_score

#### Notice that we've created some new variables inside the body of the function. These are local variables and are not accessible outside the function. So now we have defined a function with the name lexical_diversity. But just defining it won't produce any output! Functions do nothing until they are "called" (or "invoked"):

In [None]:
from nltk.corpus import genesis

In [None]:
kjv = genesis.words('english-kjv.txt')

In [None]:
lexical_diversity(kjv)

#### Let's return to our earlier scenario, and actually define a simple function to work out English plurals. The function plural() below takes a singular noun and generates a plural form, though it is not always correct.

In [None]:
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

In [None]:
plural('party')

In [None]:
plural('woman')

In [None]:
plural('dish')

### 3.3   Modules

#### Over time you will find that you create a variety of useful little text processing functions, and you end up copying them from old programs to new ones.  It makes life a lot easier if you can collect your work into a single place, and access previously defined functions without making copies.

#### To do this, save your function(s) in a file called (say) text_proc.py. Now, you can access your work simply by importing it from the file:

In [None]:
from text_proc import plural

In [None]:
plural('wish')

In [None]:
plural('fan')

#### Our plural function obviously has an error, since the plural of fan is fans. Instead of typing in a new version of the function, we can simply edit the existing one. Thus, at every stage, there is only one version of our plural function, and no confusion about which one is being used.
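
#### One possible edit, shown only as a sketch, is to narrow the 'an' rule so that it applies to words ending in 'man' rather than to every word ending in 'an'. This is an assumption about how the function might be repaired, not the book's fix, and it is still imperfect (for example, it would turn human into humen).

```python
def plural(word):
    """Return a guessed plural form of a singular English noun."""
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('man'):
        # Narrowed rule: man -> men, woman -> women, but fan -> fans.
        return word[:-2] + 'en'
    else:
        return word + 's'


for noun in ['fan', 'man', 'woman', 'wish']:
    print(noun, plural(noun))
```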

#### A collection of variable and function definitions in a file is called a Python module. 

#### A collection of related modules is called a package.

#### NLTK's code for processing the Brown Corpus is an example of a module, and its collection of code for processing all the different corpora is an example of a package. 

#### NLTK itself is a set of packages, sometimes called a library

#### Q & A

## 4. Lexical Resources (Slides)

### 4.1 Wordlist Corpora

#### NLTK includes some corpora that are nothing more than wordlists. The Words Corpus is used by some spell checkers.  We can use it to find unusual or mis-spelt words in a text corpus

In [None]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

In [None]:
unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))

#### Stopwords

#### There is also a corpus of stopwords, that is, high-frequency words like the, to and also that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

In [None]:
from nltk.corpus import stopwords

In [None]:
stopwords.words('english')

#### Let's define a function to compute what fraction of words in a text are not in the stopwords list:

In [None]:
def content_fraction(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes the membership test fast
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

In [None]:
content_fraction(nltk.corpus.reuters.words())

#### Name corpus

#### One more wordlist corpus is the Names corpus, containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find names which appear in both files, i.e. names that are ambiguous for gender:

In [None]:
names = nltk.corpus.names

In [None]:
names.fileids()

In [None]:
male_names=names.words('male.txt')

In [None]:
female_names=names.words('female.txt')

In [None]:
[w for w in male_names if w in female_names]
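
#### Checking membership in a list scans the whole list for every name, so with several thousand names per file the comprehension above can be slow. A sketch of the same idea using a set for the lookup follows; the short name lists are made up and stand in for names.words('male.txt') and names.words('female.txt').

```python
# Hypothetical name lists standing in for the two corpus files.
male_names = ['Alex', 'Chris', 'John', 'Lee']
female_names = ['Alex', 'Chris', 'Lee', 'Mary']

# Build the set once, then test each male name against it.
female_set = set(female_names)
ambiguous = [name for name in male_names if name in female_set]

print(ambiguous)
```

#### Keeping the list comprehension preserves the order of the male list, while the set only speeds up the membership test.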

In [None]:
cfd = nltk.ConditionalFreqDist(
           (fileid, name[-1])
           for fileid in names.fileids()
           for name in names.words(fileid))

In [None]:
cfd.plot()

#### Conditional Frequency Distribution: this plot shows the number of female and male names ending with each letter of the alphabet; most names ending with a, e or i are female; names ending in h and l are equally likely to be male or female; names ending in k, o, r, s, and t are likely to be male.

#### Q & A

### 5. WordNet (Slides)

### 5.1 Senses and Synonyms

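#### Consider the sentence "Benz is credited with the invention of the motorcar." If we replace the word motorcar with automobile, the meaning of the sentence stays pretty much the same. Since everything else in the sentence has remained unchanged, we can conclude that the words motorcar and automobile have the same meaning, i.e. they are synonyms.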

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
wn.synsets('motorcar')

#### Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. The entity car.n.01 is called a synset, or "synonym set", a collection of synonymous words (or "lemmas"):

In [None]:
wn.synset('car.n.01').lemma_names()

#### Each word of a synset can have several meanings, e.g., car can also signify a train carriage, or an elevator car. However, we are only interested in the single meaning that is common to all words of the above synset. Synsets also come with a prose definition and some example sentences:

In [None]:
wn.synset('car.n.01').definition()

In [None]:
wn.synset('car.n.01').examples()

#### Unlike the word motorcar, which is unambiguous and has one synset, the word car is ambiguous, having five synsets:

In [None]:
wn.synsets('car')

In [None]:
for synset in wn.synsets('car'):
    print(synset.lemma_names())

#### Exercise 4: Write down all the senses of the word dish that you can think of. Now, explore this word with the help of WordNet, using the same operations we used above.

### 5.2 The WordNet Hierarchy (Slides)

#### WordNet makes it easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific; the (immediate) hyponyms

In [None]:
motorcar=wn.synset('car.n.01')

In [None]:
types_of_motorcar=motorcar.hyponyms()

In [None]:
types_of_motorcar

In [None]:
types_of_motorcar[0]

#### We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified as both a vehicle and a container.

In [None]:
hyper=motorcar.hypernyms()

In [None]:
hyper

In [None]:
paths = motorcar.hypernym_paths()

In [None]:
len(paths)

In [None]:
[synset.name() for synset in paths[0]]

In [None]:
[synset.name() for synset in paths[1]]
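
#### Why a single concept can have more than one hypernym path is easier to see on a tiny graph. The sketch below uses a made-up, heavily simplified hierarchy (the real WordNet paths are much longer) in which one node has two parents; the dictionary and function names are assumptions for illustration.

```python
# Hypothetical, simplified hypernym graph: child -> list of parents.
hypernyms = {
    'car': ['wheeled_vehicle'],
    'wheeled_vehicle': ['vehicle', 'container'],
    'vehicle': ['entity'],
    'container': ['entity'],
    'entity': [],
}


def hypernym_paths(node):
    """Return every path from the root down to node, root first."""
    parents = hypernyms[node]
    if not parents:
        return [[node]]
    return [path + [node]
            for parent in parents
            for path in hypernym_paths(parent)]


paths = hypernym_paths('car')
for path in paths:
    print(path)
```

#### Every node with two parents doubles the number of paths that pass through it, which is why car ends up with two routes to the root in this toy graph.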

#### We can get the most general hypernyms (or root hypernyms) of a synset as follows:

In [None]:
motorcar.root_hypernyms()

### 5.3   More Lexical Relations

#### Hypernyms and hyponyms are called lexical relations because they relate one synset to another. These two relations navigate up and down the "is-a" hierarchy. Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). For example, the parts of a tree are its trunk, crown, and so on; the part_meronyms(). The substance a tree is made of includes heartwood and sapwood; the substance_meronyms(). A collection of trees forms a forest; the member_holonyms():

In [None]:
wn.synset('tree.n.01').part_meronyms()

In [None]:
wn.synset('tree.n.01').substance_meronyms()

In [None]:
wn.synset('tree.n.01').member_holonyms()

#### To see just how intricate things can get, consider the word mint, which has several closely-related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.

In [None]:
for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name() + ':', synset.definition())

In [None]:
wn.synset('mint.n.04').part_holonyms()

In [None]:
wn.synset('mint.n.04').substance_holonyms()

#### There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:

In [None]:
wn.synset('walk.v.01').entailments()

In [None]:
wn.synset('eat.v.01').entailments()

In [None]:
wn.synset('tease.v.03').entailments()

#### Some lexical relationships hold between lemmas, e.g., antonymy:

In [None]:
wn.lemma('supply.n.02.supply').antonyms()

In [None]:
wn.lemma('rush.v.01.rush').antonyms()

In [None]:
wn.lemma('horizontal.a.01.horizontal').antonyms()

In [None]:
wn.lemma('staccato.r.01.staccato').antonyms()

### 5.4   Semantic Similarity

#### Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common (cf 5.1). If two synsets share a very specific hypernym — one that is low down in the hypernym hierarchy — they must be closely related.

In [None]:
right = wn.synset('right_whale.n.01')

In [None]:
right

In [None]:
orca = wn.synset('orca.n.01')

In [None]:
minke = wn.synset('minke_whale.n.01')

In [None]:
tortoise = wn.synset('tortoise.n.01')

In [None]:
novel = wn.synset('novel.n.01')

In [None]:
right.lowest_common_hypernyms(minke)

In [None]:
right.lowest_common_hypernyms(orca)

In [None]:
right.lowest_common_hypernyms(tortoise)

In [None]:
right.lowest_common_hypernyms(novel)

#### Of course we know that whale is very specific (and baleen whale even more so), while vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:

In [None]:
wn.synset('baleen_whale.n.01').min_depth()

In [None]:
wn.synset('whale.n.02').min_depth()

In [None]:
wn.synset('vertebrate.n.01').min_depth()

In [None]:
wn.synset('entity.n.01').min_depth()
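
#### The ideas of a lowest common hypernym and of depth can be sketched on a toy taxonomy. The tree below is made up and far shallower than WordNet, so its depths do not match the values returned by min_depth(); the names parent, ancestors, depth and lowest_common_hypernym are assumptions for illustration.

```python
# Hypothetical, simplified taxonomy: child -> parent (None marks the root).
parent = {
    'entity': None,
    'vertebrate': 'entity',
    'whale': 'vertebrate',
    'baleen_whale': 'whale',
    'right_whale': 'baleen_whale',
    'minke_whale': 'baleen_whale',
    'orca': 'whale',
    'tortoise': 'vertebrate',
    'novel': 'entity',
}


def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def depth(node):
    """Number of steps from node up to the root."""
    return len(ancestors(node)) - 1


def lowest_common_hypernym(a, b):
    """First ancestor of b (walking upward) that is also an ancestor of a."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node


for other in ['minke_whale', 'orca', 'tortoise', 'novel']:
    shared = lowest_common_hypernym('right_whale', other)
    print(other, shared, depth(shared))
```

#### The deeper the shared hypernym sits in the tree, the more closely related the two concepts are.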

#### Similarity measures have been defined over the collection of WordNet synsets which incorporate the above insight. For example, path_similarity assigns a score in the range 0–1 based on the shortest path that connects the concepts in the hypernym hierarchy (in current NLTK releases None is returned when no path can be found; the book describes a return value of -1). Comparing a synset with itself will return 1. Consider the following similarity scores, relating right whale to minke whale, orca, tortoise, and novel.

In [None]:
right.path_similarity(minke)

In [None]:
right.path_similarity(orca)

In [None]:
right.path_similarity(tortoise)

In [None]:
right.path_similarity(novel)
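
#### A path-based score can be sketched on a toy tree, assuming the score is defined as 1 / (1 + length of the shortest connecting path). The taxonomy below is made up and much shallower than WordNet, so the numbers only illustrate the ordering of the scores and will not match the values returned by the cells above.

```python
# Hypothetical, simplified taxonomy: child -> parent (None marks the root).
parent = {
    'entity': None,
    'vertebrate': 'entity',
    'whale': 'vertebrate',
    'baleen_whale': 'whale',
    'right_whale': 'baleen_whale',
    'minke_whale': 'baleen_whale',
    'orca': 'whale',
    'tortoise': 'vertebrate',
    'novel': 'entity',
}


def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def path_similarity(a, b):
    """Score 1 / (1 + shortest path length) between two nodes of the tree."""
    steps_from_b = {node: i for i, node in enumerate(ancestors(b))}
    for steps_from_a, node in enumerate(ancestors(a)):
        if node in steps_from_b:
            return 1 / (1 + steps_from_a + steps_from_b[node])
    return None  # no shared ancestor


for other in ['right_whale', 'minke_whale', 'orca', 'tortoise', 'novel']:
    print(other, path_similarity('right_whale', other))
```

#### The score falls as the connecting path gets longer, so the minke whale scores highest and the novel lowest, which is the same ordering the text describes.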