# Exploring the NLTK Book (Chapter 2)

This notebook is based on the exercises described in [NLTK Book, Chapter 2: Accessing Text Corpora and Lexical Resources](https://www.nltk.org/book/ch02.html).
<footer style="text-align:right;font-size:.8em;">Source: Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. http://nltk.org/book</footer>


* Corpora in NLTK
* Useful Python constructs
* Some Python shortcuts

Resources:
* [NLTK documentation](https://www.nltk.org/)
* [NLTK API Documentation](https://www.nltk.org/api/nltk.html)
* [Available NLTK data and corpora](http://www.nltk.org/nltk_data/)
* [Corpus HOWTO at NLTK.org](http://www.nltk.org/howto/corpus.html)
* [PyCharm](https://www.jetbrains.com/pycharm/download/#section=mac) An Integrated Development Environment (IDE) for working with Python, with a greater focus on writing Python code.
* [BootCat](https://bootcat.dipintra.it/) Tool for gathering text from the internet.
* [brat rapid annotation tool](http://brat.nlplab.org/index.html)
* [Linguistic Data Consortium](https://www.ldc.upenn.edu/) A source for corpora.
* [European Language Resources Association](http://portal.elda.org/en/) Many useful language resources.
* [WordNet](https://wordnet.princeton.edu/) Provides an online database for searching WordNet synsets.
* [Global WordNet](http://globalwordnet.org/) Provides information on, and sharing of, wordnet corpora in many languages.
* [Ethnologue](https://www.ethnologue.com/) Information about every known living language of the world.

## The Gutenberg Corpus

A **corpus** is a large collection of texts or documents.

[Project Gutenberg](https://www.gutenberg.org/) is a large collection (over 57,000) of digitized texts whose copyright has expired, so they are freely available.

NLTK includes several corpus resources as part of its corpus module.
To begin, we need to download the data. We will only download the Gutenberg corpus, but there are many more corpora in the corpus module. It is common to download only the portions that are necessary. When we start the downloader you can see all the options.

In [None]:
# import nltk and open the downloader to get the Gutenberg corpus texts
import nltk
nltk.download()
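
Calling `nltk.download()` with no arguments opens the interactive downloader. As a minimal sketch, a single corpus can also be fetched by name, assuming the identifier `gutenberg` as listed in the downloader. The helper name `ensure_corpus` is ours for illustration and is not part of NLTK.

```python
import nltk


def ensure_corpus(name):
    """Download the named corpus only if it is not already installed.

    `ensure_corpus` is a hypothetical helper; `name` is the identifier
    shown in the NLTK downloader (for example 'gutenberg').
    """
    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(f"corpora/{name}")
        return True
    except LookupError:
        # quiet=True suppresses the progress output.
        return nltk.download(name, quiet=True)


# Example call (needs network access the first time it runs):
# ensure_corpus("gutenberg")
```

This keeps the notebook runnable on a fresh machine without opening the downloader window each time.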

Now we can access the gutenberg corpus. Observe that every text in the corpus has a `fileid`. We use this when we want to access the text.

In [None]:
# Import the corpus under a short alias to make referring to it easier.
# Note that aliases can make a notebook harder for others to read when sharing.
from nltk.corpus import gutenberg as gtb
gtb.fileids()

First let's pick one of these texts and take a quick look at it. 

In [None]:
paradise_lost = nltk.corpus.gutenberg.words('milton-paradise.txt')
len(paradise_lost)

In [None]:
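# This call is expected to raise an AttributeError: words() returns a corpus view,
# not a Text, so it has no concordance() method.
paradise_lost.concordance('alone')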

You can see a difference between the corpus module and the book module that we used earlier: the corpus module returns a different data type. If we want to use the same methods we used in the book module, we need to convert the text to the correct data type, in this case the `Text` class.

In [None]:
paradise_lost.citation()

In [None]:
type(paradise_lost)

To use the same viewer we used in the book module, we need to convert the Gutenberg text into the `Text` data type. Wrap the Gutenberg words in an NLTK `text.Text` so we can work with it as we did with the book collection in Chapter 1.

In [None]:
paradise_lost_Text = nltk.Text(nltk.corpus.gutenberg.words('milton-paradise.txt'))
paradise_lost_Text.concordance('alone')
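
The same conversion works for any list of tokens, not only corpus views. A small sketch with made-up tokens (the sentence and the name `toy_text` are hypothetical) shows what wrapping in `nltk.Text` adds:

```python
import nltk

# Hypothetical toy tokens; any list of strings can be wrapped in nltk.Text.
tokens = "I walk alone , and alone I remain .".split()
toy_text = nltk.Text(tokens)

print(type(toy_text).__name__)   # Text
print(toy_text.count("alone"))   # 2
toy_text.concordance("alone")    # available because Text provides it
```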

Let's continue to work with texts from the Gutenberg corpus.

In [None]:
gtb.fileids()

We can write a short loop to look at some of the characteristics of the Gutenberg texts. This will show us the average word length, the average number of words per sentence, and how many times each vocabulary item is used on average.

In [None]:
for file in gtb.fileids():
    # the number of characters in the file (spaces included)
    char = len(gtb.raw(file))
    # the number of words in the file
    words = len(gtb.words(file))
    # the number of sentences in the file
    sent = len(gtb.sents(file))
    # the number of unique words, ignoring case (i.e. no repeats)
    vocab = len(set(w.lower() for w in gtb.words(file)))
    print(file
          + '\n  avg word length: ' + str(round(char/words))
          + '\n  avg sentence length: ' + str(round(words/sent))
          + '\n  times vocab used: ' + str(round(words/vocab)))
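
To see what each ratio measures, here is a sketch of the same three calculations on a tiny hand-tokenized text. The function name `text_stats` and the sample sentences are hypothetical; the values are left unrounded so the arithmetic is visible.

```python
def text_stats(raw, sents):
    """Return the three ratios from the loop above, unrounded.

    raw   -- the untokenized text (what raw() returns)
    sents -- a list of token lists (what sents() returns)
    """
    words = [w for sent in sents for w in sent]
    vocab = {w.lower() for w in words}
    return {
        # len(raw) counts spaces and newlines too, so this overstates
        # the true average word length.
        "avg_word_length": len(raw) / len(words),
        "avg_sentence_length": len(words) / len(sents),
        "times_vocab_used": len(words) / len(vocab),
    }


# Hypothetical toy input, tokenized by hand.
raw = "The cat sat. The cat ran away."
sents = [["The", "cat", "sat", "."], ["The", "cat", "ran", "away", "."]]
stats = text_stats(raw, sents)
print(stats)
```

Because the character count includes whitespace, the reported average word length is roughly one character higher than the real figure.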

In [None]:
# Notice that raw() returns the untokenized version of the text. Compare it with words() in the next cell.
print(gtb.raw('blake-poems.txt'))

In [None]:
print(gtb.words('blake-poems.txt'))

In [None]:
#find the longest sentence in a text:
hamlet = gtb.sents('shakespeare-hamlet.txt')
longest_sent = max(len(s) for s in hamlet)
[s for s in hamlet if len(s) == longest_sent]

In [None]:
gtb.sents('shakespeare-hamlet.txt')

In [None]:
type(hamlet)

You have seen that the Gutenberg corpus reader has the following access methods: `raw()`, `words()` and `sents()`. Other corpora typically have these methods and may also provide more detailed access, such as part-of-speech tags.

## Web and Chat Corpus

### web
This corpus is less formal prose and more representative of common language usage. It is made up of content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews.

In [None]:
from nltk.corpus import webtext

In [None]:
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

In [None]:
grail = webtext.sents('grail.txt')
print(grail)

In [None]:
singles = webtext.raw('singles.txt')
print(singles)

### chat
NLTK contains a corpus of instant messaging chat sessions "originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators." The corpus contains over 10,000 posts that have been anonymized. Each file contains the chats from an age-specific chat room on a specific date. Some metadata is contained in the filename: date, chatroom, and number of posts (e.g. 10-19-20s_706posts.xml --> Date = 10-19, chatroom = 20's, number of posts = 706).

In [None]:
from nltk.corpus import nps_chat
for file in nps_chat.fileids():
    print(file)
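
The filename convention described above can be unpacked with a regular expression. This is a sketch that assumes every file ID follows the `date-chatroom_Nposts.xml` pattern; the helper name `parse_chat_fileid` is hypothetical.

```python
import re

# Pattern for names such as '10-19-20s_706posts.xml':
# date (MM-DD), chatroom label, number of posts.
FILENAME_RE = re.compile(r"^(\d{2}-\d{2})-([^_]+)_(\d+)posts\.xml$")


def parse_chat_fileid(fileid):
    """Split an NPS Chat file ID into (date, chatroom, number_of_posts)."""
    match = FILENAME_RE.match(fileid)
    if match is None:
        raise ValueError(f"unexpected file ID format: {fileid!r}")
    date, chatroom, posts = match.groups()
    return date, chatroom, int(posts)


print(parse_chat_fileid("10-19-20s_706posts.xml"))  # ('10-19', '20s', 706)
```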

In [None]:
chat = nps_chat.posts('10-19-adults_706posts.xml')

In [None]:
len(chat)

In [None]:
chat[100:120]

Note: 'U##' is the anonymized username.

## Brown Corpus
"The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on." 

Sample:

|ID |File	|Genre	|Description|
|----|----|---------|---------|
|A16	|ca16	|news	|Chicago Tribune: Society Reportage|
|B02	|cb02	|editorial	|Christian Science Monitor: Editorials|
|C17	|cc17	|reviews	|Time Magazine: Reviews|
|D12	|cd12	|religion	|Underwood: Probing the Ethics of Realtors|
|E36	|ce36	|hobbies	|Norling: Renting a Car in Europe|

For a complete genre list, see http://icame.uib.no/brown/bcm-los.html.

For a more detailed overview, see [The Brown Corpus Manual](http://www.hit.uib.no/icame/brown/bcm.html).

First let's import the Brown Corpus and see what the categories are.

In [None]:
from nltk.corpus import brown
brown.categories()

We can also get a list of the files. 

In [None]:
brown.fileids()

We can get a collection of texts based on the category

In [None]:
mystery = brown.fileids(categories='mystery')

In [None]:
print(mystery)

We can also get the text of these files.

In [None]:
mystery_text=brown.words(categories='mystery')

In [None]:
print(len(mystery_text))
print(mystery_text)

There are many ways to access the text in the Brown corpus.

In [None]:
brown.words(fileids=['cg17'])

In [None]:
brown.sents(categories=['news', 'editorial', 'reviews'])

One interesting exercise to do with this Corpus is to look at word use across different categories using Frequency Distribution.

In [None]:
news_text = brown.words(categories='news')
fq_dist = nltk.FreqDist(w.lower() for w in news_text)
word_list = ['death','freedom','space','alien']
for w in word_list:
    print(w + ':', fq_dist[w])

This is a good place to introduce the Conditional Frequency Distribution. It lets us look at the distribution of words across a corpus under specific conditions. In this case the condition is the genre in which a word appears.

In [None]:
con_fq_dist = nltk.ConditionalFreqDist((genre,word)
                   for genre in brown.categories()
                   for word in brown.words(categories=genre))
genre_list=['science_fiction','news']
word_list = ['alien','death','freedom']
con_fq_dist.tabulate(conditions=genre_list,samples=word_list)

## Reuters Corpus

This is a collection of news articles from Reuters. The documents are classified into 90 topics, and the collection is divided into training and test files for evaluating automatic document classification algorithms. This begins to get into the area of machine learning, which we will discuss in more detail later.


In [None]:
from nltk.corpus import reuters 

In [None]:
reuters.fileids()

In [None]:
reuters.categories()

In [None]:
# We can find texts based on category.
reuters.fileids('yen')

In [None]:
reuters.words('training/9946')[:40]

In [None]:
reuters.sents('training/9946')

## Inaugural Address Corpus

This is a collection of 55 inaugural addresses. This collection is interesting because it contains temporal data that can be used to look at changes in the text across time.

In [None]:
from nltk.corpus import inaugural

In [None]:
# Note the file name contains the year of the speech and the name of the president
inaugural.fileids()

In [None]:
inaugural.words('2009-Obama.txt')

The year of each text appears in the file ID. To get the year you must extract it from the file ID.

In [None]:
years = [file[:4] for file in inaugural.fileids()]
years

Now we can use the Conditional Frequency Distribution again to get the freq distribution of particular words in each year

In [None]:
# plot the use of words in speeches over time.
con_fq_dist = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
        for w in inaugural.words(fileid)
            for target in ['war','freedom']
            if w.lower().startswith(target))

%matplotlib inline
con_fq_dist.plot()
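
One thing to keep in mind when reading the plot: `startswith` counts every token that begins with the target string, so inflected forms and unrelated words are included. A sketch with made-up tokens (the `speeches` data is hypothetical, not taken from the corpus) makes this visible:

```python
from collections import Counter

# Hypothetical tokens standing in for two speeches, keyed by year.
speeches = {
    "1801": ["Freedom", "of", "religion", ";", "freedom", "of", "the", "press"],
    "1945": ["the", "war", "ended", "wars", ";", "a", "warm", "welcome"],
}
targets = ["war", "freedom"]

# Same (target, year) pairing as in the ConditionalFreqDist cell.
counts = Counter(
    (target, year)
    for year, tokens in speeches.items()
    for w in tokens
    for target in targets
    if w.lower().startswith(target)
)

print(counts[("freedom", "1801")])  # 2
print(counts[("war", "1945")])      # 3: 'war', 'wars', and also 'warm'
```

If exact matches are wanted, comparing with `w.lower() == target` is one alternative, at the cost of missing plurals.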

## Load your own corpus

If you have a collection of files, you can load it using NLTK's `PlaintextCorpusReader`. I have created a collection of text from the internet by scraping pages about funny cat videos using [BootCat](https://bootcat.dipintra.it/). Import this corpus and use the tools we have looked at to explore what it might contain. Full disclosure: I have not looked at it closely, so I cannot tell you anything about it either.

In [None]:
from nltk.corpus import PlaintextCorpusReader
root = "cat_corpus"
cats = PlaintextCorpusReader(root, '.*')
cats.fileids()

In [None]:
cats.words('80.txt')[:100]

In [None]:
cats.sents('01.txt')

Try reusing code from earlier examinations and think about:

* Is there anything you wish you had a function for?
* What types of features are most useful?
* Is there anything missing from the data you wish you had?

In [None]:
# Can we adapt this code from the Gutenberg corpus to give us a broad overview of the cat corpus?
for file in gtb.fileids():
    # the number of characters in the file (spaces included)
    char = len(gtb.raw(file))
    # the number of words in the file
    words = len(gtb.words(file))
    # the number of sentences in the file
    sent = len(gtb.sents(file))
    # the number of unique words, ignoring case (i.e. no repeats)
    vocab = len(set(w.lower() for w in gtb.words(file)))
    print(file
          + '\n  avg word length: ' + str(round(char/words))
          + '\n  avg sentence length: ' + str(round(words/sent))
          + '\n  times vocab used: ' + str(round(words/vocab)))

## Conditional Frequency Distributions

We used the Conditional Frequency Distributions utility earlier. Let's go back to that and see how it works.

This method of analysis generates a separate frequency distribution for each condition we set out, so we can categorize and compare frequency distributions across conditions. For example, we could compare the frequency of color words in different genres of text from the Brown Corpus: how often are colors mentioned in romance compared to science fiction?

To begin we must understand that in conditional frequency we are not just comparing a list of words:

`['red','green','blue','gold','yellow']`

But we are looking at the frequency of a pair of terms, the condition and the term:

`[('romance','red'),('romance','green'),('romance','blue'),('romance','gold'),('romance','yellow')]`
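
One way to picture what happens to these pairs is that a separate count is kept for each condition. The sketch below imitates that with the standard library only, using made-up pairs; it illustrates the idea and is not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs, in the same shape as the list above.
pairs = [
    ("romance", "red"), ("romance", "gold"), ("romance", "red"),
    ("science_fiction", "red"), ("science_fiction", "blue"),
]

# One Counter per condition, created on first use.
cfd_sketch = defaultdict(Counter)
for condition, sample in pairs:
    cfd_sketch[condition][sample] += 1

print(sorted(cfd_sketch))                      # ['romance', 'science_fiction']
print(cfd_sketch["romance"]["red"])            # 2
print(cfd_sketch["science_fiction"]["gold"])   # 0: unseen samples count as zero
```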

Let's try this with the Brown Corpus.

In [None]:
from nltk.corpus import brown
brown.categories()

In [None]:
# create a word list based on our criteria of romance and science fiction:
genre_wordlist = [(genre,word)
                 for genre in ['romance', 'science_fiction']
                 for word in brown.words(categories=genre)]

In [None]:
# let's look at the list we create.
# notice that we have a list of tuples.
genre_wordlist[:6], genre_wordlist[-6:]

We can see that we now have each word paired with the category. Understandably this is a large list.


In [None]:
len(genre_wordlist)

Now we can use the `ConditionalFreqDist` to look at the difference in distribution.

In [None]:
# This will show us the frequency distribution of all words in each genre we chose for our list
cfd = nltk.ConditionalFreqDist(genre_wordlist)
cfd

In [None]:
# look at each of the distributions
print(cfd['romance'])

In [None]:
print(cfd['science_fiction'])

In [None]:
cfd['romance'].most_common(10),cfd['science_fiction'].most_common(10)

Punctuation and stopwords dominate both lists, so we should make a habit of removing them before counting.
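
A minimal sketch of that habit, using made-up tokens and a deliberately tiny stopword set so it runs without any NLTK data (the names `toy_stopwords` and `content_words` are hypothetical):

```python
from collections import Counter

# Small hypothetical stopword set, used so the example runs without NLTK data;
# in the notebook you would use set(stopwords.words('english')) instead.
toy_stopwords = {"the", "a", "and", "was", "she", "he"}

tokens = ["She", "was", "happy", ",", "and", "he", "was", "happy", ".",
          "The", "garden", "was", "quiet", "."]

# Keep alphabetic tokens that are not stopwords, lowercased.
content_words = [w.lower() for w in tokens
                 if w.isalpha() and w.lower() not in toy_stopwords]

print(Counter(tokens).most_common(1))         # [('was', 3)]
print(Counter(content_words).most_common(1))  # [('happy', 2)]
```

Note that `isalpha()` also drops tokens such as contractions and hyphenated words, which may or may not be what an analysis needs.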

### plot() and tabulate()

With `plot()` we can plot the frequencies of a distribution on a graph. This is often useful when time is a variable, as it was with the inaugural addresses. Building on our previous example, we can also select some key words and plot the differences between genres.

In [None]:
color_dif = nltk.ConditionalFreqDist(
    (genre,word)
    for genre in ['romance','science_fiction']
    for w in brown.words(categories=genre)
    for word in ['red','gold','black','white','yellow'] if w.lower() == word)

In [None]:
type(color_dif)

In [None]:
color_dif.plot()

We can also represent this Conditional Frequency Distribution in a table.

In [None]:
color_dif.tabulate()


And use keyword arguments to refine the table.

In [None]:
color_dif.tabulate(conditions = ['romance','science_fiction'],samples=['black','white','red'])

Have you ever wondered which category mentions which day most frequently?

In [None]:
days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

In [None]:
cfd_days = nltk.ConditionalFreqDist(
        (genre,word)
        for genre in ['news','romance']
        for word in brown.words(categories=genre))

In [None]:
cfd_days.tabulate(samples=days)

## Lexical Resources

### Stopwords

NLTK includes a corpus of stopwords: words that occur with high frequency but contribute little to the meaning of a text, such as 'a', 'the', and 'but'. Stopwords can often be filtered out of a text to remove 'noise' from some types of analysis, such as analysing subject matter.

In [None]:
# take a look at the stopwords
from nltk.corpus import stopwords
stopwords.words('english')

In [None]:
# You can see that stopwords are available in many languages.
stopwords.fileids()

**Let's look at a text with and without the stopwords and see how it impacts the results.**

In [None]:
def stopword_distribution(text):
    # use a set so each membership test is fast, and avoid shadowing the stopwords import
    stop_set = set(nltk.corpus.stopwords.words('english'))
    count = [word for word in text if word.lower() in stop_set]
    print('Count of stopwords in text: ' + str(len(count)) + ' out of a total of ' + str(len(text)) + ' words')
    print('That is a text that is', (len(count)/len(text)*100), '% stopwords, not including punctuation.')

stopword_distribution(nltk.corpus.brown.words(categories='romance'))

### Names Corpus
There is also a wordlist of 8,000 first names sorted by gender into separate files. You can use this to find names in text.

In [None]:
# With this corpus we can see which names appear in both the male and female lists
from nltk.corpus import names
names.fileids()

By looking at names that appear in both lists we can generate a list of names that are not gender specific.

In [None]:
# read the female list once into a set so each lookup is fast
female_names = set(names.words('female.txt'))
nongendered_names = [name for name in names.words('male.txt') if name in female_names]

In [None]:
nongendered_names

In [None]:
len(nongendered_names)

In [None]:
# or look at a frequency distribution: is there a first letter more common in men's or women's names?
cfd = nltk.ConditionalFreqDist(
    (gender, name[0])
    for gender in names.fileids()
    for name in names.words(gender))
cfd.plot()

The raw counts make the two lines hard to compare, because there are simply more female names in our sample. We can still see some interesting facets of the data, and we could write a function to look at percentages instead.

In [None]:
len(names.words('male.txt')), len(names.words('female.txt'))

In [None]:
def name_perc():
    import string
    # read each list once instead of once per letter
    female = names.words('female.txt')
    male = names.words('male.txt')
    for l in string.ascii_uppercase:
        ftotal = [n for n in female if n[0] == l]
        mtotal = [n for n in male if n[0] == l]
        fperc = len(ftotal)/len(female) * 100
        mperc = len(mtotal)/len(male) * 100
        print(l + '\n   Female percent: ' + str(round(fperc,2)) + '\n   Male percent:   ' + str(round(mperc,2)) +'\n')


In [None]:
name_perc()
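
NLTK's `FreqDist` can also report proportions directly through its `freq()` method, which may save writing the percentage loop by hand. A small sketch using made-up name lists (the names and the `toy_names` variable are hypothetical, not the Names corpus):

```python
import nltk

# Hypothetical toy name lists of different sizes.
toy_names = {
    "female": ["Alice", "Anna", "Beth", "Cara"],
    "male": ["Adam", "Ben"],
}

cfd_initials = nltk.ConditionalFreqDist(
    (gender, name[0])
    for gender, name_list in toy_names.items()
    for name in name_list)

for gender in cfd_initials.conditions():
    dist = cfd_initials[gender]
    # dist['A'] is a raw count; dist.freq('A') is that count divided by dist.N()
    print(gender, dist["A"], dist.freq("A"))
```

Here the raw counts for 'A' differ between the two lists, while the proportions are the same, which is the distortion the percentages are meant to remove.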

When we look further into Machine Learning and Natural Language Processing, we will see how to establish more uniformity to test data.

### WordNet

[WordNet](https://wordnet.princeton.edu/)

WordNet is a large lexical database of English that groups nouns, verbs, adjectives, and adverbs into "synsets" (essentially groups of synonyms), each expressing a concept (e.g. "bicycle" and "bike"). NLTK provides access to the English WordNet.

In [None]:
# this will give us the synset identifiers for the different concepts that "bicycle" is used to express.
from nltk.corpus import wordnet as wn
wn.synsets('bicycle')

We can see we have two possibilities: bicycle as a noun (`bicycle.n.01`, with the `n` denoting a noun) and bicycle as a verb (`bicycle.v.01`, with the `v` denoting a verb).
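
The three parts of a synset name can be pulled apart with plain string handling. A short sketch (the helper name `split_synset_name` is hypothetical and not part of NLTK):

```python
def split_synset_name(name):
    """Split a name like 'bicycle.n.01' into (lemma, pos, sense_number)."""
    # rsplit from the right so a lemma that itself contains '.' stays intact.
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


print(split_synset_name("bicycle.n.01"))  # ('bicycle', 'n', 1)
print(split_synset_name("bicycle.v.01"))  # ('bicycle', 'v', 1)
```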

Let's take a closer look at these. Using `lemma_names()` will give us the other lemmas that are associated with our term as synonyms.

In [None]:
# we can look at one synset
wn.synset('bicycle.n.01').lemma_names()

In [None]:
# This will show us the lemma names in each of the synsets for "bicycle"
for syn in wn.synsets('bicycle'):
    print(str(syn) + ': ' + str(syn.lemma_names()))

We can see a definition of the concept

In [None]:
wn.synset('bicycle.n.01').definition()

Some concepts even have examples of usage.

In [None]:
wn.synset('car.n.01').examples()

You can also identify a word by its lemma to eliminate ambiguity.

In [None]:
# This will show us all the lemmas in the synset bicycle.n.01
wn.synset('bicycle.n.01').lemmas()

In [None]:
wn.lemma('bicycle.n.01.bike')

In [None]:
# This may seem redundant to us but it is useful for a computer.
# We can get the synset that a given lemma belongs to.
wn.lemma('bicycle.n.01.bike').synset()

Also, we can get all of the lemmas that contain a selected word.

In [None]:
wn.lemmas('bike')

### Hierarchies

With WordNet synsets we can also look at associations between concepts. Some basic concepts are broad and make up the root, and from that root it is possible to navigate to more specific instances of the root's broader concept.

Let's look at some of the more specific instances of a concept, the **hyponyms**.

In [None]:
bike = wn.synset('bike.n.01')
types_of_bikes = bike.hyponyms()
types_of_bikes

In [None]:
# think of each item in this comprehension this way --> `wn.synset('minibike.n.01').lemmas()[0].name()`

sorted(lemma.name() for synset in types_of_bikes for lemma in synset.lemmas())

In [None]:
types_of_bikes[0].lemmas()[0].name()

We can also look up the hierarchy at the **hypernyms**, the more general concepts.

In [None]:
bike.hypernyms()

In [None]:
paths = bike.hypernym_paths()

In [None]:
paths

In [None]:
len(paths)

In [None]:
# this looks at one of the two paths to reach our original synset bike.n.01
[synset.name() for synset in paths[0]]

In [None]:
[synset.name() for synset in paths[1]]

In [None]:
# and we can see the root hypernyms for all the paths
bike.root_hypernyms()
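
More than one path appears whenever some synset along the way has more than one hypernym. The sketch below reproduces that effect on a made-up hierarchy (the node names and the `toy_hypernyms` dictionary are hypothetical and are not WordNet's actual entries):

```python
# Hypothetical toy hierarchy: each node maps to its direct hypernyms (parents).
# 'wheeled_thing' has two parents, which is what produces two paths.
toy_hypernyms = {
    "bike": ["wheeled_thing"],
    "wheeled_thing": ["container", "vehicle"],
    "container": ["entity"],
    "vehicle": ["entity"],
    "entity": [],
}


def toy_hypernym_paths(node):
    """Return every path from a root down to `node`, root first."""
    parents = toy_hypernyms[node]
    if not parents:
        return [[node]]
    return [path + [node]
            for parent in parents
            for path in toy_hypernym_paths(parent)]


for path in toy_hypernym_paths("bike"):
    print(" > ".join(path))
# entity > container > wheeled_thing > bike
# entity > vehicle > wheeled_thing > bike
```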

Other relationships that can be explored between synsets are `meronyms` and `holonyms`. `Meronyms` are components of the term, and `holonyms` are things that contain the term. These two lexical relations come in substance and part varieties. A substance is what something is made of, while a part is a section or element of it. For example, a part meronym for "bike" is "kick stand", and a tree's substance is the type of wood it is made of.

In [None]:
# this will show all of the part meronyms of the "bike.n.01" synset
bike.part_meronyms()

In [None]:
wn.synset('tree.n.01').substance_meronyms()

In [None]:
wn.synset('tree.n.01').part_meronyms()

In [None]:
# a bike has no holonyms so let's look at a tree. 
wn.synsets('tree')

In [None]:
wn.synset('tree.n.01').lemma_names()

In [None]:
wn.synset('tree.n.01').member_holonyms()

In [None]:
# This can become complicated when different synsets of the same word begin to relate to each other.
for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name() + ':', synset.definition())



In [None]:
wn.synset('mint.n.04').part_holonyms()

In [None]:
wn.synset('mint.n.04').substance_holonyms()

Another type of relationship exists between verbs, **entailments**. The act of eating involves the act of chewing so the verb "eat" entails the verb "chew". 

In [None]:
wn.synsets('eating')

In [None]:
wn.synset('eat.v.01').entailments()

Another relationship between synset lemmas is **antonymy**. Antonyms are lemmas whose meanings are opposite to one another.

In [None]:
wn.synsets('hard')

In [None]:
wn.synset('hard.a.02').lemmas()

In [None]:
wn.lemma('hard.a.02.hard').antonyms()

In [None]:
wn.lemma('rush.v.01.rush').antonyms()