# NLTK Corpora and Treebanks

In the field of Natural Language Processing, a sample of real-world text is referred to as a *Corpus* (plural *Corpora*).

Possibly the most famous and widely used corpus is the **Brown Corpus** (https://en.wikipedia.org/wiki/Brown_Corpus), the first million-word electronic corpus of English, compiled at Brown University from texts published in 1961.

A *Treebank* is a parsed and annotated Corpus that records the syntactic or semantic structure of its content, typically for every sentence and often for every word. The first large-scale treebank was the **Penn Treebank**, created in 1992 at the University of Pennsylvania.

Many corpora and treebanks are available for the NLTK toolkit. They are listed on the NLTK data page: http://www.nltk.org/nltk_data/

## Downloading and Exploring a Corpus ##

Corpora are not installed by default. To use one of the standard NLTK corpora, download it with the NLTK downloader as shown in the example below.

**Example** - Download the Reuters Corpus:


In [1]:
import nltk
nltk.download("reuters")

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\34320\AppData\Roaming\nltk_data...


True
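
Calling `nltk.download` again in every session is harmless but repeats the check against the download server. A minimal sketch of a guard, assuming a hypothetical helper named `ensure_resource` (it is not part of NLTK) that looks for the data with `nltk.data.find` and downloads only when the lookup fails:

```python
import nltk

def ensure_resource(resource_path, package, download=nltk.download):
    """Download `package` only if `resource_path` is not installed yet.

    Hypothetical helper for illustration. `download` is a parameter so the
    function can be tried out without network access.
    Returns True if a download was triggered, False otherwise.
    """
    try:
        nltk.data.find(resource_path)   # raises LookupError when the data is missing
        return False
    except LookupError:
        download(package)
        return True
```

For the corpus above, the call would look like `ensure_resource("corpora/reuters", "reuters")`.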

Various methods can be called on a loaded corpus to, for example, extract its words, identify the files that make it up, list its category definitions and extract text filtered by category.

A list of common methods is provided in Chapter 2 of the [NLTK Book](http://www.nltk.org/book/ch02.html) as follows:  
<table align="left">
<tr><td><b>Example</b></td>   <td align="left"><b>Description</b></td></tr>
<tr><td>fileids()</td>   <td align="left">the files of the corpus</td></tr>
<tr><td>fileids([categories])</td>   <td align="left">the files of the corpus corresponding to these categories</td></tr>
<tr><td>categories()</td>   <td align="left">the categories of the corpus</td></tr>
<tr><td>categories([fileids])</td>   <td align="left">the categories of the corpus corresponding to these files</td></tr>
<tr><td>raw()</td>   <td align="left">the raw content of the corpus</td></tr>
<tr><td>raw(fileids=[f1,f2,f3])</td>   <td align="left">the raw content of the specified files</td></tr>
<tr><td>raw(categories=[c1,c2])</td>   <td align="left">the raw content of the specified categories</td></tr>
<tr><td>words()</td>   <td align="left">the words of the whole corpus</td></tr>
<tr><td>words(fileids=[f1,f2,f3])</td>   <td align="left">the words of the specified fileids</td></tr>
<tr><td>words(categories=[c1,c2])</td>   <td align="left">the words of the specified categories</td></tr>
<tr><td>sents()</td>   <td align="left">the sentences of the whole corpus</td></tr>
<tr><td>sents(fileids=[f1,f2,f3])</td>   <td align="left">the sentences of the specified fileids</td></tr>
<tr><td>sents(categories=[c1,c2])</td>   <td align="left">the sentences of the specified categories</td></tr>
<tr><td>abspath(fileid)</td>   <td align="left">the location of the given file on disk</td></tr>
<tr><td>encoding(fileid)</td>   <td align="left">the encoding of the file (if known)</td></tr>
<tr><td>open(fileid)</td>   <td align="left">open a stream for reading the given corpus file</td></tr>
<tr><td>root</td>   <td align="left">the path to the root of the locally installed corpus</td></tr>
<tr><td>readme()</td>   <td align="left">the contents of the README file of the corpus</td></tr>
</table>
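
To try a few of these methods without downloading anything, the sketch below builds a tiny throwaway corpus in a temporary directory and reads it with `CategorizedPlaintextCorpusReader`. The file names, categories and sentences are invented for illustration; the Reuters reader used later exposes the same methods.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Invented sample files: one sub-directory per category.
samples = {
    "grain/g1.txt": "Wheat prices fell.",
    "trade/t1.txt": "Exports rose sharply.",
    "trade/t2.txt": "Tariffs hurt trade.",
}

root = tempfile.mkdtemp()
for fileid, text in samples.items():
    path = os.path.join(root, *fileid.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf8") as handle:
        handle.write(text)

# cat_pattern takes each file's category from its directory name.
corpus = CategorizedPlaintextCorpusReader(root, r".*\.txt", cat_pattern=r"(\w+)/.*")

print(corpus.categories())                           # all categories
print(corpus.fileids(categories="trade"))            # files in one category
print(corpus.categories(fileids=["grain/g1.txt"]))   # categories of one file
print(corpus.raw(fileids=["trade/t1.txt"]))          # raw text of one file
print(list(corpus.words(categories="trade")))        # tokens of one category
```

`sents()` is left out on purpose: for a plain-text reader it relies on the Punkt tokenizer model, which has to be downloaded first.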

#### Download some additional data packages from NLTK ####
These are for use later on in this set of Python Notebooks.

In [None]:
nltk.download('punkt')                       # Punkt tokenizer model
nltk.download('averaged_perceptron_tagger')  # Part-of-speech tagger
nltk.download('stopwords')                   # Stopword lists

#### Examples with the Reuters Corpus ####

The original Reuters Corpus was compiled from newswire articles from 1987.  In 1990 the documents were made available by Reuters and CGI *for research purposes only*.  One of the central people involved in developing this corpus was David Lewis, and he has published more information and background here: http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt

The original corpus contained 22,173 documents and was known as the "Reuters-22173" corpus. It was then pared down by removing duplicates and became known as the Reuters-21578 collection.

The Reuters Corpus is released with the following requirement: *"If you publish results based on this data set, please acknowledge its use, refer to the data set by the name 'Reuters-21578, Distribution 1.0', and inform your readers of the current location of the data set."*

### Categories, Files and Words ###

Load the corpus and list any categories defined in it:

In [1]:
from nltk.corpus import reuters
categories = reuters.categories()
print("Number of Categories:", len(categories))
# Show the first ten and the last ten category names
print(categories[:10], categories[-10:])

Individual words can be extracted with the `words()` method:

In [4]:
words = reuters.words()
print("number of words", len(words))
print("first 9 words:", words[0:9])

number of words 1720901
first 9 words: ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-']


### Filter Text by Category ###

Get all news article words classified under "trade":

In [None]:
# Extract the words of a specific category
tradeWords = reuters.words(categories = 'trade')
len(tradeWords)

### Remove Stopwords and Punctuation ###



In [2]:
from nltk.corpus import stopwords
import string
print(stopwords.words('english'))

In [None]:
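# Build the stopword set once: calling stopwords.words() for every token
# re-reads the list each time and takes minutes on this corpus
stopWords = set(stopwords.words('english'))
tradeWords = [w for w in tradeWords if w.lower() not in stopWords]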

In [1]:
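# Drop tokens that occur in string.punctuation (single punctuation marks)
tradeWords = [w for w in tradeWords if w not in string.punctuation]
# Also drop tokens that pair a punctuation mark with a double quote, such as '."' or '",'
punctCombo = [c + '"' for c in string.punctuation] + ['"' + c for c in string.punctuation]
tradeWords = [w for w in tradeWords if w not in punctCombo]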
len(tradeWords)

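
The filtering steps can be tried on a handful of tokens without any downloads. The sketch below is an illustration only: the stopword set and the sample tokens are hand-picked assumptions, and instead of the `punctCombo` list it uses a different test, dropping any token made up entirely of punctuation characters.

```python
import string

# Hypothetical, hand-picked stopwords so the sketch runs without downloads;
# in the notebook this would be set(stopwords.words('english')).
stop_words = {"the", "of", "a", "to", "and", "in"}

# Invented sample tokens, shaped like the output of reuters.words()
tokens = ["The", "U", ".", "S", ".", "trade", "deficit", "widened",
          ",", "officials", "said", '."']

def clean(tokens, stop_words):
    """Drop stopwords (case-insensitively) and tokens made only of punctuation."""
    punctuation = set(string.punctuation)
    return [
        t for t in tokens
        if t.lower() not in stop_words
        and not all(ch in punctuation for ch in t)
    ]

print(clean(tokens, stop_words))
```

Testing every character of a token catches multi-character tokens such as `'."'` or `'.-'` in one pass, at the cost of also dropping tokens like `'--'` that might occasionally be wanted.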

### Word Frequency Distribution ###

In [None]:
fdist = nltk.FreqDist(tradeWords)
fdist.plot(20, cumulative=False)
# Note: I had to install Ghostscript on my PC to get this working in Jupyter

In [None]:
for word, frequency in fdist.most_common(10):
    print(word, frequency)
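
`FreqDist` can also be queried directly rather than plotted. A small sketch on an invented token list (no corpus download needed) showing a few of its methods:

```python
import nltk

# Invented tokens for illustration
tokens = ["trade", "deficit", "trade", "surplus", "trade", "deficit"]
fdist = nltk.FreqDist(tokens)

print(fdist["trade"])         # number of times one token occurs
print(fdist.N())              # total number of tokens counted
print(fdist.freq("trade"))    # relative frequency of one token
print(fdist.most_common(2))   # the two most frequent tokens with their counts
```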

### Bi-Grams ###


In [None]:
biTradeWords = nltk.bigrams(tradeWords)
biFdist = nltk.FreqDist(biTradeWords)
biFdist.plot(20, cumulative=False)
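
To see what `nltk.bigrams` produces, here is a sketch on an invented token list. Note that `nltk.bigrams` returns a generator, so it is converted to a list here in order to inspect it more than once.

```python
import nltk

# Invented tokens for illustration
tokens = ["trade", "deficit", "widened", "as", "trade", "deficit", "grew"]

pairs = list(nltk.bigrams(tokens))   # each token paired with its successor
bi_fdist = nltk.FreqDist(pairs)

print(pairs[:3])
print(bi_fdist.most_common(1))
```

Because the bigrams are built from whatever token list is passed in, running them on text with stopwords and punctuation already removed pairs up words that were not necessarily adjacent in the original article.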

## Exploring the Penn Treebank ##
The NLTK data package includes a 10% sample of the Penn Treebank. It can be used to explore Part-of-Speech tagging and the way the words of a sentence are classified, grouped and arranged into hierarchies.

In [None]:
nltk.download('treebank')

In [None]:
from nltk.corpus import treebank

In [None]:
words = treebank.words()
tagged = treebank.tagged_words()
print(type(tagged))
print("Word Count", len(words))
print("Tagged words sample: ",tagged[0:9])
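
`tagged_words()` yields `(word, tag)` tuples, which makes it easy to count tags or pick out words by tag. A sketch using a hand-typed sample in that shape (the sample is an assumption for illustration, not output copied from the corpus):

```python
from collections import Counter

# Hand-typed (word, tag) pairs in the shape returned by treebank.tagged_words()
tagged_sample = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
    ("years", "NNS"), ("old", "JJ"), (",", ","), ("will", "MD"), ("join", "VB"),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged_sample)

# Which words carry the proper-noun tag?
proper_nouns = [word for word, tag in tagged_sample if tag == "NNP"]

print(tag_counts.most_common(3))
print(proper_nouns)
```

The same two lines should work unchanged on the full `tagged` list from the cell above.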

This treebank allows us to extract specific parsed sentences:

In [None]:
parsed = treebank.parsed_sents()[0]

In [None]:
print(parsed)

In [None]:
type(parsed)
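
Each parsed sentence is an `nltk.tree.Tree`, which can be navigated in code as well as printed. The sketch below builds a small tree from a hand-written bracketed string (not taken from the corpus) and reads a few things off it:

```python
from nltk.tree import Tree

# Hand-written bracketed parse for illustration
tree = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")

print(tree.label())    # label of the root node
print(tree.leaves())   # the words, left to right
print(tree.pos())      # (word, tag) pairs read off the lowest-level nodes

# Visit only the noun-phrase subtrees
for subtree in tree.subtrees(lambda t: t.label() == "NP"):
    print(subtree)
```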

In [None]:
# Render the tree graphically in the notebook
from IPython.display import display
display(parsed)

## Further Investigation ##

+ Import the Brown Corpus
+ List the categories and count the number of words by category
+ Extract the first sentence from the corpus
+ Compare the average sentence length between Religion, News and Humor categories
+ Identify the top-20 words and bigrams in sentences categorized as "religion" and "news"
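
As a possible starting point for the sentence-length comparison, here is a sketch of a helper that takes sentences in the form returned by `sents()` (a sequence of token lists). The function name and the sample sentences are assumptions for illustration.

```python
def average_sentence_length(sents):
    """Mean number of tokens per sentence; `sents` is an iterable of token lists."""
    total_tokens = 0
    sentence_count = 0
    for sent in sents:
        total_tokens += len(sent)
        sentence_count += 1
    # Avoid dividing by zero when there are no sentences
    return total_tokens / sentence_count if sentence_count else 0.0

# Invented sample sentences
sample = [["In", "the", "beginning"], ["It", "was", "a", "dark", "night", "."]]
print(average_sentence_length(sample))
```

With the Brown Corpus downloaded, the same helper could be applied to `brown.sents(categories="religion")` and to the other categories in turn.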