# 2. Accessing Text Corpora and Lexical Resources #
## 1 Accessing Text Corpora ##
+ A text corpus is a large body of text.

### 1.1 Gutenberg Corpus ###
+ [Project Gutenberg](http://www.gutenberg.org/) is an electronic text archive, which contains some 25,000 free electronic books.

In [23]:
import nltk
from nltk.corpus import gutenberg as gb

print("gutenberg fileids\n:{0}\n".format(gb.fileids()))
emma=gb.words("austen-emma.txt")
print("emma length:\n{0}\n".format(len(emma)))

## concordance() is a method of nltk.Text, so the word list must first be wrapped in a Text object
## (chapter 1 did not need this step because it used ready-made Text objects)
emma_text=nltk.Text(emma)
emma_text.concordance("surprise")

## For each text, print three statistics: average word length, average sentence length,
## and the average number of times each vocabulary item appears in the text (the lexical diversity score).
## The rounded average word length stays at 4 or 5 across the texts, so it looks like a general
## property of English. (The true figure is somewhat lower, because num_chars also counts
## whitespace characters.) By contrast, average sentence length and lexical diversity appear
## to be characteristics of particular authors.
print("\n:average word length, average sentence length, lexical diversity, fileid")
for fileid in gb.fileids():
    num_chars = len(gb.raw(fileid))
    num_words = len(gb.words(fileid))
    num_sents = len(gb.sents(fileid))
    num_vocab = len(set(w.lower() for w in gb.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)


gutenberg fileids
:['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

emma length:
192427

Displaying 1 of 1 matches:
 that Emma could not but feel some surprise , and a little displeasure , on he

:average word length, average sentence length, lexical diversity, fileid
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
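
The first column above divides raw characters by tokens, so whitespace inflates it. The sketch below shows the same three ratios next to a word length computed from the tokens themselves. It runs on a tiny hand-made sample rather than the Gutenberg data, and the helper names `text_stats` and `mean_token_length` are made up for illustration; with NLTK one would presumably pass `gb.raw(fileid)`, `gb.words(fileid)`, and `gb.sents(fileid)` instead.

```python
def text_stats(raw, words, sents):
    """Return (characters per token, tokens per sentence, tokens per vocabulary item)."""
    num_chars = len(raw)                         # includes spaces and newlines
    num_words = len(words)                       # tokens, punctuation included
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})  # case-folded vocabulary
    return num_chars / num_words, num_words / num_sents, num_words / num_vocab


def mean_token_length(words):
    """Average length of alphabetic tokens only (no whitespace, no punctuation)."""
    alpha = [w for w in words if w.isalpha()]
    return sum(len(w) for w in alpha) / len(alpha)


# Hypothetical stand-ins for gb.raw(), gb.words(), and gb.sents().
sample_raw = "The cat sat. The dog ran."
sample_sents = [["The", "cat", "sat", "."], ["The", "dog", "ran", "."]]
sample_words = [w for sent in sample_sents for w in sent]

chars_per_token, tokens_per_sent, tokens_per_type = text_stats(
    sample_raw, sample_words, sample_sents
)
print(chars_per_token, tokens_per_sent, round(tokens_per_type, 2))
print(mean_token_length(sample_words))
```

On this sample the character-based ratio comes out slightly above the token-based average, which is the same direction of bias the comment in the cell describes. Filtering with `isalpha()` is one possible choice; it also drops numbers and hyphenated tokens, which may or may not be what is wanted.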


In [26]:
## The sents() function divides the text up into its sentences, where each sentence is a list of words
macbeth_sentences=gb.sents("shakespeare-macbeth.txt")
print("\nmacbeth_sentences:\n{0}\n".format(macbeth_sentences))
print("macbeth_sentences[1116]:\n{0}\n".format(macbeth_sentences[1116]))
## find the length of the longest sentence, then list every sentence that has that length
longest_len=max(len(s) for s in macbeth_sentences)
print([s for s in macbeth_sentences if len(s) == longest_len])


macbeth_sentences:
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]

macbeth_sentences[1116]:
['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';', 'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble']

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fort