<a href="https://colab.research.google.com/github/giorgiosld/Natural-Language-Processing/blob/main/book/NLP_chapter2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Accessing Text Corpora and Lexical Resources

## Accessing Text Corpora

A text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres.


In [None]:
import nltk

nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

emma = nltk.corpus.gutenberg.words("austen-emma.txt")
len(emma)

# to see a concordance for a word, as in the first chapter, wrap the words in nltk.Text
emma2 = nltk.Text(nltk.corpus.gutenberg.words("austen-emma.txt"))
emma2.concordance("surprize")

# more concisely, import the corpus reader directly
from nltk.corpus import gutenberg
gutenberg.fileids()
emma = gutenberg.words("austen-emma.txt")


With a for loop we can display three statistics for each text: average word length, average sentence length, and the average number of times each vocabulary item appears in the text (the lexical diversity score). Notice that the average word length comes out as 4 or 5 for every text. The real figure is about one character lower, because the character count taken from raw() also includes whitespace.

In [15]:
# uncomment this line the first time you run the cell
# nltk.download('punkt')

for fileid in gutenberg.fileids():
  n_chars = len(gutenberg.raw(fileid))
  n_words = len(gutenberg.words(fileid))
  n_sents = len(gutenberg.sents(fileid))
  n_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
  print(round(n_chars/n_words), round(n_words/n_sents), round(n_words/n_vocab), fileid)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
4 34 79 bible-kjv.txt
5 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 18 12 burgess-busterbrown.txt
4 20 13 carroll-alice.txt
5 20 12 chesterton-ball.txt
5 23 11 chesterton-brown.txt
5 18 11 chesterton-thursday.txt
4 21 25 edgeworth-parents.txt
5 26 15 melville-moby_dick.txt
5 52 11 milton-paradise.txt
4 12 9 shakespeare-caesar.txt
4 12 8 shakespeare-hamlet.txt
4 12 7 shakespeare-macbeth.txt
5 36 12 whitman-leaves.txt
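
The same three ratios can be checked by hand on a tiny example. The sketch below does not use NLTK: `raw` and `sents` are hand-written stand-ins for what `gutenberg.raw(fileid)` and `gutenberg.sents(fileid)` return, and `corpus_stats` is a hypothetical helper name chosen for illustration.

```python
def corpus_stats(raw, sents):
    """Return (avg word length, avg sentence length, lexical diversity)."""
    words = [w for sent in sents for w in sent]
    n_chars = len(raw)                            # spaces are counted too
    n_words = len(words)                          # punctuation counts as a token
    n_sents = len(sents)
    n_vocab = len(set(w.lower() for w in words))  # case-insensitive vocabulary
    return n_chars / n_words, n_words / n_sents, n_words / n_vocab


# Toy stand-ins for gutenberg.raw(fileid) and gutenberg.sents(fileid)
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]

avg_word, avg_sent, diversity = corpus_stats(raw, sents)
print(round(avg_word, 2), avg_sent, diversity)

# Counting only the characters inside tokens gives a smaller average,
# which is why the figures printed by the loop are slightly inflated.
words = [w for sent in sents for w in sent]
token_chars = sum(len(w) for w in words)
print(round(token_chars / len(words), 2))
```

Here the raw string has 29 characters but the 9 tokens contain only 23 of them; the 6 missing characters are the spaces.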


The raw() function gives us the contents of the file without any linguistic processing, whereas sents() divides the text into sentences, each of which is a list of words.
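
To make the difference in shape concrete, here is a minimal sketch with hand-written data instead of a real corpus: `raw` plays the role of the string returned by raw(), and `sents` the list of word lists returned by sents(). The two sentences are invented for illustration. The last two lines apply the same "longest sentence" idiom used in the next cell.

```python
# raw() returns one long string; sents() returns a list of lists of tokens.
raw = "It rained. We stayed inside, read books, and waited."
sents = [
    ["It", "rained", "."],
    ["We", "stayed", "inside", ",", "read", "books", ",", "and", "waited", "."],
]

print(type(raw).__name__, len(raw))      # length in characters
print(type(sents).__name__, len(sents))  # length in sentences

# Longest sentence(s): find the maximum length, then keep every sentence
# that reaches it (there may be ties, hence a list).
longest_len = max(len(s) for s in sents)
longest = [s for s in sents if len(s) == longest_len]
print(longest_len, longest)
```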

In [21]:
macbeth_sentences = gutenberg.sents("shakespeare-macbeth.txt")
macbeth_sentences
macbeth_sentences[1116]
longest_len = max(len(s) for s in macbeth_sentences)
[s for s in macbeth_sentences if len(s) == longest_len]

[['Doubtfull',
  'it',
  'stood',
  ',',
  'As',
  'two',
  'spent',
  'Swimmers',
  ',',
  'that',
  'doe',
  'cling',
  'together',
  ',',
  'And',
  'choake',
  'their',
  'Art',
  ':',
  'The',
  'mercilesse',
  'Macdonwald',
  '(',
  'Worthie',
  'to',
  'be',
  'a',
  'Rebell',
  ',',
  'for',
  'to',
  'that',
  'The',
  'multiplying',
  'Villanies',
  'of',
  'Nature',
  'Doe',
  'swarme',
  'vpon',
  'him',
  ')',
  'from',
  'the',
  'Westerne',
  'Isles',
  'Of',
  'Kernes',
  'and',
  'Gallowgrosses',
  'is',
  'supply',
  "'",
  'd',
  ',',
  'And',
  'Fortune',
  'on',
  'his',
  'damned',
  'Quarry',
  'smiling',
  ',',
  'Shew',
  "'",
  'd',
  'like',
  'a',
  'Rebells',
  'Whore',
  ':',
  'but',
  'all',
  "'",
  's',
  'too',
  'weake',
  ':',
  'For',
  'braue',
  'Macbeth',
  '(',
  'well',
  'hee',
  'deserues',
  'that',
  'Name',
  ')',
  'Disdayning',
  'Fortune',
  ',',
  'with',
  'his',
  'brandisht',
  'Steele',
  ',',
  'Which',
  'smoak',
  "'",
  'd',
 

### Brown Corpus
This example uses the Brown Corpus, compiled at Brown University, to study systematic differences between genres, a kind of linguistic inquiry known as **stylistics**.  

In [27]:
import nltk
from nltk.corpus import brown

nltk.download("brown")

brown.categories()
brown.words(categories="news")
brown.words(fileids=["cg22"])
brown.sents(categories=["news", "editorial", "reviews"])

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

Next, we compare genres by their usage of modal verbs. The first step is to produce the counts for a particular genre.

In [28]:
news_text = brown.words(categories="news")
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
  print(f"{m}: {fdist[m]}")

can: 94
could: 87
may: 93
might: 38
must: 53
will: 389
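
`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be sketched with the standard library alone. The token list and the helper name `modal_counts` below are made up for illustration; they are not part of NLTK or of the Brown Corpus.

```python
from collections import Counter


def modal_counts(tokens, modals):
    """Count each modal in tokens, ignoring case."""
    counts = Counter(t.lower() for t in tokens)
    # A Counter returns 0 for missing keys, so absent modals are safe to look up.
    return {m: counts[m] for m in modals}


# Hypothetical tokens standing in for brown.words(categories="news")
tokens = ["We", "can", "go", ".", "Can", "they", "?",
          "They", "will", ",", "and", "we", "will", "too", "."]

result = modal_counts(tokens, ["can", "could", "will"])
for modal, n in result.items():
    print(f"{modal}: {n}")
```

Lower-casing before counting is what merges "Can" and "can" into a single entry, just as `w.lower()` does in the cell above.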


The next step is to obtain the counts for each genre of interest, using a conditional frequency distribution.

In [29]:
cfd = nltk.ConditionalFreqDist((genre, word)
                              for genre in brown.categories()
                              for word in brown.words(categories=genre))
genres = ["news", "religion", "hobbies", "science_fiction", "romance", "humor"]
modals = ["can", "could", "may", "might", "must", "will"]
# the keyword is `conditions` (plural); a misspelled keyword is ignored and every genre is printed
cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 
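
Conceptually, a conditional frequency distribution is one frequency distribution per condition, built from (condition, sample) pairs. The sketch below imitates that idea with `defaultdict(Counter)` from the standard library; it is not NLTK's implementation, and the names `conditional_counts`, `format_table` and the toy pairs are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Build {condition: Counter of samples} from (condition, sample) pairs."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table


def format_table(table, conditions, samples, width=6):
    """Render the selected conditions (rows) and samples (columns) as text."""
    label_width = max(len(c) for c in conditions)
    lines = [" " * label_width + "".join(f"{s:>{width}}" for s in samples)]
    for c in conditions:
        cells = "".join(f"{table[c][s]:>{width}}" for s in samples)
        lines.append(f"{c:>{label_width}}" + cells)
    return "\n".join(lines)


# Hypothetical (genre, word) pairs standing in for the Brown Corpus
pairs = [("news", "will"), ("news", "will"), ("news", "can"),
         ("romance", "could"), ("romance", "could"), ("romance", "will")]

table = conditional_counts(pairs)
print(format_table(table, ["news", "romance"], ["can", "could", "will"]))
```

Restricting the rows and columns at display time, rather than when counting, mirrors how `tabulate` is used above: the distribution is built once over every genre and then queried for the genres and modals of interest.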
