# Text Corpora and Lexical Resources

Based on the NLTK book:

["Accessing Text Corpora and Lexical Resources"](https://www.nltk.org/book/ch02.html)

In [1]:
import nltk

## NLTK Text Corpora

NLTK includes many text collections (corpora) and other language resources, listed here: http://www.nltk.org/nltk_data/

Additional information:
* [NLTK Corpus How-to](http://www.nltk.org/howto/corpus.html)

To use these resources, you may need to download them first with `nltk.download()`.

---

**NLTK book: ["Text Corpus Structure"](https://www.nltk.org/book/ch02#text-corpus-structure)**

There are different types of corpora:
* simple collections of text (e.g. Gutenberg corpus)
* categorized (texts are grouped into categories that might correspond to genre, source, author)
* temporal, demonstrating language use over a time period (e.g. news texts)

![Types of NLTK corpora](https://www.nltk.org/images/text-corpus-structure.png)

There are also annotated text corpora that contain linguistic annotations, representing POS tags, named entities, semantic roles, etc. 

### 1) Gutenberg Corpus

NLTK includes a small selection of texts (= multiple files) from the Project Gutenberg electronic text archive:

In [2]:
# let's explore its contents:

nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [3]:
# "Emma" by Jane Austen

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

print(emma)

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]


In [4]:
# you can access corpus texts as characters, words (tokens) or sentences:

file_id = 'austen-emma.txt'

print("\nSentences:")
print( nltk.corpus.gutenberg.sents(file_id)[:3] )

print("\nWords:")
print( nltk.corpus.gutenberg.words(file_id)[:10] )

print("\nChars:")
print( nltk.corpus.gutenberg.raw(file_id)[:50] )


Sentences:
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ['CHAPTER', 'I']]

Words:
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

Chars:
[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I





See https://www.nltk.org/book/ch02#gutenberg-corpus for how to compute statistics over words, sentences and characters (e.g. average number of words per sentence).
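
As a rough sketch of such statistics, the helper below takes the three views of a text (raw characters, words, sentences) and derives a few averages. The function name `corpus_stats` and the toy data are illustrative assumptions, not part of NLTK; with the corpus downloaded, you could pass in the results of `raw()`, `words()` and `sents()` instead.

```python
def corpus_stats(raw_text, words, sents):
    """Return a few simple averages for one text (hypothetical helper)."""
    num_chars = len(raw_text)
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len(set(w.lower() for w in words))
    return {
        "avg_word_len": num_chars / num_words,          # characters per token (spaces included)
        "avg_sent_len": num_words / num_sents,          # tokens per sentence
        "words_per_vocab_item": num_words / num_vocab,  # average uses of each distinct token
    }

# Toy stand-in for a corpus file; for real data one could call e.g.
# corpus_stats(gutenberg.raw(fid), gutenberg.words(fid), gutenberg.sents(fid))
sample_sents = [["The", "wind", "blows", "."], ["The", "wind", "stops", "."]]
sample_words = [w for sent in sample_sents for w in sent]
sample_raw = "The wind blows. The wind stops."

stats = corpus_stats(sample_raw, sample_words, sample_sents)
print(stats)
```

Note that the character count includes whitespace, so the average word length is slightly inflated; this mirrors the simple approach of dividing raw length by token count.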

---

### 2) Brown corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.

This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.

In [5]:
# Brown corpus categories list:

from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [6]:
# We can filter the corpus by (a) one or more categories or (b) file IDs:

print(brown.sents(categories='science_fiction')[:2])

[['Now', 'that', 'he', 'knew', 'himself', 'to', 'be', 'self', 'he', 'was', 'free', 'to', 'grok', 'ever', 'closer', 'to', 'his', 'brothers', ',', 'merge', 'without', 'let', '.'], ["Self's", 'integrity', 'was', 'and', 'is', 'and', 'ever', 'had', 'been', '.']]


In [7]:
print(brown.sents(categories=['news', 'editorial', 'reviews']))

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]


In [8]:
print(brown.words(fileids=['cg22']))

['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]


We can use NLTK **ConditionalFreqDist** to collect statistics on the corpus distribution across genres and other properties:

In [9]:
cfd = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 
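
To make the idea behind a conditional frequency distribution more concrete, here is a minimal standard-library sketch of the same bookkeeping: one counter per condition, fed with `(condition, sample)` pairs. The function `conditional_counts` and the toy pairs are assumptions made for illustration; this is not how NLTK implements `ConditionalFreqDist`, only an approximation of its counting behaviour.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Count samples separately for each condition (illustrative helper)."""
    counts = defaultdict(Counter)
    for condition, sample in pairs:
        counts[condition][sample] += 1
    return counts

# Toy (genre, word) pairs standing in for the Brown corpus generator above
toy_pairs = [("news", "will"), ("news", "will"), ("news", "can"), ("romance", "could")]
toy_cfd = conditional_counts(toy_pairs)

print(toy_cfd["news"]["will"])    # how often "will" was seen under "news"
print(toy_cfd["romance"]["can"])  # unseen samples count as zero
```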


In [10]:
import pandas as pd

In [11]:
# note: tabulate() prints the table and returns None, so `table` is None afterwards
table = cfd.tabulate(conditions=genres, samples=modals)

                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13 


In [12]:
table
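
If you want the counts as a pandas table rather than printed text, one option is to read the individual counts out of the distribution and build a `DataFrame` yourself. The sketch below assumes a mapping from condition to a `Counter`-like object; `counts_to_frame` and the toy data are hypothetical names used only for this example. Since `nltk.FreqDist` is a `Counter` subclass, passing the `cfd` from above together with `genres` and `modals` should work the same way.

```python
from collections import Counter

import pandas as pd

def counts_to_frame(counts, conditions, samples):
    """Build a conditions x samples table of counts (illustrative helper)."""
    rows = {cond: [counts[cond][s] for s in samples] for cond in conditions}
    return pd.DataFrame.from_dict(rows, orient="index", columns=samples)

# Toy counts; with NLTK one could try counts_to_frame(cfd, genres, modals)
toy_counts = {
    "news": Counter({"will": 2, "can": 1}),
    "romance": Counter({"could": 1}),
}
toy_table = counts_to_frame(toy_counts, ["news", "romance"], ["can", "could", "will"])
print(toy_table)
```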

#### Brown corpus contains tags with part-of-speech information

[Working with Tagged Corpora](https://www.nltk.org/book/ch05#tagged-corpora) (NLTK book)

In [13]:
words = nltk.corpus.brown.tagged_words(tagset='universal')
words

[('The', 'DET'), ('Fulton', 'NOUN'), ...]

In [14]:
# islice() lets us read a part of the corpus

from itertools import islice
words = islice(words, 300)

# let's convert it to a list
word_list = list(words)
word_list

[('The', 'DET'),
 ('Fulton', 'NOUN'),
 ('County', 'NOUN'),
 ('Grand', 'ADJ'),
 ('Jury', 'NOUN'),
 ('said', 'VERB'),
 ('Friday', 'NOUN'),
 ('an', 'DET'),
 ('investigation', 'NOUN'),
 ('of', 'ADP'),
 ("Atlanta's", 'NOUN'),
 ('recent', 'ADJ'),
 ('primary', 'NOUN'),
 ('election', 'NOUN'),
 ('produced', 'VERB'),
 ('``', '.'),
 ('no', 'DET'),
 ('evidence', 'NOUN'),
 ("''", '.'),
 ('that', 'ADP'),
 ('any', 'DET'),
 ('irregularities', 'NOUN'),
 ('took', 'VERB'),
 ('place', 'NOUN'),
 ('.', '.'),
 ('The', 'DET'),
 ('jury', 'NOUN'),
 ('further', 'ADV'),
 ('said', 'VERB'),
 ('in', 'ADP'),
 ('term-end', 'NOUN'),
 ('presentments', 'NOUN'),
 ('that', 'ADP'),
 ('the', 'DET'),
 ('City', 'NOUN'),
 ('Executive', 'ADJ'),
 ('Committee', 'NOUN'),
 (',', '.'),
 ('which', 'DET'),
 ('had', 'VERB'),
 ('over-all', 'ADJ'),
 ('charge', 'NOUN'),
 ('of', 'ADP'),
 ('the', 'DET'),
 ('election', 'NOUN'),
 (',', '.'),
 ('``', '.'),
 ('deserves', 'VERB'),
 ('the', 'DET'),
 ('praise', 'NOUN'),
 ('and', 'CONJ'),
 ('thanks'

In [15]:
# find all words with POS tag "ADJ"

tag = 'ADJ'

[item[0] for item in word_list if item[1] == tag]

['Grand',
 'recent',
 'Executive',
 'over-all',
 'Superior',
 'possible',
 'hard-fought',
 'relative',
 'such',
 'widespread',
 'many',
 'outmoded',
 'inadequate',
 'ambiguous',
 'grand',
 'other',
 'best',
 'greater',
 'clerical']

**Additional examples** (using FreqDist, ...):
    
[Working with Tagged Corpora](https://www.nltk.org/book/ch05#tagged-corpora)
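
As a small illustration of counting over tagged words, the sketch below uses `collections.Counter` on the first ten `(word, tag)` pairs from the output above. `nltk.FreqDist` is a `Counter` subclass, so the same calls should carry over; the variable names are chosen for this example only.

```python
from collections import Counter, defaultdict

# First ten (word, tag) pairs of the tagged Brown output shown above
tagged_sample = [
    ('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'),
    ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'),
    ('investigation', 'NOUN'), ('of', 'ADP'),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged_sample)

# Which words were seen with each tag?
words_by_tag = defaultdict(list)
for word, tag in tagged_sample:
    words_by_tag[tag].append(word)

print(tag_counts.most_common(2))
print(words_by_tag['ADJ'])
```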

### 3) NLTK Corpus functionality

* fileids()  = the files of the corpus
* fileids([categories])  = the files of the corpus corresponding to these categories

* categories()  = the categories of the corpus
* categories([fileids])  = the categories of the corpus corresponding to these files

* raw()  = the raw content of the corpus
* raw(fileids=[f1,f2,f3])  = the raw content of the specified files
* raw(categories=[c1,c2])  = the raw content of the specified categories

* words()  = the words of the whole corpus
* words(fileids=[f1,f2,f3])  = the words of the specified fileids
* words(categories=[c1,c2])  = the words of the specified categories

* sents()  = the sentences of the whole corpus
* sents(fileids=[f1,f2,f3])  = the sentences of the specified fileids
* sents(categories=[c1,c2])  = the sentences of the specified categories

* abspath(fileid)  = the location of the given file on disk
* encoding(fileid)  = the encoding of the file (if known)
* open(fileid)  = open a stream for reading the given corpus file
* root  = the path to the root of the locally installed corpus

* readme()  = the contents of the README file of the corpus
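
A few of these methods can be tried without downloading anything by pointing NLTK's `PlaintextCorpusReader` at your own text files. The sketch below builds a throwaway two-file corpus in a temporary directory; the file names and contents are made up for illustration. It sticks to `fileids()`, `raw()` and `words()`, because `sents()` may require the Punkt sentence tokenizer data.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

with tempfile.TemporaryDirectory() as root:
    # Create two tiny text files to act as the corpus (made-up content)
    for name, text in [("a.txt", "The wind blows."), ("b.txt", "The wind stops.")]:
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(text)

    # Every file under `root` whose name matches the pattern becomes a fileid
    toy_corpus = PlaintextCorpusReader(root, r".*\.txt")

    toy_fileids = toy_corpus.fileids()
    toy_raw = toy_corpus.raw("a.txt")
    toy_words = list(toy_corpus.words("b.txt"))

print(toy_fileids)
print(toy_raw)
print(toy_words)
```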


**Note: to explore these corpora with `nltk.Text` functionality (e.g. as in the Introduction part), first wrap their words in an `nltk.Text` object.**
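
A minimal sketch of that wrapping step, using a short made-up token list so that it runs without any downloaded data. With a corpus installed, the same constructor should accept the result of `words()`, as hinted in the final comment.

```python
import nltk

# Made-up tokens standing in for the result of a corpus words() call
tokens = ["The", "wind", "blows", ".", "The", "wind", "stops", "."]
toy_text = nltk.Text(tokens)

print(len(toy_text))           # number of tokens
print(toy_text.count("wind"))  # occurrences of one token

# With the Gutenberg corpus downloaded, one could write for example:
# emma_text = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
# emma_text.concordance('surprize')
```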

# Your turn!

Choose one of the NLTK corpora and **explore it using NLTK** (following the examples here and in the NLTK book).

Also apply what you learned (FreqDist, ...) in section "Computing with Language: Statistics".

---

**Write code in notebook cells below**.
* add more cells (use "+" icon) if necessary

## Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions.

https://www.nltk.org/book/ch02#lexical-resources

We already used NLTK lexical resources (stopwords and common English words).

## WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. 

In [16]:
from nltk.corpus import wordnet as wn

In [17]:
# a collection of synonym sets related to "wind"

wn.synsets('wind')

[Synset('wind.n.01'),
 Synset('wind.n.02'),
 Synset('wind.n.03'),
 Synset('wind.n.04'),
 Synset('tip.n.03'),
 Synset('wind_instrument.n.01'),
 Synset('fart.n.01'),
 Synset('wind.n.08'),
 Synset('weave.v.04'),
 Synset('wind.v.02'),
 Synset('wind.v.03'),
 Synset('scent.v.02'),
 Synset('wind.v.05'),
 Synset('wreathe.v.03'),
 Synset('hoist.v.01')]

In [23]:
len(wn.synsets('get'))

37

In [18]:
# words (lemmas) in one of the synsets:

wn.synset('wind.n.08').lemma_names()

['wind', 'winding', 'twist']

In [19]:
wn.synset('wind.n.08').definition()

'the act of winding or twisting'

In [20]:
wn.synset('wind.n.08').examples()

['he put the key in the old clock and gave it a good wind']

In [24]:
# let's explore all the synsets for this word

for synset in wn.synsets('wind'):
    print(synset.lemma_names())

['wind', 'air_current', 'current_of_air']
['wind']
['wind']
['wind', 'malarkey', 'malarky', 'idle_words', 'jazz', 'nothingness']
['tip', 'lead', 'steer', 'confidential_information', 'wind', 'hint']
['wind_instrument', 'wind']
['fart', 'farting', 'flatus', 'wind', 'breaking_wind']
['wind', 'winding', 'twist']
['weave', 'wind', 'thread', 'meander', 'wander']
['wind', 'twist', 'curve']
['wind', 'wrap', 'roll', 'twine']
['scent', 'nose', 'wind']
['wind', 'wind_up']
['wreathe', 'wind']
['hoist', 'lift', 'wind']


In [25]:
# see all lemmas for a given word (each lemma belongs to one synset that contains the word)

wn.lemmas('curve')

[Lemma('curve.n.01.curve'),
 Lemma('curve.n.02.curve'),
 Lemma('curve.n.03.curve'),
 Lemma('curvature.n.03.curve'),
 Lemma('bend.n.03.curve'),
 Lemma('swerve.v.01.curve'),
 Lemma('wind.v.02.curve'),
 Lemma('arch.v.01.curve'),
 Lemma('crook.v.01.curve'),
 Lemma('curl.v.01.curve')]
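
To illustrate the idea of looking up synsets by word, the sketch below inverts a small synset-to-lemmas mapping into a word-to-synsets index. The data is copied from four of the `wind` synsets shown earlier, so the index only covers those; it is an illustration of the lookup principle, not a description of how WordNet or `wn.lemmas()` works internally.

```python
from collections import defaultdict

# Four of the 'wind' synsets and their lemma names, taken from the output above
synset_lemmas = {
    'wind.n.01': ['wind', 'air_current', 'current_of_air'],
    'wind.n.08': ['wind', 'winding', 'twist'],
    'wind.v.02': ['wind', 'twist', 'curve'],
    'wind.v.03': ['wind', 'wrap', 'roll', 'twine'],
}

# Invert the mapping: lemma name -> names of the synsets that contain it
lemma_index = defaultdict(list)
for synset_name, lemmas in synset_lemmas.items():
    for lemma in lemmas:
        lemma_index[lemma].append(synset_name)

print(lemma_index['twist'])
print(lemma_index['curve'])
```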

### Try it yourself!

In [26]:
mysyn = wn.synsets('wind')
mysyn

[Synset('wind.n.01'),
 Synset('wind.n.02'),
 Synset('wind.n.03'),
 Synset('wind.n.04'),
 Synset('tip.n.03'),
 Synset('wind_instrument.n.01'),
 Synset('fart.n.01'),
 Synset('wind.n.08'),
 Synset('weave.v.04'),
 Synset('wind.v.02'),
 Synset('wind.v.03'),
 Synset('scent.v.02'),
 Synset('wind.v.05'),
 Synset('wreathe.v.03'),
 Synset('hoist.v.01')]

In [27]:
mylist = [syn.lemma_names() for syn in wn.synsets('wind') ]
mylist

[['wind', 'air_current', 'current_of_air'],
 ['wind'],
 ['wind'],
 ['wind', 'malarkey', 'malarky', 'idle_words', 'jazz', 'nothingness'],
 ['tip', 'lead', 'steer', 'confidential_information', 'wind', 'hint'],
 ['wind_instrument', 'wind'],
 ['fart', 'farting', 'flatus', 'wind', 'breaking_wind'],
 ['wind', 'winding', 'twist'],
 ['weave', 'wind', 'thread', 'meander', 'wander'],
 ['wind', 'twist', 'curve'],
 ['wind', 'wrap', 'roll', 'twine'],
 ['scent', 'nose', 'wind'],
 ['wind', 'wind_up'],
 ['wreathe', 'wind'],
 ['hoist', 'lift', 'wind']]

In [30]:
str(mylist[0])

"['wind', 'air_current', 'current_of_air']"

In [32]:
mystrings = [" ".join(line) for line in mylist]
mystrings

['wind air_current current_of_air',
 'wind',
 'wind',
 'wind malarkey malarky idle_words jazz nothingness',
 'tip lead steer confidential_information wind hint',
 'wind_instrument wind',
 'fart farting flatus wind breaking_wind',
 'wind winding twist',
 'weave wind thread meander wander',
 'wind twist curve',
 'wind wrap roll twine',
 'scent nose wind',
 'wind wind_up',
 'wreathe wind',
 'hoist lift wind']

In [31]:
"|||".join(mylist[0])

'wind|||air_current|||current_of_air'

In [35]:
with open('mytext.txt', 'w') as f:
    f.write("Just some text\n2nd line")

In [33]:
with open('mysyn.txt', 'w') as f:
    # note: writelines() does not add line breaks, so the entries run together
    f.writelines(mystrings)

In [34]:
with open('mysyn2.txt', 'w') as f:
    for line in mystrings:
        f.write(line+"\n")
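
The difference between the last two cells is easy to check by reading the files back. The sketch below repeats both ways of writing with two sample strings in a temporary directory (the file names are made up): `writelines()` writes the strings exactly as given, while adding `"\n"` to each one gives one entry per line.

```python
import os
import tempfile

lines = ["wind winding twist", "wind twist curve"]

with tempfile.TemporaryDirectory() as tmp:
    joined_path = os.path.join(tmp, "joined.txt")
    per_line_path = os.path.join(tmp, "per_line.txt")

    # writelines() adds no separators between the strings
    with open(joined_path, "w", encoding="utf-8") as f:
        f.writelines(lines)

    # appending "\n" to each string puts every entry on its own line
    with open(per_line_path, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)

    with open(joined_path, encoding="utf-8") as f:
        joined = f.read()
    with open(per_line_path, encoding="utf-8") as f:
        recovered = f.read().splitlines()

print(repr(joined))
print(recovered)
```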

---

**Additional WordNet examples:**
* https://www.nltk.org/book/ch02#wordnet