# Text Corpora and Lexical Resources

Based on the NLTK book:

["Accessing Text Corpora and Lexical Resources"](https://www.nltk.org/book/ch02.html)

In [None]:
import nltk

## NLTK Text Corpora

NLTK includes many text collections (corpora) and other language resources, listed here: http://www.nltk.org/nltk_data/

Additional information:
* [NLTK Corpus How-to](http://www.nltk.org/howto/corpus.html)

To use these resources, you may first need to download them with `nltk.download()`.
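A minimal sketch of downloading a resource only when it is missing. The helper name `ensure_resource` and its `download` parameter are illustrative and not part of NLTK; the sketch relies on `nltk.data.find()` raising `LookupError` when a resource is not installed locally.

```python
import nltk


def ensure_resource(resource_path, package_name, download=nltk.download):
    """Fetch package_name only if resource_path is not installed yet.

    Hypothetical helper: returns True if a download was attempted,
    False if the resource was already available locally.
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
        return False
    except LookupError:
        download(package_name)
        return True


# Example calls (they need network access the first time they run):
# ensure_resource('corpora/gutenberg', 'gutenberg')
# ensure_resource('corpora/brown', 'brown')
```

Calling `nltk.download()` without arguments opens the interactive downloader instead, which is convenient for browsing the available packages.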

---

**NLTK book: ["Text Corpus Structure"](https://www.nltk.org/book/ch02#text-corpus-structure)**

There are different types of corpora:
* simple collections of text (e.g. Gutenberg corpus)
* categorized (texts are grouped into categories that might correspond to genre, source, author)
* temporal, demonstrating language use over a time period (e.g. news texts)

![Types of NLTK corpora](https://www.nltk.org/images/text-corpus-structure.png)

There are also annotated text corpora that contain linguistic annotations, representing POS tags, named entities, semantic roles, etc. 

### 1) Gutenberg Corpus

NLTK includes a small selection of texts (= multiple files) from the Project Gutenberg electronic text archive:

In [None]:
# let's explore its contents:

nltk.corpus.gutenberg.fileids()

In [None]:
# "Emma" by Jane Austen

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

print(emma)

In [None]:
# you can access corpus texts as characters, words (tokens) or sentences:

file_id = 'austen-emma.txt'

print("\nSentences:")
print( nltk.corpus.gutenberg.sents(file_id)[:3] )

print("\nWords:")
print( nltk.corpus.gutenberg.words(file_id)[:10] )

print("\nChars:")
print( nltk.corpus.gutenberg.raw(file_id)[:50] )

See https://www.nltk.org/book/ch02#gutenberg-corpus for how to compute statistics over characters, words and sentences (e.g. the average number of words per sentence).
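The statistics mentioned above can be sketched as a small function. The name `text_stats` and the toy data are made up for illustration, so the block runs without any downloaded corpus; the function only assumes that its three arguments support `len()` and iteration, as the outputs of `raw()`, `words()` and `sents()` do.

```python
def text_stats(raw, words, sents):
    """Rough per-text statistics from characters, tokens and sentences."""
    num_chars = len(raw)
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})  # case-insensitive vocabulary
    return {
        # characters per token; a rough figure, since raw text includes spaces
        "avg_word_len": num_chars / num_words,
        # tokens per sentence
        "avg_sent_len": num_words / num_sents,
        # how many times each vocabulary item is used on average
        "lexical_diversity": num_words / num_vocab,
    }


# Toy stand-ins for raw(), words() and sents() (not real corpus data)
toy_raw = "The cat sat. The dog sat too."
toy_sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
toy_words = [w for sent in toy_sents for w in sent]

stats = text_stats(toy_raw, toy_words, toy_sents)
print(stats)
```

With the Gutenberg corpus installed, the same function could be called as `text_stats(nltk.corpus.gutenberg.raw(file_id), nltk.corpus.gutenberg.words(file_id), nltk.corpus.gutenberg.sents(file_id))`.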

---

### 2) Brown corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.

This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.

In [None]:
# Brown corpus categories list:

from nltk.corpus import brown
brown.categories()

In [None]:
# We can filter the corpus by (a) one or more categories or (b) file IDs:

print(brown.sents(categories='science_fiction')[:2])

In [None]:
print(brown.sents(categories=['news', 'editorial', 'reviews']))

In [None]:
print(brown.words(fileids=['cg22']))

We can use NLTK's **ConditionalFreqDist** to count how often each word occurs in each genre, which lets us compare word usage across genres:

In [None]:
cfd = nltk.ConditionalFreqDist(
           (genre, word)
           for genre in brown.categories()
           for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

cfd.tabulate(conditions=genres, samples=modals)
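To see what `ConditionalFreqDist` stores, here is a small sketch on made-up data that needs no corpus download. The `toy_corpus` dictionary is an assumption for illustration only; each (condition, sample) pair increments one counter, and each condition maps to its own frequency distribution.

```python
import nltk

# Toy "corpus": genre -> list of tokens (invented data, not from Brown)
toy_corpus = {
    "news": ["the", "vote", "will", "pass", "and", "taxes", "will", "rise"],
    "romance": ["she", "could", "not", "stay", "but", "he", "could"],
}

# Build (condition, sample) pairs exactly as in the Brown example
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre, tokens in toy_corpus.items()
    for word in tokens
)

print(cfd["news"]["will"])       # count of "will" under the "news" condition
print(cfd["romance"]["could"])   # count of "could" under "romance"
print(cfd["news"]["could"])      # unseen samples simply count as 0
```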

#### Brown corpus contains tags with part-of-speech information

[Working with Tagged Corpora](https://www.nltk.org/book/ch05#tagged-corpora) (NLTK book)

In [None]:
words = nltk.corpus.brown.tagged_words(tagset='universal')
words

In [None]:
# islice() reads only the first 300 items instead of the whole corpus

from itertools import islice

# convert the slice to a list; `words` itself is left untouched,
# so this cell can be re-run safely
word_list = list(islice(words, 300))
word_list

In [None]:
# find all words with POS tag "ADJ"

tag = 'ADJ'

[item[0] for item in word_list if item[1] == tag]

**Additional examples** (using FreqDist, ...):
    
[Working with Tagged Corpora](https://www.nltk.org/book/ch05#tagged-corpora)
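As one possible illustration of combining tagged words with `FreqDist`, the sketch below counts tags in a small hand-written list of (word, tag) pairs. The list `toy_tagged` is invented for this example; with the Brown corpus installed, `word_list` from the cells above could be used in its place.

```python
import nltk

# Hand-written (word, tag) pairs in the same shape as tagged_words() output
toy_tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"), (".", "."),
]

# Frequency of each POS tag
tag_fd = nltk.FreqDist(tag for _, tag in toy_tagged)
print(tag_fd.most_common(3))

# Words carrying a given tag, using tuple unpacking instead of item[0]/item[1]
adjectives = [word for word, tag in toy_tagged if tag == "ADJ"]
print(adjectives)
```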

### 3) NLTK Corpus functionality

* fileids()  = the files of the corpus
* fileids([categories])  = the files of the corpus corresponding to these categories

* categories()  = the categories of the corpus
* categories([fileids])  = the categories of the corpus corresponding to these files

* raw()  = the raw content of the corpus
* raw(fileids=[f1,f2,f3])  = the raw content of the specified files
* raw(categories=[c1,c2])  = the raw content of the specified categories

* words()  = the words of the whole corpus
* words(fileids=[f1,f2,f3])  = the words of the specified fileids
* words(categories=[c1,c2])  = the words of the specified categories

* sents()  = the sentences of the whole corpus
* sents(fileids=[f1,f2,f3])  = the sentences of the specified fileids
* sents(categories=[c1,c2])  = the sentences of the specified categories

* abspath(fileid)  = the location of the given file on disk
* encoding(fileid)  = the encoding of the file (if known)
* open(fileid)  = open a stream for reading the given corpus file
* root  = the path to the root of the locally installed corpus

* readme()  = the contents of the README file of the corpus
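Several of the methods listed above can be tried without downloading anything by pointing a corpus reader at your own files. The following is a sketch under stated assumptions: the two text files and their contents are invented, and `PlaintextCorpusReader` is used here as a stand-in for the readers behind `nltk.corpus.gutenberg` and similar corpora. Only `fileids()`, `raw()` and `words()` are shown, because they work with the reader's default word tokenizer and need no extra data.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Invented file names and contents, used only for this demonstration
toy_files = {
    "a.txt": "The cat sat.\n",
    "b.txt": "The dog sat too.\n",
}

with tempfile.TemporaryDirectory() as root:
    for name, content in toy_files.items():
        with open(os.path.join(root, name), "w", encoding="utf8") as f:
            f.write(content)

    # Treat every .txt file under root as one corpus file
    reader = PlaintextCorpusReader(root, r".*\.txt")

    file_ids = reader.fileids()             # the files of the corpus
    raw_a = reader.raw("a.txt")             # raw content of one file
    words_a = list(reader.words("a.txt"))   # tokens of one file
    words_all = list(reader.words())        # tokens of the whole corpus

print(file_ids)
print(words_a)
print(len(words_all))
```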


**Note: if you want to explore these corpora using `nltk.Text` functionality (e.g. as in the Introduction part) you will need to load them into `nltk.Text`**
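Loading tokens into `nltk.Text` can be sketched as follows. The token list is a toy example so that the block runs without any corpus data; with the Gutenberg corpus installed, one would pass something like `nltk.corpus.gutenberg.words('austen-emma.txt')` instead.

```python
import nltk

# Toy token list standing in for corpus.words(...) (invented data)
toy_tokens = "the wind blew and the wind howled all night".split()

# Wrap the tokens to get the nltk.Text exploration methods
text = nltk.Text(toy_tokens)

print(text.count("wind"))             # occurrences of a token
print(text.vocab().most_common(2))    # vocab() returns a FreqDist
text.concordance("wind")              # prints each match in context
```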

# Your turn!

Choose one of the NLTK corpora and **explore it using NLTK** (following the examples here and in the NLTK book).

Also apply what you learned in the section "Computing with Language: Statistics" (FreqDist, ...).

---

**Write code in notebook cells below**.
* add more cells (use "+" icon) if necessary

## Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions.

https://www.nltk.org/book/ch02#lexical-resources

We already used NLTK lexical resources (stopwords and common English words).

## WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets. 

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
# all synonym sets (synsets) that contain the word "wind"

wn.synsets('wind')

In [None]:
# words (lemmas) in one of the synsets:

wn.synset('wind.n.08').lemma_names()

In [None]:
wn.synset('wind.n.08').definition()

In [None]:
wn.synset('wind.n.08').examples()

In [None]:
# let's explore all the synsets for this word

for synset in wn.synsets('wind'):
    print(synset.lemma_names())

In [None]:
# see all lemmas for a given word (each lemma belongs to one synset)

wn.lemmas('curve')

### Try it yourself!

---

**Additional WordNet examples:**
* https://www.nltk.org/book/ch02#wordnet