# Corpus and Lexicon

## Objectives
- Understanding: 
    - relation between corpus and lexicon
    - effects of pre-processing (tokenization) on lexicon
    
- Learning how to:
    - load basic corpora for processing
    - compute basic descriptive statistics of a corpus
    - build a lexicon and frequency lists from a corpus
    - perform basic lexicon operations
    - perform basic text pre-processing (tokenization and sentence segmentation) using Python libraries

### Recommended Reading
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- Steven Bird, Ewan Klein, and Edward Loper. [__Natural Language Processing with Python__ (NLTK)](https://www.nltk.org/book/)

### Covered Material
- SLP
    - [Chapter 2: Regular Expressions, Text Normalization, Edit Distance](https://web.stanford.edu/~jurafsky/slp3/2.pdf) 
- NLTK 
    - [Chapter 2: Accessing Text Corpora and Lexical Resources](https://www.nltk.org/book/ch02.html)
    - [Chapter 3: Processing Raw Text](https://www.nltk.org/book/ch03.html)

### Requirements

- [NLTK](http://www.nltk.org/)
    - run `pip install nltk`
    
- [spaCy](https://spacy.io/)
    - run `pip install spacy`
    - run `python -m spacy download en_core_web_sm` to install the English language model (`spacy>=3.0`)

- [scikit-learn](https://scikit-learn.org/)
    - run `pip install scikit-learn`
    

## 1. Corpora and Counting

### 1.1. Corpus

A [corpus](https://en.wikipedia.org/wiki/Text_corpus) is a collection of written or spoken texts that is used for language research. Before doing anything with a corpus, we need to know its properties:

__Corpus Properties__:
- *Format* -- how to read/load it?
- *Language* -- which tools/models can I use?
- *Annotation* -- what is it intended for?
- *Split* for __Evaluation__: (terminology varies from source to source)

| Set         | Purpose                                       |
|:------------|:----------------------------------------------|
| Training    | training model, extracting rules, etc.        |
| Development | tuning, optimization, intermediate evaluation |
| Test        | final evaluation (remains unseen)             |
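
As a rough illustration of the split in the table above, the sketch below shuffles a list of items and cuts it into training, development and test portions. The helper name `split_corpus`, the 80/10/10 ratios and the fixed seed are assumptions made for this example only; a corpus that ships with an official split should be used with that split.

```python
import random

def split_corpus(items, dev_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle items and return (train, dev, test) lists (hypothetical helper)."""
    items = list(items)                    # copy, so the caller's list is untouched
    random.Random(seed).shuffle(items)     # fixed seed keeps the split reproducible
    n_dev = int(len(items) * dev_ratio)
    n_test = int(len(items) * test_ratio)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]         # everything that is left
    return train, dev, test

sentences = ['sentence {}'.format(i) for i in range(20)]
train, dev, test = split_corpus(sentences)
print(len(train), len(dev), len(test))
```

The three parts are disjoint by construction, so no test item can leak into training.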


#### 1.1.1. Text Corpora in NLTK
NLTK provides several corpora with loading functions. The plain text corpora used here come from _Project Gutenberg_.

`nltk.corpus.gutenberg.fileids()` lists available books.

In [1]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\adnan\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adnan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

#### 1.1.2. Units of Text Corpus
Depending on the goal, a corpus can be seen as a sequence of:
- characters
- words (tokens)
- sentences
- paragraphs
- documents

Each level, in turn, can be seen as a sequence of elements of the previous level.

- word -- a sequence of characters
- sentence -- a sequence of words
- paragraph -- a sequence of sentences
- document -- a sequence of paragraphs (or sentences, depending on our purpose)
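
The nesting of levels can be sketched with plain Python lists. The two-sentence `document` string and the naive splitting on `.` and on whitespace are assumptions for illustration only; real segmentation is covered later in this notebook.

```python
document = "Alice was tired. She saw a rabbit."

# document -> sentences (naive split on the full stop, illustration only)
sentences = [s.strip() for s in document.split('.') if s.strip()]

# sentence -> words (naive split on whitespace)
words_per_sentence = [s.split() for s in sentences]

# word -> characters
chars_of_first_word = list(words_per_sentence[0][0])

print(sentences)
print(words_per_sentence)
print(chars_of_first_word)
```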

#### 1.1.3. Loading NLTK Corpora

NLTK provides functions to load a corpus at these different levels: `raw` (characters), `words`, and `sents` (sentences).

In [5]:
alice_chars = nltk.corpus.gutenberg.raw('carroll-alice.txt')
print('chars:', alice_chars[0])
alice_words = nltk.corpus.gutenberg.words('carroll-alice.txt')
print('words:', alice_words[0])
alice_sents = nltk.corpus.gutenberg.sents('carroll-alice.txt')
print('sents:', alice_sents[0])

chars: [
words: [
sents: ['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']']


### 1.2. Corpus Descriptive Statistics (Counting)

*Corpus* can be described in terms of:

- total number of characters
- total number of words (_tokens_: includes punctuation, etc.)
- total number of sentences

- minimum/maximum/average number of characters per token
- minimum/maximum/average number of words per sentence
- minimum/maximum/average number of sentences per document


__Example__

$$\text{Avg. Sentence Length} = \frac{\text{count}(tokens)}{\text{count}(sentences)}$$


In [6]:
# let's compute average sentence length & round to the closest integer
round(len(alice_words)/len(alice_sents))

20

In [12]:
# let's compute length of each sentence
sent_lens = [len(sent) for sent in alice_sents]
# let's compute length of each word
word_lens = [len(word) for word in alice_words]
# let's compute the number of characters in each sentence (token characters only, no whitespace)
chars_lens = [len(''.join(sent)) for sent in alice_sents]

avg_sent_len = round(sum(sent_lens)/len(sent_lens))
min_sent_len = min(sent_lens)
max_sent_len = max(sent_lens)
print("AVG sent len", avg_sent_len)
print("MIN sent len", min_sent_len)
print("MAX sent len", max_sent_len)


AVG sent len 20
MIN sent len 2
MAX sent len 204


In [9]:
# str.join example: concatenate the items of a list using the string as separator
tmp = ['H', 'e', 'l', 'l', 'o']
print(''.join(tmp))
print('⭐'.join(tmp))

Hello
H⭐e⭐l⭐l⭐o


#### Exercise

- Define a function to compute corpus descriptive statistics

    - input:
        - raw text (Chars)
        - words
        - sentences
    - output (print): 
        - average number of:
            - chars per word
            - words per sentence
            - chars per sentence
        - Size of the longest word and sentence


In [51]:
def statistics(chars, words, sents):
    word_lens = [len(word) for word in words]
    sent_lens = [len(sent) for sent in sents]
    chars_in_sents = [len(''.join(sent)) for sent in sents]
    
    word_per_sent = round(sum(sent_lens) / len(sents))
    char_per_word = round(sum(word_lens) / len(words))
    char_per_sent = round(sum(chars_in_sents) / len(sents))
    
    longest_sentence = max(sent_lens)
    longest_word = max(word_lens)
    
    return word_per_sent, char_per_word, char_per_sent, longest_sentence, longest_word

word_per_sent, char_per_word, char_per_sent, longest_sent, longest_word = statistics(alice_chars, alice_words, alice_sents)

print('Word per sentence', word_per_sent)
print('Char per word', char_per_word)
print('Char per sentence', char_per_sent)
print('Longest sentence', longest_sent)
print('Longest word', longest_word)

Word per sentence 20
Char per word 3
Char per sentence 68
Longest sentence 204
Longest word 14


## 2. Lexicon

[Lexicon](https://en.wikipedia.org/wiki/Lexicon) is the *vocabulary* of a language. In linguistics, a lexicon is a language's inventory of lexemes.

Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalog of a language's words; and a grammar, a system of rules which allow for the combination of those words into meaningful sentences. 

*Lexicon (or Vocabulary) Size* is one of the statistics reported for corpora. While *Word Count* is the number of __tokens__, *Lexicon Size* is the number of __types__ (unique words).
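
The token/type distinction can be sketched on a toy example. The seven-token list is made up for illustration and is not taken from the corpus.

```python
# a toy tokenized sentence: 'the' occurs twice
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

# types are the distinct tokens
types = set(tokens)

print('word count (tokens):', len(tokens))
print('lexicon size (types):', len(types))
```

Here the word count is 7 and the lexicon size is 6, because the repeated `the` counts once as a type.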


### 2.1. Lexicon and Its Size

#### 2.1.1. Constructing Lexicon and Computing its Size

Since a lexicon is a collection of unique elements, it can be built as a `set` of corpus words (i.e. tokens).
Consequently, its size is the size of the set.

In [21]:
alice_lexicon = set(alice_words)
len(alice_lexicon)

3016

__NOTE__:
We did not process our corpus in any way. Consequently, words with case variations are different entries in our lexicon.

In [22]:
print('ALL' in alice_lexicon)
print('All' in alice_lexicon)
print('all' in alice_lexicon)

True
True
True


#### 2.1.2. Lowercased Lexicon
Let's lowercase our corpus and re-compute the lexicon size.

In [40]:
alice_lexicon = set([w.lower() for w in alice_words])
len(alice_lexicon)

2636

In [24]:
print('ALL' in alice_lexicon)
print('All' in alice_lexicon)
print('all' in alice_lexicon)

False
False
True


### 2.2. Frequency List

In Natural Language Processing (NLP), [a frequency list](https://en.wikipedia.org/wiki/Word_lists_by_frequency) is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

What is a "word"?

- case sensitive counts
- case insensitive counts

#### 2.2.1. Computing Frequency List with Python

In Python, a frequency list can be constructed in several ways. The most convenient is `collections.Counter`.

In [13]:
from collections import Counter
alice_freq_list = Counter(alice_words)

In [14]:
print(alice_freq_list.get('ALL', 0))
print(alice_freq_list.get('All', 0))
print(alice_freq_list.get('all', 0))

4
5
173
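
The rank mentioned in the definition of a frequency list can be derived from the counts. The sketch below uses a made-up token list (an assumption for illustration) to contrast case-sensitive and case-insensitive counting, and to assign 1-based ranks with `Counter.most_common`.

```python
from collections import Counter

# toy tokens with case variation (not taken from the corpus)
tokens = ['All', 'all', 'the', 'The', 'the', 'all', 'ALL']

case_sensitive = Counter(tokens)
case_insensitive = Counter(t.lower() for t in tokens)

# most_common() sorts by descending frequency; the position gives the rank
ranked = [(rank, word, freq)
          for rank, (word, freq) in enumerate(case_insensitive.most_common(), start=1)]

print(case_sensitive)
print(ranked)
```

Lowercasing merges `All`, `ALL` and `all` into a single entry, which changes both the counts and the ranking.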


#### 2.2.2. Computing Frequency List with NLTK
NLTK provides the `FreqDist` class to construct a frequency list (`FreqDist` == _Frequency Distribution_).

In [15]:
alice_freq_dist = nltk.FreqDist(alice_words)

In [16]:
print(alice_freq_dist.get('ALL', 0))
print(alice_freq_dist.get('All', 0))
print(alice_freq_dist.get('all', 0))

4
5
173


#### Exercise

- compute frequency list of __lowercased__ "alice" corpus (you can use either method)
- report the `5` most frequent words (you can use the provided `nbest` function to get a dict of the top N items)
- compare the frequencies to the reference values below

| Word   | Frequency |
|--------|----------:|
| ,      |     1,993 |
| '      |     1,731 |
| the    |     1,642 |
| and    |       872 |
| .      |       764 |


In [38]:
def nbest(d, n=1):
    """
    get n max values from a dict
    :param d: input dict (values are numbers, keys are strings)
    :param n: number of values to get (int)
    :return: dict of top n key-value pairs
    """
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])

In [60]:
alice_lowercase_freq_list = Counter([word.lower() for word in alice_words])
nbest(alice_lowercase_freq_list, 5)

{',': 1993, "'": 1731, 'the': 1642, 'and': 872, '.': 764}

### 2.3. Lexicon Operations

It is common to process the lexicon according to the task at hand (not every transformation makes sense for all tasks). The common operations are removing words by frequency (minimum or maximum, i.e. *Frequency Cut-Off*) and removing words from a specific list (i.e. *Stop Word Removal*).

#### 2.3.1. Frequency Cut-Off

##### Exercise

<!-- - define a function to compute a lexicon from a frequency list applying minimum and maximum frequency cut-offs
    
    - input: frequence list (dict)
    - output: list
    - use default values for min and max
     -->
- Using the function cut_off
    
    - compute lexicon applying:
    
        - minimum cut-off 2 (remove words that appear less than 2 times, i.e. remove [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon))
        - maximum cut-off 100 (remove words that appear more than 100 times)
        - both minimum and maximum thresholds together
        
    - report size for each comparing to the reference values in the table (on the lowercased lexicon)

| Operation  | Min | Max | Size |
|------------|----:|----:|-----:|
| original   | N/A | N/A | 2636 |
| cut-off    |   2 | N/A | 1503 |
| cut-off    | N/A | 100 | 2586 |
| cut-off    |   2 | 100 | 1453 |


In [69]:
def cut_off(vocab, n_min=1, n_max=float('inf')):
    """Keep words whose frequency lies in [n_min, n_max]; the defaults apply no cut-off."""
    new_vocab = []
    for word, count in vocab.items():
        if n_min <= count <= n_max:
            new_vocab.append(word)
    return new_vocab

lower_bound = float(2) # Change these two number to compute the required cut offs
upper_bound = float(100)
lexicon_cut_off = len(cut_off(alice_lowercase_freq_list, n_min=lower_bound, n_max=upper_bound))

print('Original', len(alice_lowercase_freq_list))
print('CutOFF Min:', lower_bound, 'MAX:', upper_bound, ' Lexicon Size:', lexicon_cut_off)

Original 2636
CutOFF Min: 2.0 MAX: 100.0  Lexicon Size: 1453
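
The three cut-off variants from the table can be sketched on a toy frequency list. The helper name `apply_cut_off` and the four made-up counts are assumptions for this example; the defaults are chosen so that omitting a bound means "no cut-off on that side".

```python
from collections import Counter

def apply_cut_off(freq_list, n_min=1, n_max=float('inf')):
    """Return the words whose frequency lies in [n_min, n_max] (hypothetical helper)."""
    return [word for word, count in freq_list.items() if n_min <= count <= n_max]

# toy frequency list (made-up counts)
toy_freq = Counter({'the': 120, 'alice': 40, 'rabbit': 5, 'daisy': 1})

min_only = apply_cut_off(toy_freq, n_min=2)              # drops hapax legomena
max_only = apply_cut_off(toy_freq, n_max=100)            # drops very frequent words
both = apply_cut_off(toy_freq, n_min=2, n_max=100)       # applies both thresholds

print(sorted(min_only))
print(sorted(max_only))
print(sorted(both))
```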


#### 2.3.2. StopWord Removal

In computing, [stop words](https://en.wikipedia.org/wiki/Stop_words) are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose.
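
For instance, a task-specific stop list can be an ordinary Python `set`. The token list and the three stop words below are made up for illustration.

```python
# toy token list and a hand-picked stop list (both are assumptions)
tokens = ['the', 'rabbit', 'ran', 'down', 'the', 'hole', 'and', 'alice', 'followed']
custom_stop_words = {'the', 'and', 'down'}

# keep only tokens that are not in the stop list (order is preserved)
content_tokens = [t for t in tokens if t not in custom_stop_words]

print(content_tokens)
```

Using a `set` for the stop list keeps each membership test cheap, which matters once the token list is a whole corpus.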

Let's check the stop word lists from the popular python libraries.

- spaCy
- NLTK
- scikit-learn

    
For NLTK we need to download them first

```python
import nltk
nltk.download('stopwords')
```

In [17]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adnan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [63]:
from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOP_WORDS
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as SKLEARN_STOP_WORDS
from nltk.corpus import stopwords

NLTK_STOP_WORDS = set(stopwords.words('english'))

print('spaCy: {}'.format(len(SPACY_STOP_WORDS)))
print('NLTK: {}'.format(len(NLTK_STOP_WORDS)))
print('sklearn: {}'.format(len(SKLEARN_STOP_WORDS)))


spaCy: 326
NLTK: 179
sklearn: 318


##### Exercise
- using Python's built-in `set` [methods](https://docs.python.org/3/library/stdtypes.html#set):
    - compute the intersection between the 100 most frequent words in the frequency list of the alice corpus and the list of stopwords (report count)
    - remove stopwords from the lexicon
    - print the size of:
        - original lexicon
        - lexicon without stopwords
        - overlap between 100 most freq. words and stopwords

| Operation       | Size |
|-----------------|-----:|
| original        | 2636 |
| no stop words   | 2490 |
| top 100 overlap |   65 |

In [70]:
# Set built-in Function
set_a = set(['a', 'b', 'c', 'd', 'e'])
set_b = set(['a', 'b', 'f'])

print(set_a.intersection(set_b)) # Compute overlap
print(set_a.difference(set_b)) # Remove Elements by computing the set diff

{'b', 'a'}
{'e', 'c', 'd'}


In [76]:
alice_vocab = set([w.lower() for w in alice_words])
top100 = list(nbest(alice_lowercase_freq_list, n=100).keys())
stop_words = NLTK_STOP_WORDS
overlap = set(top100).intersection(set(stop_words)) # Compute the intersection between top100 and stop_words
alice_vocab_no_stopwords = alice_vocab.difference(set(stop_words))
print('Original', len(alice_vocab))
print('No stopwords', len(alice_vocab_no_stopwords))
print('Top100 overlap', len(overlap))

Original 2636
No stopwords 2490
Top100 overlap 65


## 3. Basic Text Pre-processing

Both frequency cut-off and stop word removal are frequently used text pre-processing steps. Depending on the application, several other common pre-processing steps are applied to transform text for Machine Learning tasks.

__Text Normalization Steps__

- removing extra white spaces

- tokenization
    - documents to sentences (sentence segmentation/tokenization)
    - sentences to tokens

- lowercasing/uppercasing


- removing punctuation

- removing accent marks and other diacritics 

- removing stop words (see above)

- removing sparse terms (frequency cut-off)

- number normalization
    - numbers to words (e.g. `10` to `ten`)
    - number words to numbers (e.g. `ten` to `10`)
    - removing numbers

- verbalization (specifically for speech applications)

    - numbers to words
    - expanding abbreviations (or spelling out)
    - reading out dates, etc.
    

- [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation)
    - the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

- [stemming](https://en.wikipedia.org/wiki/Stemming)
    - the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.
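
A few of the steps listed above can be sketched with the standard library alone. The function name `normalize`, the order of the steps and the sample string are assumptions for illustration; which steps to apply, and in which order, depends on the task.

```python
import re
import string
import unicodedata

def normalize(text):
    """Apply a handful of normalization steps (hypothetical helper)."""
    # remove extra white space
    text = re.sub(r'\s+', ' ', text).strip()
    # lowercase
    text = text.lower()
    # remove diacritics: decompose characters, then drop the combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # remove ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

print(normalize("  The   Café's  menu,  naïve   but GOOD!  "))
```

Note that removing punctuation also merges `café's` into `cafes`, which shows why not every step suits every task.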


### 3.1. Tokenization and Sentence Segmentation

Given a "clean" text, in order to allow any analysis, we need to identify its units.
In other words, we need to _segment_ the text into sentences and words.

__NOTE__:
Since both _tokenization_ and _sentence segmentation_ are automatic, different tools yield different results.
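
The effect is easy to reproduce with two simple tokenizers from the standard library. Both strategies and the sample sentence are assumptions for illustration; neither is the tokenizer used by spaCy or NLTK.

```python
import re

text = "Alice's sister didn't look up."

# strategy 1: split on white space only (punctuation stays attached to words)
whitespace_tokens = text.split()

# strategy 2: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

The same sentence yields 5 tokens with the first strategy and 10 with the second, so any lexicon or statistic built on top of the tokens differs as well.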

#### 3.1.1. Tokenization and Sentence Segmentation with spaCy
The default spaCy NLP pipeline does several processing steps including __tokenization__, *part of speech tagging*, lemmatization, *dependency parsing* and *Named Entity Recognition* (ignore the ones in *italics* for today). 


SpaCy produces a `Doc` object that contains `Span`s (sentences) and `Token`s (words).

In [18]:
import spacy
# load the small English pipeline
nlp = spacy.load("en_core_web_sm")
# if spacy.load cannot find the model by name, import the installed package instead:
# import en_core_web_sm
# nlp = en_core_web_sm.load()
txt = alice_chars

In [19]:
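# process the document
# NOTE: in the spaCy 3 `en_core_web_sm` pipeline the parser, which sets sentence boundaries,
# relies on `tok2vec`; disabling `tok2vec` here is the likely reason why only one sentence
# is reported in the output below
doc = nlp(txt, disable=["tok2vec", "tagger", "ner"])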



In [91]:
print("first token: '{}'".format(doc[0]))
print("first sentence: '{}'".format(list(doc.sents)[0]))

first token: '['
first sentence: '[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!
Oh dear! I shall be late!' (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time

In [92]:
# access list of tokens (Token objects)
print(len(doc))
# access list of sentences (Span objects)
print(len(list(doc.sents)))

37033
1


#### 3.1.2. Tokenization and Sentence Segmentation with NLTK
NLTK's [tokenize](https://www.nltk.org/api/nltk.tokenize.html) package provides similar functionality using the methods below.

- `word_tokenize` 
- `sent_tokenize`

There are several tokenizers available (read the documentation for more information).

In [83]:
# download NLTK tokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adnan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [84]:
alice_words_nltk = nltk.word_tokenize(alice_chars)
alice_sents_nltk = nltk.sent_tokenize(alice_chars)
print(len(alice_words_nltk))
print(len(alice_sents_nltk))

33494
1625


In [85]:
print("first token: '{}'".format(alice_words_nltk[0]))
print("first sentence: '{}'".format(alice_sents_nltk[0]))

first token: '['
first sentence: '[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I.'


## Lab Exercise
- Load another corpus from Gutenberg (e.g. `milton-paradise.txt`)
- Compute descriptive statistics on the __reference__ (.raw, words, etc.) sentences and tokens.
- Compute descriptive statistics in the __automatically__ processed corpus
    - both with `spacy` and `nltk`
- Compute lowercased lexicons for all 3 (reference, spacy and nltk) versions of the corpus
    - compare lexicon sizes
- Compute frequency distribution for all 3 (reference, spacy and nltk) versions of the corpus
    - compare top N frequencies

# Solution

In [2]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\adnan\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adnan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1. Load another corpus from Gutenberg (e.g. milton-paradise.txt)

In [22]:
whitman_chars = nltk.corpus.gutenberg.raw('whitman-leaves.txt')
print('chars:', whitman_chars[0])
whitman_words = nltk.corpus.gutenberg.words('whitman-leaves.txt')
print('words:', whitman_words[0])
whitman_sents = nltk.corpus.gutenberg.sents('whitman-leaves.txt')
print('sents:', whitman_sents[0])

chars: [
words: [
sents: ['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']']


## 2. Compute descriptive statistics on the reference (.raw, words, etc.) sentences and tokens.

In [23]:
def statistics(chars, words, sents):
    word_lens = [len(word) for word in words]
    sent_lens = [len(sent) for sent in sents]
    chars_len_in_sents = [len(''.join(sent)) for sent in sents]

    # total number of characters
    total_num_chars = len(chars)
    # total number of words (tokens: includes punctuation, etc.)
    total_num_words = len(words)
    # total number of sentences
    total_num_sents = len(sents)

    # minimum/maximum/average number of character per token
    min_char_per_token = min(word_lens)
    max_char_per_token = max(word_lens)
    avg_char_per_token = round(sum(word_lens) / len(words))

    # minimum/maximum/average number of words per sentence
    min_word_per_sent = min(sent_lens)
    max_word_per_sent = max(sent_lens)
    avg_word_per_sent = round(sum(sent_lens) / len(sents))

    # minimum/maximum/average number of characters per sentence
    # (token characters only; the corpus is loaded as a single document,
    # so sentences-per-document statistics are not computed here)
    min_char_per_sent = min(chars_len_in_sents)
    max_char_per_sent = max(chars_len_in_sents)
    avg_char_per_sent = round(sum(chars_len_in_sents) / len(chars_len_in_sents))

    return total_num_chars, total_num_words, total_num_sents, min_char_per_token, max_char_per_token, avg_char_per_token, min_word_per_sent, max_word_per_sent, avg_word_per_sent, min_char_per_sent, max_char_per_sent, avg_char_per_sent

In [24]:
# calling the statistics function
total_num_chars_, total_num_words_, total_num_sents_, min_char_per_token_, max_char_per_token_, avg_char_per_token_, \
    min_word_per_sent_, max_word_per_sent_, avg_word_per_sent_, min_char_per_sent_, max_char_per_sent_, \
    avg_char_per_sent_ = statistics(whitman_chars, whitman_words, whitman_sents)

In [25]:
print('Number of characters:', total_num_chars_)
print('Number of words:', total_num_words_)
print('Number of sentences:', total_num_sents_)
print()

print('Minimum number of characters per token:', min_char_per_token_)
print('Maximum number of characters per token:', max_char_per_token_)
print('Average number of characters per token:', avg_char_per_token_)
print()

print('Minimum number of words per sentence:', min_word_per_sent_)
print('Maximum number of words per sentence:', max_word_per_sent_)
print('Average number of words per sentence:', avg_word_per_sent_)
print()

print('Minimum number of characters per sentence:', min_char_per_sent_)
print('Maximum number of characters per sentence:', max_char_per_sent_)
print('Average number of characters per sentence:', avg_char_per_sent_)

Number of characters: 711215
Number of words: 154883
Number of sentences: 4250

Minimum number of characters per token: 1
Maximum number of characters per token: 16
Average number of characters per token: 4

Minimum number of words per sentence: 2
Maximum number of words per sentence: 1378
Average number of words per sentence: 36

Minimum number of characters per sentence: 3
Maximum number of characters per sentence: 5280
Average number of characters per sentence: 135


## 3. Compute descriptive statistics in the automatically processed corpus
- both with `spacy` and `nltk`

### a) Descriptive Statistics with NLTK

In [27]:
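# Convert raw text into sentences and create document(s) using NLTK
nltk.download('punkt')  # download the required tokenizer

sentences = nltk.sent_tokenize(whitman_chars)
# NOTE: each document holds 10 sentences, but the start index advances by 30,
# so only one block of 10 out of every 30 sentences is kept (about a third of the corpus,
# which is why the totals below are smaller than the reference ones).
# Use the same value for the slice length and the step to cover the whole text.
documents = [' '.join(sentences[i:i+10]) for i in range(0, len(sentences), 30)]
documents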

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adnan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


["[Leaves of Grass by Walt Whitman 1855]\n\n\nCome, said my soul,\nSuch verses for my Body let us write, (for we are one,)\nThat should I after return,\nOr, long, long hence, in other spheres,\nThere to some group of mates the chants resuming,\n(Tallying Earth's soil, trees, winds, tumultuous waves,)\nEver with pleas'd smile I may keep on,\nEver and ever yet the verses owning--as, first, I here and now\nSigning for Soul and Body, set to them my name,\n\nWalt Whitman\n\n\n\n[BOOK I.  INSCRIPTIONS]\n\n}  One's-Self I Sing\n\nOne's-self I sing, a simple separate person,\nYet utter the word Democratic, the word En-Masse. Of physiology from top to toe I sing,\nNot physiognomy alone nor brain alone is worthy for the Muse, I say\n    the Form complete is worthier far,\nThe Female equally with the Male I sing. Of Life immense in passion, pulse, and power,\nCheerful, for freest action form'd under the laws divine,\nThe Modern Man I sing.\n\n\n\n} As I Ponder'd in Silence\n\nAs I ponder'd in sil

In [28]:
# Initialize variables for statistics
num_characters = 0
num_tokens = 0
num_sentences = 0
num_docs = len(documents)
char_per_token = []
words_per_sentence = []
sentences_per_doc = []

# Process each document in the corpus
for doc_text in documents:
    # Tokenize the text into sentences and words
    sentences = nltk.sent_tokenize(doc_text)
    words = nltk.word_tokenize(doc_text)

    # Calculate statistics for the current document
    num_characters += len(doc_text)
    num_tokens += len(words)
    num_sentences += len(sentences)

    char_per_token += [len(token) for token in words]
    words_per_sentence += [len(nltk.word_tokenize(sent)) for sent in sentences]
    sentences_per_doc.append(len(sentences))

# Compute statistics
avg_char_per_token = sum(char_per_token) / len(char_per_token)
avg_words_per_sentence = sum(words_per_sentence) / len(words_per_sentence)
avg_sentences_per_doc = sum(sentences_per_doc) / len(sentences_per_doc)

print(f"Total number of characters: {num_characters}")
print(f"Total number of tokens: {num_tokens}")
print(f"Total number of sentences: {num_sentences}")
print()

print(f"Minimum number of characters per token: {min(char_per_token)}")
print(f"Maximum number of characters per token: {max(char_per_token)}")
print(f"Average number of words per sentence: {avg_words_per_sentence:.2f}")
print()

print(f"Minimum number of sentences per document: {min(sentences_per_doc)}")
print(f"Maximum number of sentences per document: {max(sentences_per_doc)}")
print(f"Average number of sentences per document: {avg_sentences_per_doc:.2f}")

Total number of characters: 223068
Total number of tokens: 47107
Total number of sentences: 1280

Minimum number of characters per token: 1
Maximum number of characters per token: 20
Average number of words per sentence: 36.80

Minimum number of sentences per document: 10
Maximum number of sentences per document: 10
Average number of sentences per document: 10.00


### b) Descriptive Statistics with Spacy

In [9]:
import spacy

# Convert raw text into sentences using spaCy
nlp = spacy.load('en_core_web_sm')  # load the English language model

doc = nlp(whitman_chars)
documents = [sent.text for sent in doc.sents]  # create a list of sentences (each one is treated as a document below)
documents

['[Leaves of Grass by Walt Whitman 1855]\n\n\nCome, said my soul,\nSuch verses for my Body let us write, (for we are one,)\n',
 "That should I after return,\nOr, long, long hence, in other spheres,\nThere to some group of mates the chants resuming,\n(Tallying Earth's soil, trees, winds, tumultuous waves,)\nEver with pleas'd smile I may keep on,\nEver and ever yet the verses owning--as, first, I here and now\nSigning for Soul and Body, set to them my name,\n\nWalt Whitman\n\n\n\n[BOOK I.  INSCRIPTIONS]\n\n}  One's-Self I Sing\n\nOne's-self I sing, a simple separate person,\nYet utter the word Democratic, the word En-Masse.\n\n",
 'Of physiology from top to toe I sing,\nNot physiognomy alone nor brain alone is worthy for the Muse, I say\n    the Form complete is worthier far,\nThe Female equally with the Male I sing.\n\n',
 "Of Life immense in passion, pulse, and power,\nCheerful, for freest action form'd under the laws divine,\nThe Modern Man I sing.\n\n\n\n}  ",
 "As I Ponder'd in Sile

In [10]:
# Initialize variables for statistics
num_characters = 0
num_tokens = 0
num_sentences = 0
num_docs = len(documents)
char_per_token = []
words_per_sentence = []
sentences_per_doc = []

# Process each document in the corpus
for doc_text in documents:
    doc = nlp(doc_text)

    # Calculate statistics for the current document
    num_characters += len(doc_text)
    num_tokens += len(doc)
    num_sentences += len(list(doc.sents))

    char_per_token += [len(token) for token in doc if not token.is_space]
    words_per_sentence += [len(sent) for sent in doc.sents]
    sentences_per_doc.append(len(list(doc.sents)))

# Compute statistics
avg_char_per_token = sum(char_per_token) / len(char_per_token)
avg_words_per_sentence = sum(words_per_sentence) / len(words_per_sentence)
avg_sentences_per_doc = sum(sentences_per_doc) / len(sentences_per_doc)

# Print results
print(f"Total number of characters: {num_characters}")
print(f"Total number of tokens: {num_tokens}")
print(f"Total number of sentences: {num_sentences}")

print()
print(f"Minimum number of characters per token: {min(char_per_token)}")
print(f"Maximum number of characters per token: {max(char_per_token)}")
print(f"Average number of characters per token: {avg_char_per_token:.2f}")

print()
print(f"Minimum number of words per sentence: {min(words_per_sentence)}")
print(f"Maximum number of words per sentence: {max(words_per_sentence)}")
print(f"Average number of words per sentence: {avg_words_per_sentence:.2f}")

print()
print(f"Minimum number of sentences per document: {min(sentences_per_doc)}")
print(f"Maximum number of sentences per document: {max(sentences_per_doc)}")
print(f"Average number of sentences per document: {avg_sentences_per_doc:.2f}")

Total number of characters: 710469
Total number of tokens: 165896
Total number of sentences: 4012

Minimum number of characters per token: 1
Maximum number of characters per token: 21
Average number of characters per token: 3.78

Minimum number of words per sentence: 1
Maximum number of words per sentence: 1174
Average number of words per sentence: 41.35

Minimum number of sentences per document: 1
Maximum number of sentences per document: 3
Average number of sentences per document: 1.02


## 4. Compute lowercased lexicons for all 3 (reference, spacy and nltk) versions of the corpus
- compare lexicon sizes

In [11]:
# Constructing Lexicon

# reference corpus
whitman_words_lower = [w.lower() for w in whitman_words]
text = " ".join(whitman_words_lower)

reference_lower_lexicons = set(whitman_words_lower)

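# spacy corpus
nlp = spacy.load('en_core_web_sm')
whitman_words_spacy = nlp(text)
# NOTE: iterating over a Doc yields Token objects, and every Token is a distinct set element,
# so this set has one entry per token (hence the inflated size reported below).
# For a lexicon of word types use: set(token.text for token in whitman_words_spacy)
spacy_lower_lexicons = set(whitman_words_spacy)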

# nltk corpus
whitman_words_nltk = nltk.word_tokenize(text)
nltk_lower_lexicons = set(whitman_words_nltk)

In [12]:
# Computing lexicon Sizes
# reference corpus size
print('Reference corpus size:')
print('Number of words:', len(reference_lower_lexicons))

# spacy corpus size
print('Spacy corpus size:')
print('Number of words:', len(spacy_lower_lexicons))

# nltk corpus size
print('Nltk corpus size:')
print('Number of words:', len(nltk_lower_lexicons))

Reference corpus size:
Number of words: 12452
Spacy corpus size:
Number of words: 155417
Nltk corpus size:
Number of words: 12421


## 5. Compute frequency distribution for all 3 (reference, spacy and nltk) versions of the corpus
- compare top N frequencies

#### Calculating frequency distribution

In [13]:
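# reference corpus
# NOTE: this counts the original (not lowercased) tokens, so e.g. 'I' stays uppercase below;
# use whitman_words_lower for a like-for-like comparison with the NLTK version
whitman_words_lower_freq = nltk.FreqDist(whitman_words)

# spacy corpus
# NOTE: counting Token objects gives every token a frequency of 1 (see the output below);
# count the token texts instead: nltk.FreqDist(token.text for token in whitman_words_spacy)
whitman_words_spacy_freq = nltk.FreqDist(whitman_words_spacy)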

# nltk corpus
whitman_words_nltk_freq = nltk.FreqDist(whitman_words_nltk)

In [14]:
# set N to 10
N = 10

In [15]:
# getting top N frequencies
# reference corpus
print('Reference corpus top 10 frequencies:')
print(whitman_words_lower_freq.most_common(N))

Reference corpus top 10 frequencies:
[(',', 17713), ('the', 8814), ('and', 4797), ('of', 4127), ('I', 2932), ("'", 2362), ('to', 1930), ('-', 1774), ('.', 1769), ('in', 1714)]


In [16]:
# spacy corpus
print('Spacy corpus top 10 frequencies:')
print(whitman_words_spacy_freq.most_common(N))

Spacy corpus top 10 frequencies:
[([, 1), (leaves, 1), (of, 1), (grass, 1), (by, 1), (walt, 1), (whitman, 1), (1855, 1), (], 1), (come, 1)]


In [17]:
# nltk corpus
print('Nltk corpus top 10 frequencies:')
print(whitman_words_nltk_freq.most_common(N))

Nltk corpus top 10 frequencies:
[(',', 17936), ('the', 10113), ('and', 5334), ('of', 4265), ('i', 2933), ("'", 2374), ('to', 2244), ('.', 1890), ('in', 1875), ('-', 1774)]
