# Corpora

In [47]:
import nltk

If you are working locally, you can download parts or all of NLTK, including its corpora, with the following code:

In [48]:
#nltk.download()

In [49]:
nltk.download('gutenberg')
from nltk.corpus import gutenberg

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


The function `fileids()` returns a list of the files in a corpus

In [50]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [51]:
len(gutenberg.fileids())

18

The function `sents` returns the sentences of a file as a list. Each sentence is represented as an inner list of tokens, with sentence boundaries detected by the [Punkt](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) tokenizer. The cell below applies it to the King James Bible (`bible-kjv.txt`); the same call works for any file in the corpus, such as Jane Austen's "Emma" (`austen-emma.txt`). See also the [source](https://www.gutenberg.org/ebooks/158) of "Emma".

In [52]:
nltk.download('punkt_tab')
sents = gutenberg.sents('bible-kjv.txt')
print(len(sents))
print(sents)
print(sents[3])

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


30103
[['[', 'The', 'King', 'James', 'Bible', ']'], ['The', 'Old', 'Testament', 'of', 'the', 'King', 'James', 'Bible'], ...]
['1', ':', '1', 'In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']


The function `words` returns the words of a file as a list. We will use it to obtain the words of the King James Bible.

In [53]:
words = gutenberg.words('bible-kjv.txt')
print(len(words))
print(words)
print(words[11:20])

1010654
['[', 'The', 'King', 'James', 'Bible', ']', 'The', ...]
['King', 'James', 'Bible', 'The', 'First', 'Book', 'of', 'Moses', ':']


The function `raw` returns the raw text of a file as a single string. We will use it to obtain the raw text of the King James Bible.

In [54]:
# raw returns the whole file as one string, so len(raw) is the number of characters
raw = gutenberg.raw('bible-kjv.txt')
print(len(raw))
print(raw[1:200])

4332554
The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without 


What is the average length of sentences in number of tokens, and of tokens in number of characters, in the King James Bible?

In [55]:
# This assumes that tokens are defined as words; len(raw) also counts whitespace characters
sent_length = len(words)/len(sents)
print("Average number of tokens in sentences %.2f" %(sent_length))
word_length = len(raw)/len(words)
print("Average number of characters in tokens %.2f" %(word_length))

Average number of tokens in sentences 33.57
Average number of characters in tokens 4.29
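Dividing `len(raw)` by `len(words)` also counts spaces and line breaks as characters, so it slightly overstates token length. A minimal sketch of computing both averages directly from the tokens follows; `toy_sents` and `average_lengths` are hypothetical names, and the two toy sentences only stand in for the output of `gutenberg.sents`.

```python
# Toy stand-in for gutenberg.sents(...): a list of sentences, each a list of tokens
toy_sents = [
    ["In", "the", "beginning", "God", "created", "the", "heaven", "and", "the", "earth", "."],
    ["And", "God", "said", ",", "Let", "there", "be", "light", "."],
]

def average_lengths(sents):
    """Return (mean tokens per sentence, mean characters per token)."""
    # flatten the nested list into one list of tokens
    tokens = [token for sent in sents for token in sent]
    tokens_per_sentence = len(tokens) / len(sents)
    # count only the characters that belong to tokens, not the whitespace between them
    chars_per_token = sum(len(token) for token in tokens) / len(tokens)
    return tokens_per_sentence, chars_per_token

tokens_per_sentence, chars_per_token = average_lengths(toy_sents)
print("Average number of tokens in sentences %.2f" % tokens_per_sentence)
print("Average number of characters in tokens %.2f" % chars_per_token)
```

Passing `gutenberg.sents('bible-kjv.txt')` instead of `toy_sents` should give the corpus-level figures, under the same definition of a token.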


Relationship between types and instances

In [56]:
import math
# %d formats the value as an integer
print("instances (n for total length): %d" %(len(words)))
# The set() function in Python is used to create a set, which is an unordered collection of unique elements
print("types (V for distinct words): %d" %(len(set(words))))
print("types to instances ratio: %.4f" %(len(set(words))/len(words)))
print("square root of instances: %d" %(math.sqrt(len(words))))
print("Does K fall between the typical 10 and 100?")
# dividing by the square root of n corresponds to an exponent b of 0.5
print("Assuming b=0.5 => K = V/sqrt(n) = %d" %(len(set(words)) / math.sqrt(len(words))))

instances (n for total length): 1010654
types (V for distinct words): 13769
types to instances ratio: 0.0136
square root of instances: 1005
Does K fall between the typical 10 and 100?
Assuming b=0.5 => K = V/sqrt(n) = 13
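The constant K above comes from the relation V = K * n^b between vocabulary size V and text length n, commonly known as Heaps' law; dividing by the square root of n corresponds to b = 0.5. The sketch below wraps that arithmetic in a hypothetical helper, `heaps_k`, and reuses the counts printed above. It is an illustration of the formula, not a fit of b to the data.

```python
import math

def heaps_k(num_types, num_instances, b=0.5):
    """Solve V = K * n**b for K, given V (types) and n (instances)."""
    return num_types / math.pow(num_instances, b)

# Counts copied from the output above for bible-kjv.txt
n = 1010654
V = 13769
k = heaps_k(V, n)
print("K = %.2f" % k)
print("K between 10 and 100: %s" % (10 <= k <= 100))
```

Changing `b` shows how sensitive K is to the assumed exponent, which is why the value of b should be stated together with K.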


# Segmenting words in running text (tokenization)

Split text into paragraphs using regular expressions

In [57]:
import re
# slice the string
part = raw[50:1301]
print(part)
# "\n\n" matches two consecutive newline characters, which often indicate a paragraph break in text files
paragraphs = re.split("\n\n", part)
print("\nParagraphs")
for paragraph in paragraphs:
  print("---")
  print(paragraph)


ing James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light
from the darkness.

1:5 And God called the light Day, and the darkness he called Night.
And the evening and the morning were the first day.

1:6 And God said, Let there be a firmament in the midst of the waters,
and let it divide the waters from the waters.

1:7 And God made the firmament, and divided the waters which were
under the firmament from the waters which were above the firmament:
and it was so.

1:8 And God called the firmament Heaven. And the evening and the
morning were the second day.

1:9 And God said, Let the waters under the heaven be gathered together
unto one place, and let th

Remove newline characters with regular expressions

In [59]:
print(paragraphs[2])
print("---")
print(re.sub("\n", " ", paragraphs[2]))


1:1 In the beginning God created the heaven and the earth.
---
 1:1 In the beginning God created the heaven and the earth.
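Splitting on exactly two newlines leaves a stray leading newline whenever three or more line breaks separate two blocks, as in `paragraphs[2]` above. A possible variant, sketched here on a small made-up string, splits on any run of blank lines and collapses the line breaks inside each paragraph. `split_paragraphs` and `toy_text` are hypothetical names.

```python
import re

toy_text = "Title line\n\nSubtitle line\n\n\n1:1 First verse.\n\n1:2 Second verse,\ncontinued here.\n"

def split_paragraphs(text):
    """Split on runs of blank lines and join wrapped lines with single spaces."""
    # \n\s*\n matches a newline, any further whitespace (including newlines), and a final newline
    blocks = re.split(r"\n\s*\n", text)
    # " ".join(block.split()) replaces every whitespace run, line breaks included, with one space
    return [" ".join(block.split()) for block in blocks if block.strip()]

toy_paragraphs = split_paragraphs(toy_text)
for paragraph in toy_paragraphs:
    print("---")
    print(paragraph)
```

Note that this variant also collapses double spaces inside a line, which may or may not be wanted depending on the task.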


Word tokenization with regular expressions

In [60]:
def word_tokenization_with_regex(text):
  "Split a text string into a list of words."
  # initiate the list of words and the start position of the current word
  words = []
  start = 0
  # loop over each character of the text together with its position
  for pos, char in enumerate(text):
    # if the character is a delimiter (punctuation mark or blank)
    if re.match('[,;. ]', char):
      # take the slice up to the previous character
      word = text[start: pos]
      # and add it to the list
      words.append(word)
      if char == ' ':
        # skip the blank: the next word starts right after it
        start = pos + 1
      else:
        # keep the punctuation mark: it becomes a token of its own
        start = pos
  # add whatever remains after the last delimiter (the final word or punctuation mark)
  if start < len(text):
    words.append(text[start:])
  return words

print(paragraphs[2])
words = word_tokenization_with_regex(paragraphs[2])
print("---")
for word in words:
  print(word)


1:1 In the beginning God created the heaven and the earth.
---

1:1
In
the
beginning
God
created
the
heaven
and
the
earth
.
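The character-by-character loop above can also be expressed as a single pattern. The sketch below is one possible formulation, not NLTK's tokenizer: `regex_tokenize` and `TOKEN_PATTERN` are hypothetical names, and the pattern keeps word characters joined by `:`, `'` or `-` together (so `1:1` stays one token) while emitting every other non-space character as its own token.

```python
import re

# a run of word characters, optionally continued across ':', "'" or '-',
# or else a single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:[:'-]\w+)*|[^\w\s]")

def regex_tokenize(text):
    """Return all tokens matched by TOKEN_PATTERN, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)

verse = "\n1:1 In the beginning God created the heaven and the earth."
for token in regex_tokenize(verse):
    print(token)
```

Because whitespace is never part of a match, the leading newline of the verse does not end up inside the first token. NLTK's `word_tokenize` applies many more rules, so the two can disagree on harder input.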


Word tokenization with NLTK

In [61]:
# same as before, using NLTK's built-in tokenizer
from nltk import word_tokenize

print(paragraphs[2])
words = word_tokenize(paragraphs[2])
print("---")
for word in words:
  print(word)


1:1 In the beginning God created the heaven and the earth.
---
1:1
In
the
beginning
God
created
the
heaven
and
the
earth
.


Punctuation removal

In [62]:
# 'punctuation' provides a predefined string containing all the characters commonly considered punctuation
from string import punctuation
print(punctuation)
print("---")
print(paragraphs[2])
print("---")
words = word_tokenize(paragraphs[2])
print(len(words))
print(words)
print("---")
words_without_punctuation = [word for word in words if word not in punctuation]
print(len(words_without_punctuation))
print(words_without_punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
---

1:1 In the beginning God created the heaven and the earth.
---
12
['1:1', 'In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']
---
11
['1:1', 'In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth']
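Because `punctuation` is a string, `word not in punctuation` is a substring test: a multi-character token is dropped only if it occurs verbatim inside that string. The sketch below contrasts this with a character-wise check on a made-up token list; `is_punctuation` and `toy_tokens` are hypothetical names.

```python
from string import punctuation

punctuation_chars = set(punctuation)

def is_punctuation(token):
    """True if the token is non-empty and every character is a punctuation mark."""
    return token != "" and all(char in punctuation_chars for char in token)

toy_tokens = ["1:1", "In", "the", "beginning", "...", ",-", "earth", "."]

# substring test: ',-' occurs inside the punctuation string, '...' does not
substring_filter = [token for token in toy_tokens if token not in punctuation]
# character-wise test: both ',-' and '...' consist only of punctuation marks
charwise_filter = [token for token in toy_tokens if not is_punctuation(token)]

print(substring_filter)
print(charwise_filter)
```

On the verse used above both filters give the same result, since its only punctuation token is a single `.`.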


# Normalization (pre LLM era)

Lowercasing is achieved with the `lower` method of strings. It should take place after tokenization, because tokenizers use capitalization as a cue for where to split a paragraph into sentences or a sentence into words.

In [16]:
print(paragraphs[0])
print("---")
from nltk import word_tokenize
words = word_tokenize(paragraphs[0])
words_lower = [word.lower() for word in words]
print(words_lower)

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
---
['emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.']
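One effect of lowercasing is that capitalized and lowercase variants of a word collapse into a single type, which shrinks the vocabulary. A small illustration on a made-up token list follows (`toy_tokens` is a hypothetical name).

```python
toy_tokens = ["The", "earth", "and", "the", "heaven", ".", "And", "the", "Earth", "."]

# distinct tokens as they appear in the text
types_before = set(toy_tokens)
# distinct tokens after lowercasing each one
types_after = {token.lower() for token in toy_tokens}

print("types before lowercasing: %d" % len(types_before))
print("types after lowercasing: %d" % len(types_after))
```

The same merging also erases distinctions that can matter, such as a proper noun versus a common noun, so whether to lowercase depends on the task.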


Stemming

In [17]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(paragraphs[0])
print("---")
for word in word_tokenize(paragraphs[0]):
  stem = ps.stem(word)
  if (word != stem):
    print("%s - %s" %(word, stem))

Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
---
Emma - emma
Woodhouse - woodhous
handsome - handsom
comfortable - comfort
happy - happi
disposition - disposit
seemed - seem
unite - unit
blessings - bless
existence - exist
lived - live
nearly - nearli
twenty-one - twenty-on
years - year
very - veri
little - littl


Lemmatization. One issue here is that NLTK's lemmatizer needs the part-of-speech tag of a word to work properly. We will discuss this in a future lecture. By default it treats every word as a noun ('n'). Adjectives, adverbs, nouns and verbs are selected with the constants 'a', 'r', 'n', 'v' respectively.

In [18]:
nltk.download('omw-1.4')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(paragraphs[0])
print("---")
for word in word_tokenize(paragraphs[0]):
  lemma = wnl.lemmatize(word)
  if (lemma != word):
    print("%s - %s" %(word, lemma))

print(wnl.lemmatize("lives"))
print(wnl.lemmatize("lives", "v"))
print(wnl.lemmatize("lives", "n"))

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.
---
blessings - blessing
years - year
life
live
life
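The three calls above differ only in the part-of-speech argument. The toy lookup below imitates that behaviour to make the role of the tag explicit. It is not how WordNet works internally: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the table is written by hand.

```python
# hand-written (word, part of speech) -> lemma table, for illustration only
TOY_LEMMAS = {
    ("lives", "n"): "life",
    ("lives", "v"): "live",
    ("lived", "v"): "live",
    ("years", "n"): "year",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged if the pair is unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("lives"))       # treated as a noun by default
print(toy_lemmatize("lives", "v"))  # treated as a verb
print(toy_lemmatize("lived"))       # no noun entry, so it is returned as is
print(toy_lemmatize("lived", "v"))
```

This also suggests why the loop above reports only nouns such as "blessings" and "years": verb forms like "lived" are looked up as nouns and come back unchanged.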


# Segmenting sentences in running text

Split paragraph into sentences using regular expressions

In [19]:
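def sentence_tokenization_with_regex(text):
  "Split a text string into a list of sentences."
  sentences = []
  start = 0
  for pos, char in enumerate(text):
    # every '!', '?' or '.' is treated as the end of a sentence
    if re.match('[!?.]', char):
      # include the punctuation mark in the sentence
      sentence = text[start: pos+1]
      sentences.append(sentence)
      start = pos + 1
  # keep trailing text that does not end with a punctuation mark
  if text[start:].strip():
    sentences.append(text[start:])
  return sentences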

print(paragraphs[2])
par_sents = sentence_tokenization_with_regex(paragraphs[2])
for sent in par_sents:
  print("---")
  print(sent)

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.
---
Sixteen years had Miss Taylor been in Mr.
---
 Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.
---
  Between _them_ it was more the intimacy
of sisters.
---
  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authorit
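The output above shows the weakness of splitting on every period: "Mr." ends a sentence prematurely. One simple mitigation, sketched below, is to skip periods that close a known abbreviation. The names `ABBREVIATIONS` and `split_sentences` are hypothetical, the abbreviation list is a tiny hand-picked sample, and cases such as ellipses or decimal numbers are not handled.

```python
import re

# hypothetical, far from complete list of abbreviations that end with a period
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after '!', '?' or '.', unless the period closes a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[!?.]", text):
        end = match.end()
        # the whitespace-delimited word that ends at this punctuation mark
        preceding = text[start:end].split()
        last_word = preceding[-1] if preceding else ""
        if last_word in abbreviations:
            continue
        sentences.append(text[start:end].strip())
        start = end
    # keep trailing text that has no final punctuation mark
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences

toy_paragraph = ("Sixteen years had Miss Taylor been in Mr. Woodhouse's family. "
                 "Between them it was more the intimacy of sisters.")
toy_sentences = split_sentences(toy_paragraph)
for sentence in toy_sentences:
    print("---")
    print(sentence)
```

A fixed list like this has to be maintained by hand, which is the gap that a trained sentence tokenizer is meant to close.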

Split paragraph into sentences using NLTK

In [20]:
print(paragraphs[2])
from nltk import sent_tokenize
par_sents = sent_tokenize(paragraphs[2])
for sent in par_sents:
  print("---")
  print(sent)

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
now long passed away, they had been living together as friend and
friend very mutually attached, and Emma doing just what she liked;
highly esteeming Miss Taylor's judgment, but directed chiefly by
her own.
---
Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.
---
Between _them_ it was more the intimacy
of sisters.
---
Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness of her temper had hardly allowed
her to impose any restraint; and the shadow of authority being
n