# Miniconda Installation

1. Go [here](https://docs.conda.io/en/latest/miniconda.html#latest-miniconda-installer-links) to download the Miniconda 3 Installer for your operating system
2. Once the download has finished run the installer
3. Follow the installation prompts
4. Update Miniconda3:
    * macOS: in the Terminal, run `conda update conda`
    * Windows: in the Anaconda Powershell Prompt (miniconda3), run `conda update conda`
5. Agree to the installation prompts

# Environment Setup


1. Create and name a new Python environment: `conda create -n introNLTK python`
2. Activate the environment: `conda activate introNLTK`
3. Install Jupyter Notebook via pip: `pip install notebook`
4. Run Jupyter Notebook: `jupyter notebook`
5. Quick tour of Jupyter Notebook
6. From the home page, find and click the `Quit` button to shut down the server

# NLTK Setup

1. Get back into your introNLTK environment: `conda activate introNLTK`
2. Install nltk: `pip install nltk`
3. Run Jupyter Notebook: `jupyter notebook`
4. Make a new notebook file: New > Python3
5. Add `nltk` to your current Python session: `import nltk`
6. Run the `nltk` downloader: `nltk.download()`. Eventually you should see this window:

![nltk-downloader-3.png](attachment:nltk-downloader-3.png)

7. Select `book` and click `Download`
8. When the download is done, close the window
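
Before moving on, it can help to confirm that the packages installed above are visible from the Python that is running your notebook. A minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not something provided by `nltk` or Jupyter:

```python
import importlib.util
import sys

def is_installed(package_name):
    """Return True if the named top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None

# Show which Python is running, then check the two packages installed above
print("Python executable:", sys.executable)
for name in ("nltk", "notebook"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, the notebook is probably not running inside the `introNLTK` environment.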

# Getting started



In [1]:
from nltk.book import * # import everything from the NLTK Book

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [97]:
text1 # let's start with Moby Dick

<Text: Moby Dick by Herman Melville 1851>

In [98]:
type(text1) # check the datatype of text1

nltk.text.Text

We can see what we can do with an `nltk` `Text` object by typing `text1.` and then hitting TAB

In [None]:
text1.

Use the up and down arrow keys on your keyboard to find the method `concordance` and select it, then type a question mark at the end. Once you are done, press Shift+Enter to run the cell and pull up the documentation for that method.

In [11]:
text1.concordance?

A `concordance` view displays every occurrence of a particular word. Let's try an obvious one first: whale

In [12]:
text1.concordance("whale")

Displaying 25 of 1226 matches:
s , and to teach them by what name a whale - fish is to be called in our tongue
t which is not true ." -- HACKLUYT " WHALE . ... Sw . and Dan . HVAL . This ani
ulted ." -- WEBSTER ' S DICTIONARY " WHALE . ... It is more immediately from th
ISH . WAL , DUTCH . HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALE
HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALEINE , FRENCH . BALLE
least , take the higgledy - piggledy whale statements , however authentic , in 
 dreadful gulf of this monster ' s ( whale ' s ) mouth , are immediately lost a
 patient Job ." -- RABELAIS . " This whale ' s liver was two cartloads ." -- ST
 Touching that monstrous bulk of the whale or ork we have received nothing cert
 of oil will be extracted out of one whale ." -- IBID . " HISTORY OF LIFE AND D
ise ." -- KING HENRY . " Very like a whale ." -- HAMLET . " Which to secure , n
restless paine , Like as the wounded whale to shore flies thro ' the maine ." -
. OF SPER

In [6]:
text1.concordance("monstrous") # a different word

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
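
To make the idea of a concordance concrete, here is a rough sketch of the same technique in plain Python, run on a made-up token list. It is not how `nltk` implements `concordance`; `simple_concordance`, `width`, and `sample_tokens` are names invented for this illustration. Like the output above, it matches regardless of capitalization.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context on either side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# A tiny made-up sample, already split into tokens
sample_tokens = "Very like a whale . The whale swam away".split()
for line in simple_concordance(sample_tokens, "whale"):
    print(line)
# Very like a [whale] . The whale
# whale . The [whale] swam away
```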


We can also ask `nltk` to show us other words which appear in the same contexts as a particular word. It's not so interesting to do this with "whale", but let's try "monstrous".

In [14]:
text1.similar("monstrous")

true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
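
One way to think about "appearing in the same contexts" is to treat the words immediately before and after a word as its context, then look for other words that share those contexts. The sketch below works on a made-up token list and is only an approximation of the idea, not `nltk`'s actual implementation; `contexts_by_word` and `similar_words` are hypothetical names.

```python
from collections import Counter, defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    lowered = [t.lower() for t in tokens]
    contexts = defaultdict(set)
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return contexts

def similar_words(tokens, word):
    """Rank other words by how many contexts they share with `word`."""
    contexts = contexts_by_word(tokens)
    target = contexts.get(word.lower(), set())
    scores = Counter()
    for other, ctx in contexts.items():
        shared = len(ctx & target)
        if other != word.lower() and shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common()]

# Made-up sample: "monstrous" and "curious" both occur between "a" and "size"/"fable"
sample = "a monstrous size . a curious size . a monstrous fable . a curious fable . a small boat".split()
print(similar_words(sample, "monstrous"))  # ['curious']
```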


# Gathering data

Let's say we want to see how long a particular text is. One approach would be to use Python's `len` function.

In [99]:
len(text1)

260819

So Moby Dick has about 260k "tokens." A token is a sequence of characters treated as a single unit, such as a word or a punctuation mark. To get a slightly more usable picture of our corpus we can use `set()`, a Python built-in that builds an unordered collection of unique elements (i.e. it will remove all duplicate tokens).

In [100]:
set(text1)

{'basso',
 'residue',
 'rifles',
 'intelligence',
 'grimness',
 'loveliest',
 'kindle',
 'northwards',
 'hooking',
 'rehearsed',
 'scrupulous',
 'piping',
 'Fernandes',
 'intercedings',
 'terse',
 'fires',
 'sites',
 'completed',
 'Deck',
 'bushy',
 'defyingly',
 'imagines',
 'everyway',
 'geometry',
 'fumbled',
 'Forecastle',
 'stomachs',
 'grip',
 'forks',
 'simply',
 'Phrenologist',
 'chest',
 'varying',
 'Coopman',
 'trading',
 'coach',
 'dancing',
 'blind',
 'Manillas',
 'pond',
 'Duke',
 'acquaintances',
 'visits',
 'floor',
 'juicy',
 'position',
 'Constantinople',
 'HE',
 'surging',
 'PREVIOUS',
 'tackle',
 'barnacled',
 'incorrigible',
 'Physeter',
 'Abraham',
 'enabled',
 'woodpecker',
 'unenervated',
 'magistrate',
 'braver',
 'keeling',
 'sleeper',
 'religious',
 'controversies',
 'troughs',
 'inactive',
 'dough',
 'sloping',
 'leaping',
 'hideous',
 'SIZED',
 'waning',
 'leans',
 'restore',
 'nigh',
 'undiluted',
 'OR',
 'SONG',
 'wines',
 'knows',
 'Canals',
 'Halloo',
 '
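
The difference between counting tokens and counting unique tokens is easier to see on a small example. This sketch uses a made-up list (`sample_tokens`) rather than `text1`:

```python
# Nine tokens, including repeated words and punctuation
sample_tokens = ["the", "whale", ",", "the", "White", "Whale", ",", "the", "whale"]

total_tokens = len(sample_tokens)   # every token counts, repeats included
unique_tokens = set(sample_tokens)  # duplicates collapse to a single entry

print(total_tokens)        # 9
print(len(unique_tokens))  # 5
```

Note that `'Whale'` and `'whale'` are counted as two different tokens, because `set()` compares strings exactly, capitalization included.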

Python's built-in function `sorted()` returns the same elements as an ordered list, which makes the output above easier to scan.

In [101]:
sorted(set(text1))

['!',
 '!"',
 '!"--',
 "!'",
 '!\'"',
 '!)',
 '!)"',
 '!*',
 '!--',
 '!--"',
 "!--'",
 '"',
 '"\'',
 '"--',
 '"...',
 '";',
 '$',
 '&',
 "'",
 "',",
 "',--",
 "'-",
 "'--",
 "';",
 '(',
 ')',
 '),',
 ')--',
 ').',
 ').--',
 '):',
 ');',
 ');--',
 '*',
 ',',
 ',"',
 ',"--',
 ",'",
 ",'--",
 ',)',
 ',*',
 ',--',
 ',--"',
 ",--'",
 '-',
 '--',
 '--"',
 "--'",
 '--\'"',
 '--(',
 '---"',
 '---,',
 '.',
 '."',
 '."*',
 '."--',
 ".'",
 '.\'"',
 '.)',
 '.*',
 '.*--',
 '.,',
 '.--',
 '.--"',
 '...',
 '....',
 '.]',
 '000',
 '1',
 '10',
 '100',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '11',
 '110',
 '111',
 '112',
 '113',
 '114',
 '115',
 '116',
 '117',
 '118',
 '119',
 '12',
 '120',
 '121',
 '122',
 '123',
 '124',
 '125',
 '126',
 '127',
 '128',
 '129',
 '13',
 '130',
 '131',
 '132',
 '133',
 '134',
 '135',
 '14',
 '144',
 '1492',
 '15',
 '150',
 '15th',
 '16',
 '1652',
 '1668',
 '1671',
 '1690',
 '1695',
 '16th',
 '17',
 '1726',
 '1729',
 '1750',
 '1772',
 '1775

So `sorted()` helps us see a lot of data in our text that we may or may not want to pay attention to. For example, there are lots of sets of symbols (including punctuation) and lots of sets of numbers. If we wanted to count the number of **unique words** in Moby Dick we definitely do not want symbols or punctuation and we may or may not want numbers.
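
The reason symbols and numbers come first is that `sorted()` compares strings character by character using their Unicode code points, which `ord()` reveals. A small sketch on made-up tokens:

```python
sample = ["whale", "Whale", "1851", "!", "[", "_", "AHAB"]
ordered = sorted(sample)
print(ordered)
# ['!', '1851', 'AHAB', 'Whale', '[', '_', 'whale']

# Show the code point of each token's first character
for token in ordered:
    print(repr(token), ord(token[0]))
```

Most punctuation and all digits have lower code points than letters, and every uppercase letter has a lower code point than every lowercase letter. A few symbols, such as `[` and `_`, fall between the uppercase and lowercase ranges.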

In the next sequence we will try to get a cleaner (i.e. sanitized) data set. We assume that, for our purposes, we do not care about sets of symbols (punctuation) or numbers, so we want an easy way to eliminate them from our list. We run the previous line of code once more but store the result in a variable (`x`). We then use a list comprehension and `enumerate()` to pair each token with its position, which helps us figure out where our unique word list actually starts.

In [102]:
x = sorted(set(text1))
indexed = [[num, token] for num,token in enumerate(x)]

In [103]:
indexed # let's check our work

[[0, '!'],
 [1, '!"'],
 [2, '!"--'],
 [3, "!'"],
 [4, '!\'"'],
 [5, '!)'],
 [6, '!)"'],
 [7, '!*'],
 [8, '!--'],
 [9, '!--"'],
 [10, "!--'"],
 [11, '"'],
 [12, '"\''],
 [13, '"--'],
 [14, '"...'],
 [15, '";'],
 [16, '$'],
 [17, '&'],
 [18, "'"],
 [19, "',"],
 [20, "',--"],
 [21, "'-"],
 [22, "'--"],
 [23, "';"],
 [24, '('],
 [25, ')'],
 [26, '),'],
 [27, ')--'],
 [28, ').'],
 [29, ').--'],
 [30, '):'],
 [31, ');'],
 [32, ');--'],
 [33, '*'],
 [34, ','],
 [35, ',"'],
 [36, ',"--'],
 [37, ",'"],
 [38, ",'--"],
 [39, ',)'],
 [40, ',*'],
 [41, ',--'],
 [42, ',--"'],
 [43, ",--'"],
 [44, '-'],
 [45, '--'],
 [46, '--"'],
 [47, "--'"],
 [48, '--\'"'],
 [49, '--('],
 [50, '---"'],
 [51, '---,'],
 [52, '.'],
 [53, '."'],
 [54, '."*'],
 [55, '."--'],
 [56, ".'"],
 [57, '.\'"'],
 [58, '.)'],
 [59, '.*'],
 [60, '.*--'],
 [61, '.,'],
 [62, '.--'],
 [63, '.--"'],
 [64, '...'],
 [65, '....'],
 [66, '.]'],
 [67, '000'],
 [68, '1'],
 [69, '10'],
 [70, '100'],
 [71, '101'],
 [72, '102'],
 [73, '103'],
 

In the full output, the first actual word appears at index 279, so we should be able to slice our `indexed` list at 279 and throw out all of the symbols and numbers that come before it. Slicing lists in Python is stupid easy.

In [104]:
indexed = indexed[279:]

In [105]:
indexed # let's check our work

[[279, 'A'],
 [280, 'ABOUT'],
 [281, 'ACCOUNT'],
 [282, 'ADDITIONAL'],
 [283, 'ADVANCING'],
 [284, 'ADVENTURES'],
 [285, 'AFFGHANISTAN'],
 [286, 'AFRICA'],
 [287, 'AFTER'],
 [288, 'AGAINST'],
 [289, 'AHAB'],
 [290, 'ALFRED'],
 [291, 'ALGERINE'],
 [292, 'ALIVE'],
 [293, 'ALL'],
 [294, 'ALONE'],
 [295, 'AM'],
 [296, 'AMERICA'],
 [297, 'AMONG'],
 [298, 'ANCHORS'],
 [299, 'AND'],
 [300, 'ANGLO'],
 [301, 'ANIMAL'],
 [302, 'ANNALS'],
 [303, 'ANNUS'],
 [304, 'ANOTHER'],
 [305, 'ANY'],
 [306, 'APOLOGY'],
 [307, 'APPLICATION'],
 [308, 'APPROACHING'],
 [309, 'ARCTIC'],
 [310, 'ARE'],
 [311, 'AROUND'],
 [312, 'AS'],
 [313, 'ASCENDING'],
 [314, 'ASIA'],
 [315, 'ASIDE'],
 [316, 'ASPECT'],
 [317, 'AT'],
 [318, 'ATTACK'],
 [319, 'ATTACKED'],
 [320, 'ATTITUDES'],
 [321, 'AUGUST'],
 [322, 'AUTHOR'],
 [323, 'AZORE'],
 [324, 'Abashed'],
 [325, 'Abednego'],
 [326, 'Abel'],
 [327, 'Abjectus'],
 [328, 'Aboard'],
 [329, 'Abominable'],
 [330, 'About'],
 [331, 'Above'],
 [332, 'Abraham'],
 [333, 'Academy'],
 [

In [106]:
len(indexed)

19038
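
The cut point of 279 above was found by reading the output. As an alternative, here is a small sketch of finding such an index in code, using a made-up sorted list; `sample_sorted` and `first_word_index` are illustrative names and not part of the original notebook.

```python
# A made-up, already sorted list: symbols first, then numbers, then words
sample_sorted = sorted({"!", "--", "1851", "16th", "A", "AHAB", "Abel", "whale"})

# Find the position of the first token that starts with a letter
first_word_index = next(
    i for i, token in enumerate(sample_sorted) if token[:1].isalpha()
)

print(first_word_index)                  # 4
print(sample_sorted[first_word_index:])  # ['A', 'AHAB', 'Abel', 'whale']
```

This only finds where the leading block of symbols and numbers ends. Any symbols that sort later in the list are not removed by a single slice.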

We are getting closer, but you may notice that some words begin with a capital letter, some are capitalized entirely, and some are lowercase. This raises the question: are there duplicate words that `set` did not catch because their capitalization is different? Hard to say. I suppose we could check manually, but that would take a long time and be a bummer, right?

Fortunately Python has a way to help us with this. Let's start by getting rid of the indices in our two-dimensional `list`. We only added them so we could identify where words started in our sorted list, and we don't need that information anymore.

In [107]:
words = [] # declare an empty list called words

for num, token in indexed:
    print(token) # print every word without its index, for funsies
    words.append(token.lower()) # convert every word to lowercase and add it to our new list


A
ABOUT
ACCOUNT
ADDITIONAL
ADVANCING
ADVENTURES
AFFGHANISTAN
AFRICA
AFTER
AGAINST
AHAB
ALFRED
ALGERINE
ALIVE
ALL
ALONE
AM
AMERICA
AMONG
ANCHORS
AND
ANGLO
ANIMAL
ANNALS
ANNUS
ANOTHER
ANY
APOLOGY
APPLICATION
APPROACHING
ARCTIC
ARE
AROUND
AS
ASCENDING
ASIA
ASIDE
ASPECT
AT
ATTACK
ATTACKED
ATTITUDES
AUGUST
AUTHOR
AZORE
Abashed
Abednego
Abel
Abjectus
Aboard
Abominable
About
Above
Abraham
Academy
Accessory
According
Accordingly
Accursed
Achilles
Actium
Acushnet
Adam
Adieu
Adios
Admiral
Admirals
Advance
Advancement
Adventures
Adverse
Advocate
Affected
Affidavit
Affrighted
Afric
Africa
African
Africans
Aft
After
Afterwards
Again
Against
Agassiz
Ages
Ah
Ahab
Ahabs
Ahasuerus
Ahaz
Ahoy
Ain
Air
Akin
Alabama
Aladdin
Alarmed
Alas
Albatross
Albemarle
Albert
Albicore
Albino
Aldrovandi
Aldrovandus
Alexander
Alexanders
Alfred
Algerine
Algiers
Alike
Alive
All
Alleghanian
Alleghanies
Alley
Almanack
Almighty
Almost
Aloft
Alone
Alps
Already
Also
Am
Ambergriese
Ambergris
Amelia
America
American
Americans
Amer

In [108]:
words # checking our work

['a',
 'about',
 'account',
 'additional',
 'advancing',
 'adventures',
 'affghanistan',
 'africa',
 'after',
 'against',
 'ahab',
 'alfred',
 'algerine',
 'alive',
 'all',
 'alone',
 'am',
 'america',
 'among',
 'anchors',
 'and',
 'anglo',
 'animal',
 'annals',
 'annus',
 'another',
 'any',
 'apology',
 'application',
 'approaching',
 'arctic',
 'are',
 'around',
 'as',
 'ascending',
 'asia',
 'aside',
 'aspect',
 'at',
 'attack',
 'attacked',
 'attitudes',
 'august',
 'author',
 'azore',
 'abashed',
 'abednego',
 'abel',
 'abjectus',
 'aboard',
 'abominable',
 'about',
 'above',
 'abraham',
 'academy',
 'accessory',
 'according',
 'accordingly',
 'accursed',
 'achilles',
 'actium',
 'acushnet',
 'adam',
 'adieu',
 'adios',
 'admiral',
 'admirals',
 'advance',
 'advancement',
 'adventures',
 'adverse',
 'advocate',
 'affected',
 'affidavit',
 'affrighted',
 'afric',
 'africa',
 'african',
 'africans',
 'aft',
 'after',
 'afterwards',
 'again',
 'against',
 'agassiz',
 'ages',
 'ah',


Remember, I am only guessing that capitalization matters, so let's see if I am correct. We ask Python whether the length of `words` (every word converted to lowercase, with no other processing) is equal to the length of the same list **after** it has been run through `set()` (remember: `set()` eliminates duplicates). My assumption is that the former has duplicates and the latter does not. If I am correct, the two lengths will not be equal and the boolean comparison will return `False`.

In [109]:
len(words) == len(set(words))

False
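
The same check can be tried on a handful of made-up words. This sketch also uses `collections.Counter` from the standard library to list which words are duplicated, a step the notebook does by eye; `sample_words` is an invented example.

```python
from collections import Counter

sample_words = ["About", "ABOUT", "about", "Ahab", "AHAB", "aback"]
lowered = [w.lower() for w in sample_words]

# If lowercasing created duplicates, the set is shorter than the list
has_duplicates = len(lowered) != len(set(lowered))
print(has_duplicates)  # True

# Count each lowercased word and keep the ones seen more than once
counts = Counter(lowered)
duplicates = sorted(w for w, n in counts.items() if n > 1)
print(duplicates)  # ['about', 'ahab']
```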

Okay, so it looks like I was right. Let's see if we can spot some duplicates by comparing a sorted copy of `words` with a sorted copy of `set(words)`.

In [110]:
dupes = sorted(words)
unique = sorted(set(words))

In [111]:
dupes

['[',
 ']',
 '_____________',
 'a',
 'a',
 'aback',
 'abaft',
 'abandon',
 'abandoned',
 'abandonedly',
 'abandonment',
 'abased',
 'abasement',
 'abashed',
 'abashed',
 'abate',
 'abated',
 'abatement',
 'abating',
 'abbreviate',
 'abbreviation',
 'abeam',
 'abed',
 'abednego',
 'abel',
 'abhorred',
 'abhorrence',
 'abhorrent',
 'abhorring',
 'abide',
 'abided',
 'abiding',
 'ability',
 'abjectly',
 'abjectus',
 'able',
 'ablutions',
 'aboard',
 'aboard',
 'abode',
 'abominable',
 'abominable',
 'abominate',
 'abominated',
 'abomination',
 'aboriginal',
 'aboriginally',
 'aboriginalness',
 'abortion',
 'abortions',
 'abound',
 'abounded',
 'abounding',
 'aboundingly',
 'about',
 'about',
 'about',
 'above',
 'above',
 'abraham',
 'abreast',
 'abridged',
 'abroad',
 'abruptly',
 'absence',
 'absent',
 'absolute',
 'absolutely',
 'absorbed',
 'absorbing',
 'absorbingly',
 'abstained',
 'abstemious',
 'abstinence',
 'abstract',
 'abstracted',
 'abstraction',
 'absurd',
 'absurdly',
 'abu

In [112]:
unique

['[',
 ']',
 '_____________',
 'a',
 'aback',
 'abaft',
 'abandon',
 'abandoned',
 'abandonedly',
 'abandonment',
 'abased',
 'abasement',
 'abashed',
 'abate',
 'abated',
 'abatement',
 'abating',
 'abbreviate',
 'abbreviation',
 'abeam',
 'abed',
 'abednego',
 'abel',
 'abhorred',
 'abhorrence',
 'abhorrent',
 'abhorring',
 'abide',
 'abided',
 'abiding',
 'ability',
 'abjectly',
 'abjectus',
 'able',
 'ablutions',
 'aboard',
 'abode',
 'abominable',
 'abominate',
 'abominated',
 'abomination',
 'aboriginal',
 'aboriginally',
 'aboriginalness',
 'abortion',
 'abortions',
 'abound',
 'abounded',
 'abounding',
 'aboundingly',
 'about',
 'above',
 'abraham',
 'abreast',
 'abridged',
 'abroad',
 'abruptly',
 'absence',
 'absent',
 'absolute',
 'absolutely',
 'absorbed',
 'absorbing',
 'absorbingly',
 'abstained',
 'abstemious',
 'abstinence',
 'abstract',
 'abstracted',
 'abstraction',
 'absurd',
 'absurdly',
 'abundance',
 'abundant',
 'abundantly',
 'academy',
 'accelerate',
 'accelera

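Cool, if you scroll through `dupes` and `unique` you can see that I was in fact right, but another issue has also crept in: some symbols are back. They were never actually removed. `[`, `]` and `_` sort after the uppercase letters and before the lowercase ones, so they sat in the middle of the list after our slice at 279 and moved to the front once everything was lowercased and re-sorted. No worries, it's pretty easy to eliminate those with list slicing...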

In [113]:
unique = unique[3:]

In [114]:
unique

['a',
 'aback',
 'abaft',
 'abandon',
 'abandoned',
 'abandonedly',
 'abandonment',
 'abased',
 'abasement',
 'abashed',
 'abate',
 'abated',
 'abatement',
 'abating',
 'abbreviate',
 'abbreviation',
 'abeam',
 'abed',
 'abednego',
 'abel',
 'abhorred',
 'abhorrence',
 'abhorrent',
 'abhorring',
 'abide',
 'abided',
 'abiding',
 'ability',
 'abjectly',
 'abjectus',
 'able',
 'ablutions',
 'aboard',
 'abode',
 'abominable',
 'abominate',
 'abominated',
 'abomination',
 'aboriginal',
 'aboriginally',
 'aboriginalness',
 'abortion',
 'abortions',
 'abound',
 'abounded',
 'abounding',
 'aboundingly',
 'about',
 'above',
 'abraham',
 'abreast',
 'abridged',
 'abroad',
 'abruptly',
 'absence',
 'absent',
 'absolute',
 'absolutely',
 'absorbed',
 'absorbing',
 'absorbingly',
 'abstained',
 'abstemious',
 'abstinence',
 'abstract',
 'abstracted',
 'abstraction',
 'absurd',
 'absurdly',
 'abundance',
 'abundant',
 'abundantly',
 'academy',
 'accelerate',
 'accelerated',
 'accelerating',
 'accep

Okay, so now let's see how many unique words are in Moby Dick

In [115]:
len(unique)

16951

A little less than 17k. Cool.
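
The steps above (drop symbols and numbers, lowercase, de-duplicate, sort) can also be condensed into one helper. This is a sketch of a different technique from the manual slicing used in this notebook: it filters with `str.isalpha()`, which keeps only tokens made entirely of letters, so on `text1` it would not necessarily give exactly the same count. `unique_words` and `sample_tokens` are hypothetical names.

```python
def unique_words(tokens):
    """Return the sorted set of lowercased tokens made up only of letters."""
    return sorted({t.lower() for t in tokens if t.isalpha()})

# Made-up sample with punctuation, a number, and a capitalization duplicate
sample_tokens = ["Call", "me", "Ishmael", ".", "CALL", "[", "1851", "me", "--"]
vocabulary = unique_words(sample_tokens)

print(vocabulary)       # ['call', 'ishmael', 'me']
print(len(vocabulary))  # 3
```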

As you can see we have to make decisions about what does and does not matter to us when we are analyzing a text. Though I suspect there are still some problems in the output that may impact the results of my analysis (for example: is the word "chapter" used to denote the beginning of a chapter? Is Melville's name in my output? etc.) I am now much closer to accurately answering the question "how many unique words are in Moby Dick?"

There are tradeoffs, though. Because I spent all that time eliminating duplicate words, I would have to start over from the original text in order to answer different questions that I might use `nltk` for. Most notably, I cannot use `FreqDist` on this de-duplicated list to see what the ten most frequently used words in Moby Dick are, since every word now appears exactly once.
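
Frequency questions need the duplicates kept, so they have to start again from the full token list. Here is a minimal sketch on a made-up sentence, using `collections.Counter` from the standard library as a stand-in; `nltk`'s `FreqDist` offers a similar `most_common()` method.

```python
from collections import Counter

# Made-up sample: duplicates are kept on purpose, because they are what we count
sample_tokens = "the whale and the sea and the ship".split()

# Lowercase and drop non-alphabetic tokens, but do NOT de-duplicate
counts = Counter(t.lower() for t in sample_tokens if t.isalpha())

top_two = counts.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]
```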