# Dictionaries and Tuples - Lab 4, Python for Text, Fall 2020

---

To be completed and submitted by end of the day on Monday, September 28.

The questions you need to answer are marked with **QUESTION**. For each one, there's a space (under **ANSWER**) for you to add your answer, which might be text, might be code, or might be a mix of the two. 



### Have questions?

1. Use the Canvas discussion boards.

## Reminder

You will need to submit your own notebook (or rather, a link to your own notebook). There are (at least) two ways to do this:

1. Make a copy of this notebook, rename it, and add new code cells when you want to write your own code.
1. Create a new Python 3 notebook using the `File` menu and type in all cells yourself. 

Either way, your file should be named using this format: `LastName_Py4Text_Lab4.ipynb`

# **Overview: Analyzing Text with Word Frequencies**

This week we will look at using word frequencies to analyze text. Frequency analysis is a cornerstone of text analytics, across fields from literary analysis to forensic linguistics to market research. For example, methods such as topic modeling, word embeddings, and sentiment analysis incorporate word frequencies as essential components. In addition, text analysis programs (e.g. NVivo or MAXQDA) generally feature word frequency analysis. This week you will learn how to use Python to code your own frequency analysis system. Building your own system of course takes more time than simply running an existing tool, but it comes with the great advantage that you can customize it to do exactly what you want.

Along the way to building this system, you'll get practice working with two new **data structures** in Python: dictionaries and tuples. You'll also start working with literary data from Project Gutenberg.

Things you should be able to do at the end of this lab:

* Read in literary texts (from the Gutenberg Corpus) from the NLTK, using the corpus reader methods introduced in Lab 3
* Download additional texts from the Project Gutenberg website and read those texts into Python
* Understand the difference between **types** and **tokens** and what it means to count each of these
* Understand why **case normalization** is important
* Understand and be able to use two new data structures: **dictionaries** and **tuples**
* Learn how to use NLTK's `FreqDist` class to do frequency analysis of texts
* Print results to a file using the `print` function




# 0. Preliminaries

To get started, import the NLTK and download the book data.

In [None]:
import nltk
nltk.download('book')  ## the 'book' download includes some texts from Project Gutenberg 
from nltk.book import *


[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    

# 1. Literature as text: Project Gutenberg

With this week's lab, we move into the use of literature as text. Our focus is on any works of fiction. Some people have strong opinions about what counts as *literature* - you can decide for yourselves whether you are one of those people. Despite having a B.A. in English literature, I am *not* one of those people - for the purposes of this class, we'll consider just about any fictional texts as relevant.

So why do we care about analyzing fictional texts? The answer to this depends on the point of view of the analyst. Some people study literature for literary purposes. Some studies have used text analysis for **authorship attribution** - determining the author(s) of previously anonymous texts. This type of analysis may also take place in forensic linguistics. One thing is certain - fictional texts have different properties than news texts. Imagine that you're given a randomly-selected paragraph of text, one with at least 5-6 sentences. Chances are quite high that you will be able to accurately determine whether that paragraph comes from news text or fiction. How do we make this determination? 

Let's try it. I've nearly-randomly selected paragraphs from two different texts, one from the news genre and one from the literary genre.

```
Paragraph One:
Children have the strangest adventures without being troubled by them.
For instance, they may remember to mention, a week after the event
happened, that when they were in the wood they had met their dead
father and had a game with him. It was in this casual way that Wendy one
morning made a disquieting revelation. Some leaves of a tree had been
found on the nursery floor, which certainly were not there when the
children went to bed, and Mrs. Darling was puzzling over them when Wendy
said with a tolerant smile:

“I do believe it is that Peter again!”
```

```
Paragraph Two:
Over the past two months, orcas have damaged about a dozen pleasure boats 
off the Iberian Peninsula from the Strait of Gibraltar to the coast of Galicia,
the most northerly point in Spain, baffling marine biologists and sailors.
Although there have been no reports of injuries — at least for humans — 
scientists and the Spanish authorities have struggled to interpret the 
interactions.
Were they attacks? Or just friendly encounters from a highly playful mammal 
that went a little too far?
```

### **QUESTION:**

> a. Which paragraph is news, and which is fiction?

> b. For each genre, describe at least 2-3 cues that helped you recognize the text type of the paragraph. Feel free to look at other examples of texts from these genres if you feel it would be helpful.

### **ANSWER:**

> I think paragraph one is fiction and paragraph two is news. Paragraph two reads as more informative, while paragraph one has characteristics typical of a novel or other work of fiction. The quotation marks around dialogue are a good clue for fiction, although news articles can introduce quotes this way as well. The descriptive storytelling, with its adjectives and proper nouns, is another clue for fiction. The news text is strictly informative: it first introduces the subject being reported on (orcas), then gives brief background on the location and the issue (damage to boats). Rhetorical questions like the ones at the end are also common in news articles.

Different text types use language differently, and this is important not only for manual analysis but also for automated analysis. For example, many tools for automated analysis of language are trained on large amounts of data - some examples are automatic part-of-speech taggers, text generation systems, and machine translation systems. Systems that are trained on only news data might do an excellent job of labeling other news data, but their performance gets noticeably worse when they are applied to text from other genres.

In machine learning, we talk about this as a matter of **domain shift** or **domain effects**, and many researchers have worked on **domain adaptation** methods to address the problem.

For this lab, we will stay within the domain of fiction. One problem that arises with fiction is the question of copyright, since so many fictional works are sold for profit, and are not considered free for re-distribution in the way that many (but not all) news texts are. 

Enter **Project Gutenberg**. Project Gutenberg is a large online collection of free ebooks. Project Gutenberg focuses on texts that are no longer subject to copyright protection, which means there's a lot of old literature available.
The [Project Gutenberg website](https://www.gutenberg.org/) hosts more than 60,000 freely-available books, in more than 60 different languages (mostly in English, though). We'll do more with the website in the next chunk of this lab. For now, we'll look at the Project Gutenberg texts available in the NLTK. (If you'd like to see more examples of how to use the `gutenberg` corpus within NLTK, read section 1.1 of Chapter 2 of the NLTK book: http://www.nltk.org/book/ch02.html.)

Remember the corpus readers introduced in the previous lab? We'll make use of these here too. As a refresher:

* `words()` -- returns the corpus text as a list of words
* `sents()` -- returns the corpus text as a list of sentences. Sometimes each sentence is in the form of a list of words, and sometimes each sentence is a string.
* `paras()` -- returns the corpus text as a list of paragraphs
* `raw()` -- returns the corpus text as one long string

In [None]:
### import the corpus directly
from nltk.corpus import gutenberg

### see a list of fileids
for text in gutenberg.fileids():
    print(text)

### how many texts?
print(len(gutenberg.fileids()))

### store the data in different ways, using the built-in corpus readers
guten_words = gutenberg.words() 
guten_sents = gutenberg.sents()
guten_paras = gutenberg.paras()
guten_raw = gutenberg.raw()

### how many words in these texts?
print(len(guten_words))



austen-emma.txt
austen-persuasion.txt
austen-sense.txt
bible-kjv.txt
blake-poems.txt
bryant-stories.txt
burgess-busterbrown.txt
carroll-alice.txt
chesterton-ball.txt
chesterton-brown.txt
chesterton-thursday.txt
edgeworth-parents.txt
melville-moby_dick.txt
milton-paradise.txt
shakespeare-caesar.txt
shakespeare-hamlet.txt
shakespeare-macbeth.txt
whitman-leaves.txt
18
2621613



Now we'll choose one text to work with. I'm going to use "Alice in Wonderland" - choose one of your own and try out each of the following steps on your own text. (Show these steps in new code blocks.)

We'll first store different versions of the text by using the different corpus readers. Notice that I've created a variable to store the fileid for "Alice in Wonderland" - this helps our code in several ways:

* it saves us having to type the filename over and over again
* it reduces the likelihood of typos (and makes them easier to fix)
* it makes the code easily reusable - we can just change the value stored in the variable and run the code again for a different text (instead of writing new code to do the same operations for our new text)










In [None]:
### variable for the fileid
alice = 'carroll-alice.txt'

### different versions of the text
alice_raw = gutenberg.raw(alice)
alice_words = gutenberg.words(alice)
alice_sents = gutenberg.sents(alice)
alice_paras = gutenberg.paras(alice)

In [None]:
macbeth = 'shakespeare-macbeth.txt'
macbeth_raw = gutenberg.raw(macbeth)
macbeth_words = gutenberg.words(macbeth)
macbeth_sents = gutenberg.sents(macbeth)
macbeth_paras = gutenberg.paras(macbeth)

### **QUESTION:**

> Review of data types: For each of the variables listed below, look at the first one or two elements. Based on what you see, how is the data structured for that variable? For example, if I look at the first five elements of `alice_words`, using a slice (`alice_words[:5]`), I get the following output: 


In [None]:
print(alice_words[:5])

['[', 'Alice', "'", 's', 'Adventures']


In [None]:
print(alice_raw[:45])
print(alice_sents[:5])
print(alice_paras[:3])

[Alice's Adventures in Wonderland by Lewis Ca
[['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']'], ['CHAPTER', 'I', '.'], ['Down', 'the', 'Rabbit', '-', 'Hole'], ['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without', 'pictures', 'or', 'conversation', "?'"], ['So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',', 'for', 'the', 'hot', 'day', 'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid', '),', 'whether', 'the', 'pleasure', 'of', 'making', 'a', 'daisy', '-', 'chain', 'would',

> We can see by the fact that the output starts and ends with a square bracket that it's a **list**, and see that each element in the list is a **string**, so `alice_words` is a list of strings. (We also see that mostly the strings are words, though broken apart in some ways - we'll talk more about this process later in the semester.)

> Your task: how is the data structured for:

* a. `alice_raw`
* b. `alice_sents`
* c. `alice_paras`

### **ANSWER:**

> a. `alice_raw` is a single **string** containing the full text of the file; it isn't tokenized at all.

> b. `alice_sents` is a **list of lists**: each element is one sentence, and each sentence is a list of word strings.

> c. `alice_paras` is similar to `alice_sents` but with one more level of nesting: a list of paragraphs, where each paragraph is a list of sentences and each sentence is a list of word strings.

-----

# 2. Counting words, and building your own corpus in NLTK


## **Counting words in a corpus**

> *How many words are in my corpus?*

Although it seems like a simple question on the surface, there are many different ways to answer this question. In text analysis, we often distinguish between the **type** count and the **token** count.

Types are distinct words - so *cat* and *lion* are different word types. When we count types, we're counting how many different words appear in a corpus. 

Tokens are individual occurrences of words, so in a sentence like *Cats like to chase cats* we would count 5 tokens and, treating *Cats* and *cats* as the same word, 4 types.
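
The count of 4 types for the example sentence depends on treating *Cats* and *cats* as the same word. Here is a minimal sketch of both ways of counting, using a hand-built word list (a stand-in for a real tokenized text):

```python
# A toy sentence, already split into words
words = ["Cats", "like", "to", "chase", "cats"]

# Tokens: every occurrence counts
token_count = len(words)

# Types: distinct strings only; "Cats" and "cats" are different strings
type_count_raw = len(set(words))

# Types after lower-casing: "Cats" and "cats" collapse into one type
type_count_lower = len(set(w.lower() for w in words))

print(token_count, type_count_raw, type_count_lower)
```

The token count is the same either way; only the type count depends on how we handle case.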

**READING:** Read sections 2.2 and 2.3 of chapter 2 in Speech and Language Processing (3rd edition), available here: https://web.stanford.edu/~jurafsky/slp3/2.pdf

### **QUESTION:**

> In your own words, what is a *lemma*? To illustrate, choose a lemma in English. Write the lemma and at least 3 other forms of that lemma.

### **ANSWER:**

> A lemma is the base (dictionary) form shared by a set of related word forms. For example, *run* is the lemma for *run*, *runs*, *running*, and *ran*.

-----

Once we have a list of all the words in a text, it's straightforward to get both the type count and the token count. For the token count, we simply compute the length of the list of words, keeping in mind that, for now at least, that list also includes punctuation and other characters we may or may not want to include in the token count.

To get the type count, we can use a Python function called `set()`. In fact, the set is yet another data structure in Python. Just as in set theory, a set in Python is a collection of elements with no duplicates. When we run a command like the following:

```
alice_vocab = set(alice_words)
```

we are producing a new variable - a set of the unique word types (and punctuation types) in "Alice in Wonderland". We can again use the `len()` function to count the elements in the set.





In [None]:
### first let's create the set of word types
alice_vocab = set(alice_words)

### now we can get the counts
alice_token_count = len(alice_words)
alice_type_count = len(alice_vocab)
print("Tokens in AiW:", alice_token_count)
print("Types in AiW:", alice_type_count)

Tokens in AiW: 34110
Types in AiW: 3016


### **QUESTION:**

> What are the token and type counts for the Gutenberg text you selected? Include the code you use to compute these figures.

### **ANSWER:**

> The token count for Macbeth is 23140 and the type count is 4017. The code and its output are in the cell below.

-----




In [None]:
macbeth_vocab = set(macbeth_words)
macbeth_tks = len(macbeth_words)
macbeth_typ = len(macbeth_vocab)
print("Tokens in Macbeth: ", macbeth_tks)
print("Types in Macbeth: ", macbeth_typ)

Tokens in Macbeth:  23140
Types in Macbeth:  4017


The relationship between the type count and the token count tells us how varied the vocabulary of a text is - this measure is often referred to as **lexical diversity** or **lexical richness**. 

You can read more about this, and see ways of computing this measure, in section 1.4 of Chapter 1 of the NLTK book: http://www.nltk.org/book/ch01.html
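
As a rough sketch of the idea, lexical diversity can be computed as the type count divided by the token count. The word list below is hypothetical toy data, and the function name simply follows the one used in the NLTK book:

```python
def lexical_diversity(tokens):
    """Type-token ratio: distinct items divided by total items."""
    return len(set(tokens)) / len(tokens)

# hypothetical toy data, standing in for a real word list
toy_words = ["the", "cat", "sat", "on", "the", "mat"]
print(lexical_diversity(toy_words))      # 5 types / 6 tokens

# Caution: a string is also a sequence, so passing a file name (or raw text)
# silently measures the diversity of its *characters*, not its words
print(lexical_diversity("the cat sat on the mat"))
```

Both calls run without error, so it is worth checking that the argument really is a list of words before trusting the score.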

### **QUESTION:**

> What is the lexical diversity score for your selected text? Include the code you use to compute these figures.

### **ANSWER:**

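> The lexical diversity for Macbeth is about 0.17: the number of distinct word types is roughly 17% of the total number of tokens (4017 / 23140). The code is in the cell below. Note that the function has to be given the word list `macbeth_words`, not the fileid string `macbeth`; passing the string measures the characters in the file name and returns a misleading 0.61.

-----



In [None]:
def lexical_diversity(text):
    """Return the ratio of distinct types to total tokens."""
    return len(set(text)) / len(text)

### pass the list of words, not the fileid string
round(lexical_diversity(macbeth_words), 4)

0.1736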

Finally, when we perform these counts, we need to decide how we want to handle case. Specifically, do we want to have separate counts for the upper- and lower-case forms of a word? Do we want to count *Cat* separately from *cat*? Usually, the answer to that question is NO, and this is where **case normalization** comes into play. 

Case normalization simply means that we flatten out the case distinctions in a corpus or a text. There are more and less complicated ways to perform case normalization, and the right one to choose depends on what we want to do with the data. If we're interested in finding company names in texts, for example, we probably want to keep words that start with a capital letter, at least when they occur somewhere other than at the start of a sentence. The most typical strategy is to convert every word to all lower case. 

We can do this using Python string methods and a list comprehension.


In [None]:
### list comprehension: applies lower() to every word in alice_words
alice_words_lower = [w.lower() for w in alice_words]

### token count hasn't changed
### but the casing has
print(len(alice_words_lower))
print(alice_words_lower[:10])

34110
['[', 'alice', "'", 's', 'adventures', 'in', 'wonderland', 'by', 'lewis', 'carroll']


Notice that this is a slightly different way to use list comprehensions. In Lab 3, we used list comprehensions to apply conditions to lists and create new lists with subsets of the previous lists. For example:

In [None]:
### make a list of all words in Alice that start with 'a'
alice_awords = [w for w in alice_words if w.startswith('a') or w.startswith('A')]

print(len(alice_awords))
print(alice_awords[:10])

3402
['Alice', 'Adventures', 'Alice', 'and', 'and', 'a', 'Alice', 'as', 'as', 'and']


Now, instead of applying a condition, we're applying a string method to *every* word in the text. We can also combine the two (transforming the list items and applying a condition).

In [None]:
### keep only the words that begin with a capital 'A', and lower-case them
new_alice = [w.lower() for w in alice_words if w.startswith('A')]

print(len(new_alice))
print(new_alice[:10])

566
['alice', 'adventures', 'alice', 'alice', 'alice', 'a', 'alice', 'alice', 'alice', 'alice']
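
Lower-casing everything is the simplest strategy, but the company-name scenario mentioned earlier calls for something more selective. The following is only a sketch of one such approach, lower-casing just the first word of each sentence; the sentences and the helper name `lower_sentence_initial` are made up for illustration:

```python
# hypothetical toy data in the same shape as a list of sentences
sents = [
    ["The", "company", "Apple", "sells", "phones", "."],
    ["Apple", "also", "sells", "laptops", "."],
]

def lower_sentence_initial(sentence):
    """Lower-case only the first word of a sentence; leave the rest as is."""
    if not sentence:
        return []
    return [sentence[0].lower()] + sentence[1:]

normalized = [lower_sentence_initial(s) for s in sents]
print(normalized)
```

Notice the weakness: a name that happens to start a sentence (the second *Apple*) is lower-cased too, so this approach trades one kind of error for another.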


### **QUESTION:**

> Create a case-normalized version of your text. Print out the first 10 words of the original version and the first 10 words of the case-normalized version. Include the code you used to do this.

### **ANSWER:**

> The code is in the cell below. A list comprehension applies `lower()` to every word in `macbeth_words` (with no filtering condition), so the token count stays the same and only the casing changes.

In [None]:
### case-normalize every word in the text
macbeth_words_lower = [w.lower() for w in macbeth_words]
print(len(macbeth_words))
print(macbeth_words[:10])
print(len(macbeth_words_lower))
print(macbeth_words_lower[:10])

23140
['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']']
23140
['[', 'the', 'tragedie', 'of', 'macbeth', 'by', 'william', 'shakespeare', '1603', ']']



## **Building your own corpus in NLTK**

Now we're going to see how to build our own corpus within NLTK. In the next lab we'll learn how to import our own texts into Python without using NLTK as an intermediary. Before proceeding, please read sections 1.8 and 1.9 in Chapter 2 of the NLTK book: http://www.nltk.org/book/ch02.html 

NLTK gives us a generic corpus reader which can read in a collection of text files and let you work with them using NLTK’s many text analysis functions. The function we will use is NLTK’s `PlaintextCorpusReader`, and for today we will download texts from the [Project Gutenberg website](https://www.gutenberg.org/).

1. We'll be building our corpus within Colab - this does mean you'll need to upload your texts each time you sign in from a new location, unless you create your corpus within Google Drive and mount your Drive. For now, though, here's the simplest way. Create a directory by clicking the folder icon to the left of your screen, right-clicking, and picking "New folder". Give it a name that has to do with your corpus, and no spaces in the name of the folder. I've named mine `corpus`. 

1. Using Gutenberg's search and/or browsing functions, choose at least three texts to add to your corpus. Download the version of the text that comes in `.txt` format - this is the `Plain text UTF-8` version. Open the plain text version, then right-click and save on your computer. (You can use text files from other sources too.)

1. Next, upload the text files to your new `corpus` directory on Colab. I've selected *Three Lives* by Gertrude Stein, *The Scarlet Letter* by Nathaniel Hawthorne, and *Jack the Giant Killer* by Percival Leigh.
  
1. Now we are ready to work with our texts in Python. First we need to import the `PlaintextCorpusReader`.



In [None]:
from nltk.corpus import PlaintextCorpusReader

5. The next step is to tell Python the name of the directory in which our text files are stored. This is what’s known as the **root directory** for the corpus. I’m showing the path for my root directory. In Colab, this should be simple: just the name of the folder that you have uploaded your text files to, followed by a forward slash.

In [None]:
corpus_root = r"corpus/"

6. Now we will call the function `PlaintextCorpusReader`. This function has two arguments – (a) the root directory for the corpus; and (b) a list containing the names of the files we want to read into Python (each file name should occur as a string). I'm storing the output of the function to a variable which is essentially the name I've given to my corpus:

In [None]:
funtexts1 = PlaintextCorpusReader(corpus_root, ['jack.txt', 'scarlet.txt', 'threeLives.txt'])

7. Of course, we may not always want to list every file name in the corpus root directory. Instead of explicitly providing a list of filenames, another option is to give a pattern as the second argument. So if I want to add every file that ends in `txt` to my corpus, I can do the following instead. The pattern here uses regular expression language. (NOTE: in this case, `funtexts1` and `funtexts2` are two corpus objects with the same content, provided the directory holds only those three files):

In [None]:
funtexts2 = PlaintextCorpusReader(corpus_root, '.*txt')
print(funtexts2.fileids())

['dracula.txt', 'frankie.txt', 'jekyll.txt']


We can use `fileids()` to see a list of the file names in our newly-created corpus.
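
The pattern `'.*txt'` is a regular expression, not a shell-style wildcard, so the dot means "any character". The sketch below uses Python's `re` module on a few made-up file names to show what that pattern accepts, compared with a stricter one. It assumes a name is selected only when the pattern matches the whole name, which is my understanding of how the corpus reader applies it; check the NLTK documentation if this matters for your corpus.

```python
import re

# hypothetical file names, for illustration only
names = ["jack.txt", "scarlet.txt", "notes_txt", "readme.md"]

loose = ".*txt"      # the pattern used above: anything ending in 'txt'
strict = r".*\.txt"  # requires a literal dot before 'txt'

# assumption: a name is selected only if the pattern matches all of it
loose_hits = [n for n in names if re.fullmatch(loose, n)]
strict_hits = [n for n in names if re.fullmatch(strict, n)]

print(loose_hits)
print(strict_hits)
```

For a folder that only holds `.txt` files the two patterns select the same files, so the simpler one is fine here.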

Now try this with your own texts. From this point on, fill in the Python commands with words and variables relevant to your new corpus.

We can now get the word list for the whole corpus, using the `words` method.

In [None]:
fun_words = funtexts2.words()
print(len(fun_words))

320241


We can also pull the words for just one text at a time, using the same approach that we have used for NLTK's small chunk of the `gutenberg` corpus.

In [None]:
scarlet_words = funtexts2.words(fileids='scarlet.txt')
print(len(scarlet_words))
print(scarlet_words[:10])

### **QUESTION:**

> Now comes the fun of working with your own texts! For the corpus that you created, please answer each of the following questions. Include the code you used to arrive at the answers.

* a. What are the titles and authors of the texts you selected?
* b. How many words are in each text? Give both the type count and the token count.
* c. What is the lexical diversity score for each text?

> Next, take a look at these results. Are there any surprises? What, if anything, can you learn from comparing the lexical diversity scores for the different texts?

> Can you see any problems with doing the analysis in this way? What simplifications or assumptions are we making that might not lead to a good analysis?

### **ANSWER:**

> a. The texts I've chosen are: Dracula by Bram Stoker, The Strange Case of Dr. Jekyll and Mr. Hyde by Robert Louis Stevenson and Frankenstein; Or The Modern Prometheus by Mary Shelley.

> b. Total words in the corpus: 320241. Token and type counts for each text:

* Dracula: 196321 tokens, 10727 types
* Frankenstein: 89381 tokens, 7906 types
* Dr. Jekyll and Mr. Hyde: 34539 tokens, 4722 types

> c. Lexical diversity scores:

* Dracula: 0.054640104726442915
* Frankenstein: 0.08845280316845862
* Dr. Jekyll and Mr. Hyde: 0.1367150178059585
> No real surprises here, just some questions about the lexical diversity scores and about which function would be iterable with the `PlaintextCorpusReader`. I think there's probably a simpler way to calculate the lexical diversity score that doesn't feel so clunky.

In [None]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = r"/content/corpus"
spooky_txts = PlaintextCorpusReader(corpus_root, '.*txt')
print(spooky_txts.fileids())
total_words = len(spooky_txts.words())
print("Total words in the corpus:", total_words)

dracula_words = spooky_txts.words(fileids='dracula.txt')
frankie_words = spooky_txts.words(fileids='frankie.txt')
jekyll_words = spooky_txts.words(fileids='jekyll.txt')
print("Total words in Dracula:", len(dracula_words))
print("Total words in Frankenstein:", len(frankie_words))
print("Total words in Dr. Jekyll and Mr. Hyde:", len(jekyll_words))

dracula_vocab = set(dracula_words)
frankie_vocab = set(frankie_words)
jekyll_vocab = set(jekyll_words)
dracula_type = len(dracula_vocab)
frankie_type = len(frankie_vocab)
jekyll_type = len(jekyll_vocab)
print("Total type counts in Dracula:", dracula_type)
print("Total type counts in Frankenstein:", frankie_type)
print("Total type counts in Dr. Jekyll and Mr. Hyde:", jekyll_type)

### lexical diversity = type count / token count, reusing the word lists and
### vocabularies built above (a PlaintextCorpusReader object itself is not
### iterable, so it can't be passed to lexical_diversity() directly)
dracula_div = dracula_type / len(dracula_words)
frankie_div = frankie_type / len(frankie_words)
jekyll_div = jekyll_type / len(jekyll_words)
print("Lexical diversity score for Dracula:", dracula_div)
print("Lexical diversity score for Frankenstein:", frankie_div)
print("Lexical diversity score for Dr. Jekyll and Mr. Hyde:", jekyll_div)


['dracula.txt', 'frankie.txt', 'jekyll.txt']
Total words in the corpus: 320241
Total words in Dracula: 196321
Total words in Frankenstein: 89381
Total words in Dr. Jekyll and Mr. Hyde: 34539
Total type counts in Dracula: 10727
Total type counts in Frankenstein: 7906
Total type counts in Dr. Jekyll and Mr. Hyde: 4722
Lexical diversity score for Dracula: 0.054640104726442915
Lexical diversity score for Frankenstein: 0.08845280316845862
Lexical diversity score for Dr. Jekyll and Mr. Hyde: 0.1367150178059585
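
One way to make this kind of per-text bookkeeping less repetitive is to wrap the three measurements in a function and loop over the texts. This is only a sketch: `text_stats` and `toy_corpus` are hypothetical names, and the tiny word lists stand in for what `words(fileid)` would return from a real corpus reader.

```python
def text_stats(words):
    """Return (token count, type count, lexical diversity) for a word list."""
    tokens = len(words)
    types = len(set(words))
    return tokens, types, types / tokens

# hypothetical toy corpus: a dict mapping a label to a list of words;
# with a real corpus reader, each list would come from reader.words(fileid)
toy_corpus = {
    "short": ["a", "rose", "is", "a", "rose"],
    "longer": ["a", "rose", "is", "a", "rose", "is", "a", "rose", "is", "a"],
}

stats = {name: text_stats(words) for name, words in toy_corpus.items()}
for name, (tokens, types, diversity) in stats.items():
    print(name, tokens, types, round(diversity, 3))
```

The toy data also hints at a limitation of the measure: the longer text reuses the same three types, so its score is lower. Scores computed this way tend to fall as texts get longer, which matches the ordering of the three novels above and makes comparisons between texts of very different lengths hard to interpret.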


# 3. Data structures: tuples and dictionaries

So far we have worked primarily with the data structures of strings and lists, and the recently-introduced sets. Now we're going to look at some additional data structures in Python: **tuples** and **dictionaries**.

**READINGS:**

* tuples: Python Crash Course chapter 4, pages 69-74
* dictionaries: Python Crash Course chapter 6, pages 95-115

Now that you've read about our two new data structures, let's spend a little time working with them. 

## **Tuples**

Tuples are very similar to lists, as both lists and tuples are ordered collections of values. Unlike lists, tuples are **immutable** - this means they (mostly) can't be changed once they've been established.

Tuples usually are displayed between parentheses, but there are in fact several different ways to create a new tuple, as shown below.



In [None]:
# different ways to create a tuple
empty_tuple = ()
simple_tuple = 'a', 'b', 'c' 
the_same_simple_tuple = ('a', 'b', 'c')
the_same_simple_tuple_again = tuple( ['a', 'b', 'c'] )
print( simple_tuple, the_same_simple_tuple, the_same_simple_tuple_again )

('a', 'b', 'c') ('a', 'b', 'c') ('a', 'b', 'c')
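
To see what "immutable" and "(mostly)" mean in practice, here is a small sketch with made-up values. It also previews one reason tuples are useful for text analysis: because they can't change, they can serve as dictionary keys.

```python
pair = ("the", "cat")

# indexing works just as it does for lists
print(pair[0], pair[-1])

# assigning to a position raises a TypeError
try:
    pair[0] = "a"
    changed = True
except TypeError:
    changed = False
print(changed)

# "mostly": a tuple can hold a mutable object, and that object can still change
mixed = ("counts", [1, 2])
mixed[1].append(3)
print(mixed)

# tuples of strings can be used as dictionary keys, unlike lists
bigram_counts = {("the", "cat"): 1}
print(bigram_counts[pair])
```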


Now let's look at a use for tuples in text analysis. We're going to use tuples to represent two-word sequences (aka **bigrams**) in our texts, and we'll build a list of those bigram tuples to collect all of the two-word sequences in the text. We'll do this by taking the following steps:

1. Use a `for` loop to iterate through each word in the word list, up to the next-to-last word. We stop before the last word, because we are building two-word tuples, and there's no word coming after the last word.
1. Instead of using the words to drive the `for` loop, we'll use a numerical sequence. This way, we can use the number as the index of the first word in the sequence, and the number + 1 as the index of the second word in the sequence. This is accomplished using `range()` - the `range` function is discussed in Python Crash Course, pages 61-63. 
1. For each word pair, we'll create a tuple and add that tuple to a list of all tuples.

To illustrate, let's first do this for a short list of words.

In [None]:
### create a short list of words
mylist = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

### create an empty list to store our bigrams
bigrams = []

### iterate through the list of words,
### creating a bigram tuple for each 2-word sequence,
### and adding that tuple to the list of bigrams

for i in range(len(mylist)-1):
    first = mylist[i]
    second = mylist[i+1]
    bigram = (first, second)
    print(bigram)
    bigrams.append(bigram)

('the', 'cat')
('cat', 'sat')
('sat', 'on')
('on', 'the')
('the', 'mat')
('mat', '.')


Now we can do this for a longer text, and using code that's a bit more compact.

In [None]:
scarlet_bigrams = []

### using The Scarlet Letter
for i in range(len(scarlet_words)-1):
    bigram = (scarlet_words[i], scarlet_words[i+1])  ### two word sequence
    scarlet_bigrams.append(bigram)

### how many bigrams?
print(len(scarlet_bigrams))
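
For a plain Python list, the same pairs can also be built without explicit indices by zipping the list with a copy of itself shifted by one position. The sketch below checks, on the short example list, that this gives the same result as the index-based loop. (NLTK also provides `nltk.bigrams` for this purpose; it is not used here so that the example runs without NLTK.)

```python
# same toy list as in the first bigram example
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

# zip pairs each word with its successor; it stops at the shorter input,
# so the last word never starts a bigram
bigrams_zip = list(zip(words, words[1:]))

# the index-based loop, for comparison
bigrams_loop = []
for i in range(len(words) - 1):
    bigrams_loop.append((words[i], words[i + 1]))

print(bigrams_zip == bigrams_loop)
print(len(words), len(bigrams_zip))
```

Either way, a text with N tokens always yields N - 1 bigrams.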

### **QUESTION:**

> Build a list of bigrams for the gutenberg text you selected. Include your code. How many bigrams in your text?

### **ANSWER:**

> There are 23139 bigrams in Macbeth (one fewer than the 23140 tokens); the code is shown below.

-----


In [None]:
macbeth_bigrams = []
for i in range(len(macbeth_words)-1):
  bgm = (macbeth_words[i], macbeth_words[i+1])
  macbeth_bigrams.append(bgm)
print(macbeth_bigrams[:10])
len(macbeth_bigrams)

[('[', 'The'), ('The', 'Tragedie'), ('Tragedie', 'of'), ('of', 'Macbeth'), ('Macbeth', 'by'), ('by', 'William'), ('William', 'Shakespeare'), ('Shakespeare', '1603'), ('1603', ']'), (']', 'Actus')]


23139


## **Dictionaries**

Now that you've read all about dictionaries, let's see how we can use them. A first use of dictionaries is to keep track of how often each word in a text appears. Remember that dictionaries let us map a **key** to a **value** - in this case, the word will be the key, and the number of times it occurs will be the value. We could build this dictionary by using the `count()` function over and over, but it's more efficient to do this by iterating over a text and adding one to the value for the word each time it occurs. 

In [None]:
### First, initialize our dictionary
mydict = {}

### Next, we'll iterate through our text (as a list of words)
### For each word, we'll first check whether it's already in the dictionary
### If it is, we add one to the count for the word
### If it's not, we add it as a new key to the dictionary, with a value of 1
for w in scarlet_words:
    if w in mydict:                    ## check whether w is already in the dict
        mydict[w] = mydict[w] + 1      ## alternately: mydict[w] += 1
    else:
        mydict[w] = 1                  ## create a dictionary entry w/ value 1

### Recall that we get the value for a dictionary key by 
### looking it up, as if it were an index

print("The word 'scarlet' occurs", mydict['scarlet'], "times.")
print("The word 'letter' occurs", mydict['letter'], "times.")

### Compare to the counts we get using the count() function
print(scarlet_words.count("scarlet"))
print(scarlet_words.count("letter"))
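
The if/else tally above is a common enough pattern that Python offers shortcuts for it. The sketch below uses a made-up word list (an assumption, standing in for `scarlet_words`) and shows two equivalent alternatives: `dict.get()` with a default of 0, and `collections.Counter` from the standard library.

```python
from collections import Counter

### made-up word list standing in for a real text
words = ['the', 'scarlet', 'letter', 'and', 'the', 'letter', 'the']

### version 1: dict.get(key, 0) returns 0 when the key is not there yet
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

### version 2: Counter does the same tallying in one step
counter = Counter(words)

print(counts['the'], counter['the'])
print(counts == dict(counter))
```

Each call to `count()` scans the whole list, so counting every vocabulary word that way rescans the text once per word. The single loop does all the counting in one pass.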


### **QUESTION:**

> Now you try it. Create a count dictionary for the text you selected. Choose several interesting words and display how often they appear. Compare these counts to the ones you get using `count()`. Are they the same? Include the code you use.

### **ANSWER:**

> The word 'honor' appears 2 times.
The word 'blood' appears 15 times.
The word 'Macbeth' appears 61 times.
The word 'death' appears 9 times.
I got the same results using the count function: 2, 15, 61, 9
code shown below:

-----


In [None]:
new_dict = {}
for w in macbeth_words:
  if w in new_dict:
    new_dict[w] = new_dict[w] + 1
  else:
    new_dict[w] = 1

print("The word 'honor' appears", new_dict['honor'], "times.")
print("The word 'blood' appears", new_dict['blood'], "times.")
print("The word 'Macbeth' appears", new_dict['Macbeth'], "times.")
print("The word 'death' appears", new_dict['death'], "times.")
print(macbeth_words.count('honor'))
print(macbeth_words.count('blood'))
print(macbeth_words.count('Macbeth'))
print(macbeth_words.count('death'))

The word 'honor' appears 2 times.
The word 'blood' appears 15 times.
The word 'Macbeth' appears 61 times.
The word 'death' appears 9 times.
2
15
61
9


### **OPTIONAL BONUS:**

> Create a case-normalized count dictionary. (Show your code.) How do the counts of your target words change after doing case normalization?

### **BONUS ANSWER:**

> (your answers here)

-----

## **Combining tuples and dictionaries for a bigram count dictionary**

A nifty fact about tuples is that, unlike lists, they can be used as keys in a dictionary. This works precisely because they are immutable. Mutable lists are no good as dictionary keys, because we can't count on them staying the same.
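
A short sketch of the difference, using a made-up dictionary name (`pair_counts`): the tuple key is accepted, while the list key makes Python raise a `TypeError`.

```python
pair_counts = {}

### a tuple works as a dictionary key
pair_counts[('the', 'cat')] = 1
print(pair_counts[('the', 'cat')])

### a list does not: Python raises a TypeError
try:
    pair_counts[['the', 'cat']] = 1
    list_key_worked = True
except TypeError as err:
    list_key_worked = False
    print("TypeError:", err)
```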

We'll now combine our previous two exercises in order to build a dictionary of bigram counts. Remember that we represented each bigram as a tuple. Now these will take the place of our single-word keys from the previous exercise.

*Sidetrip: You may wonder why we care about two-word sequences. Bigrams, and in particular the frequencies with which bigrams occur, are in fact the foundation of language modeling technology. This is the technology that is used for predictive text (and many other applications). Bigram frequencies are simple but powerful!*
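
To give a rough feel for how bigram counts can drive predictive text, here is a toy sketch. The word list is made up, and `predict_next` is a hypothetical helper written for this illustration. Real predictive-text systems are far more elaborate.

```python
### made-up word list; a real model would be built from a large corpus
words = ['i', 'love', 'my', 'cat', 'and', 'i', 'love', 'my', 'dog',
         'and', 'i', 'love', 'tea']

### count bigrams with a dictionary keyed by tuples
bigram_counts = {}
for i in range(len(words) - 1):
    pair = (words[i], words[i + 1])
    bigram_counts[pair] = bigram_counts.get(pair, 0) + 1

def predict_next(word):
    ### hypothetical helper: return the most frequent follower of `word`,
    ### or None if `word` never starts a bigram
    best_word, best_count = None, 0
    for (first, second), count in bigram_counts.items():
        if first == word and count > best_count:
            best_word, best_count = second, count
    return best_word

print(predict_next('love'))
print(predict_next('zebra'))
```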

### **QUESTION:**

> Put the things you learned from this section together and write code to create a dictionary of bigram frequencies. In this dictionary, each bigram should be a dictionary key, and the value of that should be the frequency of how often that bigram occurs in your selected text. Show your code!

> Next, check your dictionary to see how frequent each of these bigrams is:

* walk on
* and the
* Python rules
* my love
* three more bigrams of your choice

### **ANSWER:**

> (your answers here)

In [None]:
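bigramdict = {}
### count every bigram in the text, using the bigram tuples as keys
for bgm in macbeth_bigrams:
  if bgm in bigramdict:
    bigramdict[bgm] = bigramdict[bgm] + 1
  else:
    bigramdict[bgm] = 1

### look up each target bigram; get() returns 0 for bigrams that never occur
bgm_list = [('walk', 'on'), ('and', 'the'), ('Python', 'rules'), ('my', 'love'), ('run', 'on'), ('out', 'of'), ('black', 'coffee')]
for bgm in bgm_list:
  print(bgm, bigramdict.get(bgm, 0))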

## **Make it interactive!**

You can use the following code to test new bigrams in your dictionary. Notice a few new bits of Python: `input()` can be used to get input from the user (more on this next week). `format()` gives greater control over string formatting.

To use the code below, you'll need to change `pairDict` to the name of the bigram dictionary you created.

In [None]:
# code to test the dictionary
word1 = input('First Word:')
word2 = input('Second Word:')
wordPair = (word1, word2)
print('The pair {} occurs {} times in your text'.format(wordPair, pairDict[wordPair]))
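
Note that `pairDict[wordPair]` raises a `KeyError` when the pair never occurs in the text. Below is a minimal sketch of a more forgiving lookup. The tiny `pairDict` here is made up and stands in for your own dictionary, and the helper name `pair_count` is hypothetical.

```python
### made-up bigram dictionary standing in for your own
pairDict = {('and', 'the'): 12, ('my', 'love'): 3}

def pair_count(word1, word2):
    ### get() returns the default (0) instead of raising a KeyError
    return pairDict.get((word1, word2), 0)

for w1, w2 in [('and', 'the'), ('Python', 'rules')]:
    print('The pair {} occurs {} times in your text'.format((w1, w2), pair_count(w1, w2)))
```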

# 4. The `FreqDist` function

Now you know how to build frequency dictionaries for a text, both for single word types and for bigrams. If we look at the frequency counts for all of the words in the text, what we have is a **frequency distribution** over words in a text. A frequency distribution is a set of collected statistics about the frequencies of any kind of countable event. 

When we create a frequency distribution related to a text, we usually count the frequency of each word in the vocabulary within the text.
Frequency distributions have some nice mathematical properties, and NLTK has a function designed specifically to create and work with frequency distributions. This is the `FreqDist` function. 

**READING:** Please read Section 3 of Chapter 1 of the NLTK book: http://www.nltk.org/book/ch01.html

First, let's make a frequency distribution for `text7`, the collection of Wall Street Journal articles. The input to `FreqDist` is a list of words, so we can do this with any list of words (or any list of other items, for that matter).

**NOTE:** If you get an error saying that `FreqDist` does not exist, uncomment the import statement on the first line of the code block below and try again. (Explicitly importing a function you want to use is one strategy to try, whenever you get this type of error.)



In [None]:
# from nltk.probability import FreqDist
### The function FreqDist creates a frequency distribution, which we store as the variable `fdist7`
### I chose 7 because it is for text7

fdist7 = FreqDist(text7)
print(fdist7)

<FreqDist with 12408 samples and 100676 outcomes>


The frequency distribution we produce is another special type within NLTK. When we print it out, we see some additional information - it is of type `FreqDist` and has 12408 samples and 100676 outcomes. The number of samples is the number of different events we're counting - in this case, it's the number of words in the vocabulary (should be the same as `len(set(text7))` -- test this!). The number of outcomes is the same as the token count (test this too!).

The `FreqDist` data type is very similar to a dictionary, and we can (for the most part) use it like a dictionary. In this dictionary, the keys are words, and the values are the frequencies of those words - the number of times they occur in the text.
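
To make "samples" and "outcomes" concrete, here is a small sketch using `collections.Counter` from the standard library on a made-up word list. In NLTK 3, `FreqDist` is built on top of `Counter`, so the same expressions should carry over to `fdist7`. Treat that as an assumption to check against your installed version.

```python
from collections import Counter

### made-up word list standing in for text7
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
fd = Counter(words)

### samples: the number of distinct items counted
num_samples = len(fd)
### outcomes: the total number of tokens counted
num_outcomes = sum(fd.values())

print(num_samples == len(set(words)))
print(num_outcomes == len(words))

### unlike a plain dictionary, a Counter gives 0 for items it has never seen
print(fd['the'], fd['dog'])
```

With a real `FreqDist`, the outcome total is also available as `fdist7.N()`.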

In [None]:
### by simply typing in the variable to which we stored the frequency distribution,
### we get a lot of information
fdist7

In [None]:
### we can also inspect the frequency of individual words quite easily
print(fdist7["company"])
print(fdist7["music"])
print(fdist7["musicians"])
print(fdist7["bankers"])
print(fdist7["banker"])


### **QUESTION:**

> Create a frequency distribution of your selected gutenberg text. Check the counts of the words you looked up before, with the earlier version of your frequency dictionary. Do the counts match? Include your code.

### **ANSWER:**

> Yes, the counts match the earlier counts from my frequency dictionary and from `count()`. Code shown below.

-----

In [None]:
freq_m = FreqDist(macbeth_words)
print(freq_m)
print(freq_m['honor'])
print(freq_m['blood'])
print(freq_m['Macbeth'])
print(freq_m['death'])
print(macbeth_words.count('honor'))
print(macbeth_words.count('blood'))
print(macbeth_words.count('Macbeth'))
print(macbeth_words.count('death'))

<FreqDist with 4017 samples and 23140 outcomes>
2
15
61
9
2
15
61
9


## **`FreqDist` methods**

There are some built-in methods for frequency distributions that can be used to show informative words in texts:

* `most_common()` - takes one argument (# of items X) and then shows the X most frequent items in the text, along with their counts
* `hapaxes()` - returns a list of **hapax legomena** - words that occur only one time in the text 
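
To show what these two methods compute, the sketch below re-creates them on a made-up word list with `collections.Counter`, which is used here as a stand-in so the example runs without NLTK. `most_common()` exists on `Counter` itself. The hapax list is built by hand and is only an approximation of what `hapaxes()` returns.

```python
from collections import Counter

### made-up word list standing in for a real text
words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
fd = Counter(words)

### the 2 most frequent items, with their counts
top_two = fd.most_common(2)
print(top_two)

### hapax legomena: items whose count is exactly 1
hapaxes = [w for w, c in fd.items() if c == 1]
print(hapaxes)
```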

In [None]:
print(fdist7.most_common(20))

[(',', 4885), ('the', 4045), ('.', 3828), ('of', 2319), ('to', 2164), ('a', 1878), ('in', 1572), ('and', 1511), ('*-1', 1123), ('0', 1099), ('*', 965), ("'s", 864), ('for', 817), ('that', 807), ('*T*-1', 806), ('*U*', 744), ('$', 718), ('The', 717), ('``', 702), ("''", 684)]


In [None]:
print(fdist7.hapaxes())
print(len(fdist7.hapaxes()))

['Pierre', 'Elsevier', 'Agnew', 'fiber', 'resilient', 'lungs', 'symptoms', 'Loews', 'Micronite', 'spokewoman', 'properties', 'Dana-Farber', 'filter', '1953', '1955', 'Four', 'diagnosed', 'malignant', 'mesothelioma', 'asbestosis', 'morbidity', 'Groton', 'stringently', 'smooth', 'needle-like', 'classified', 'amphobiles', 'Brooke', 'pathlogy', 'Vermont', 'curly', 'Environmental', 'Protection', 'gradual', '1997', 'cancer-causing', 'outlawed', '160', 'Areas', 'dusty', 'burlap', 'sacks', 'bin', 'poured', 'cotton', 'acetate', 'mechanically', 'clouds', 'dust', 'hung', 'ventilated', 'Darrell', 'Yields', 'tracked', 'IBC', 'fraction', 'Compound', 'reinvestment', 'lengthened', 'longest', 'Donoghue', 'Longer', 'Shorter', 'Brenda', 'Malizia', 'Negus', 'rises', 'pour', 'Assets', '352.7', 'money-fund', 'Dreyfus', 'World-Wide', 'top-yielding', '9.37', '9.45', 'invests', 'waiving', '8.12', '8.14', '8.19', '8.22', '8.53', '8.56', 'J.P.', 'Bolduc', '83.4', 'energy-services', 'Terrence', 'Daniels', 'Royal'

### **QUESTION:**

> Table 3.1 in Chapter 1 of the NLTK book lists a number of different methods that can be applied to a `FreqDist` frequency distribution. Pick one of these, explain what it does in your own words, and show how to apply it in text (show your code).

### **ANSWER:**

> The method I chose is the .tabulate() method, which prints the frequency distribution of a given text (in this case Macbeth) as a table. I think this makes it easier to read than a long list. Code shown below.

-----

In [None]:
print(freq_m.most_common(10))
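### tabulate() prints the table itself and returns None,
### so wrapping it in print() also prints 'None' after the table
print(freq_m.tabulate(30))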

[(',', 1962), ('.', 1235), ("'", 637), ('the', 531), (':', 477), ('and', 376), ('I', 333), ('of', 315), ('to', 311), ('?', 241)]
   ,    .    '  the    :  and    I   of   to    ?    d    a  you   in   my  And   is that  not   it Macb with    s  his   be  The haue   me your  our 
1962 1235  637  531  477  376  333  315  311  241  224  214  184  173  170  170  166  158  155  138  137  134  131  129  124  118  117  111  110  103 
None


We can combine frequency counts with other tests to find interesting words in a text. For example, we can use a list comprehension to get a list of long words in a text - let's define *long* as having at least 10 characters. Then we can filter that list of long words by frequency (according to our frequency distribution), keeping only words that occur in the text/corpus at least 7 times.

In [None]:
### first, a list comprehension for creating a list of long words
wsj_long = [w for w in set(text7) if len(w) >= 10]
print(len(wsj_long))

2354


In [None]:
### next, we'll add a second condition to the list comprehension that uses 
### our frequency distribution
wsj_long_freq = [w for w in set(text7) if len(w) >= 10 and fdist7[w] >= 7]
print(len(wsj_long_freq))

152


Notice how we combine the two conditions inside the list comprehension:

* `if len(w) >= 10` checks for long words
* `and fdist7[w] >= 7` uses `and` to conjoin the two conditions - both have to be true for w to be selected
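
The same two-condition filter can be tried on a small scale. This sketch uses a made-up word list and a lower frequency threshold (2 instead of 7) so that the result is easy to check by hand. It also shows sorting by frequency instead of alphabetically.

```python
from collections import Counter

### made-up word list; Counter stands in for the frequency distribution
words = ['international', 'the', 'international', 'cat', 'administration',
         'international', 'administration', 'neighbourhood', 'the']
counts = Counter(words)

### keep words that are long (10+ characters) AND occur at least twice
long_frequent = [w for w in set(words) if len(w) >= 10 and counts[w] >= 2]

### alphabetical order
print(sorted(long_frequent))

### most frequent first
by_frequency = sorted(long_frequent, key=lambda w: counts[w], reverse=True)
print(by_frequency)
```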

Now we can sort the list and take a look at it.

In [None]:
print(sorted(wsj_long_freq))

['Associates', 'Association', 'California', 'Commission', 'Commonwealth', 'Connecticut', 'Constitution', 'Containers', 'Department', 'Georgia-Pacific', 'Greenville', 'Industrial', 'Industries', 'International', 'Investment', 'Management', 'Massachusetts', 'Mitsubishi', 'Pennsylvania', 'Philadelphia', 'Securities', 'Transportation', 'University', 'Waertsilae', 'Washington', 'acquisition', 'acquisitions', 'activities', 'additional', 'administration', 'advertisers', 'advertising', 'announcement', 'apparently', 'appropriations', 'assistance', 'association', 'bankruptcy', 'businesses', 'candidates', 'circulation', 'commercial', 'commission', 'commitments', 'competition', 'compromise', 'conditions', 'conference', 'confidence', 'congressional', 'considered', 'construction', 'continuing', 'contributed', 'convertible', 'current-carrying', 'debentures', 'department', 'development', 'difference', 'differences', 'discussions', 'electronics', 'eliminated', 'employment', 'enforcement', 'engineering'

### **QUESTION:**

> Why do you think we're more interested in high-frequency long words than in high-frequency short words?

> From this list of high-frequency long words, could you make any guesses about what kind of data the words come from, or what kinds of texts?

### **ANSWER:**

> I think high-frequency long words are more interesting because they are less common than short words, so they can help indicate what sort of content is in the text you are examining, as well as its genre or category.

> Based on the list of words, I would guess this is coming from the inaugural text or some other form of government documentation?

-----

## **Printing results to file**

There are a number of different ways to store results from your Python code in files. Today we will introduce one of the simplest methods - using `print()` to write results to a file.

In order to write to a file, we first need to create the file and ensure that it's ready to be written to.

Create the file by using `open()` - the first argument to `open` is the name you want your output file to have. The second argument (`'w'`) indicates that this file is being opened for writing:

In [None]:
outfile = open('myresults.txt','w')

Now we can use a `print` statement with an extra argument specifying which file object we want to print to. Basically, we are redirecting the `print` statement - everything that would normally be printed to the screen will now be printed to a file.

In [None]:
### I'm going to print my list of long, high-frequency words
### found in text7 to my output file
for w in wsj_long_freq:
    print(w, file=outfile)

### close the file when we're done printing to it
outfile.close()
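
A common alternative is to open the file in a `with` block, which closes the file automatically even if an error occurs partway through. The sketch below is a minimal illustration. The word list is made up, and the file is written to a temporary folder (an assumption made so the example does not clutter your files).

```python
import os
import tempfile

### made-up results standing in for wsj_long_freq
words_to_save = ['acquisition', 'advertising', 'bankruptcy']

### write into a temporary folder
out_path = os.path.join(tempfile.mkdtemp(), 'myresults.txt')

### the file is closed automatically at the end of the with block
with open(out_path, 'w') as outfile:
    for w in words_to_save:
        print(w, file=outfile)

### read the file back in to confirm what was written
with open(out_path) as infile:
    saved = infile.read().splitlines()

print(saved)
```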

You'll see that the file appears in your list of files in Colab (over to the left, click the folder icon). You can now double-click the name of the file and it'll be opened up in another window for inspection. You can also download the file.

# 5. Putting it all together

Now it's time to use frequency analysis to investigate a text. This time, select one of the texts you uploaded from the Project Gutenberg website. Apply at least three different methods you learned from this lab to your text. At least one of those should use a frequency distribution. Show all of your code, and include a short written discussion of your results. Print some of your results to a file, and submit that file along with the link to your notebook.


In [None]:
#choosing dracula text
freq_d = FreqDist(dracula_words)
print(freq_d.most_common(10))
print(freq_d.tabulate(40))
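### NOTE: this line filters Macbeth words by their WSJ (text7) frequencies;
### to analyze Dracula, use set(dracula_words) and freq_d[w] instead
mcb_long_freq = [w for w in set(macbeth_words) if len(w) >= 8 and fdist7[w] >= 5]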
print(mcb_long_freq)

draculadict = {}
for w in dracula_words:
  if w in draculadict:
    draculadict[w] = draculadict[w] + 1
  else:
    draculadict[w] = 1

print("The word 'blood' appears", draculadict['blood'], "times in Dracula.")
print("The word 'death' appears", draculadict['death'], "times in Dracula.")
print("The word 'monster' appears", draculadict['monster'], "times in Dracula.")

frankiedict = {}

for w in frankie_words:
  if w in frankiedict:
    frankiedict[w] = frankiedict[w] + 1
  else:
    frankiedict[w] = 1

print("The word 'blood' appears", frankiedict['blood'], "times in Frankenstein.")
print("The word 'death' appears", frankiedict['death'], "times in Frankenstein.")
print("The word 'monster' appears", frankiedict['monster'], "times in Frankenstein.")

submitfile = open('submitfile.txt', 'w')

for w in mcb_long_freq:
  print(w, file=submitfile)

submitfile.close()

[(',', 11099), ('.', 7549), ('the', 7474), ('and', 5803), ('I', 4846), ('to', 4662), ('of', 3707), ('a', 2955), ('in', 2466), ('that', 2436)]
    ,     .   the   and     I    to    of     a    in  that    he   was    it     ;     "     '    is   for    as    me   not   his  with   you    we    my    be   all    on  have    so   her   had    at     -   him   but    --     s which 
11099  7549  7474  5803  4846  4662  3707  2955  2466  2436  1996  1870  1808  1671  1561  1530  1498  1480  1476  1452  1393  1384  1283  1265  1250  1157  1124  1119  1055  1053  1036  1032  1030  1017   990   942   890   843   704   668 
None
['direction', 'According', 'construction', 'becoming', 'Pictures', 'question', 'Commission', 'presence', 'strength', 'familiar', 'troubled', 'consider', 'interest', 'assistance', 'starting', 'speculation', 'addition', 'continue', 'performance', 'reported', 'approach', 'especially', 'together', 'yesterday', 'concluded', 'continued', 'straight', 'something', 'decision', 

> The three methods are shown above in my code. First, I wanted to find the 40 most common words and produce a table of them. Next, I wanted to compare the word counts for the same keywords I chose for Macbeth. 'honor' produced an error (presumably because it does not appear in the text), so I added a new keyword to replace it. It's interesting to see how often the same keywords appear across similar genres, so that was my final method: comparing the keywords in Dracula, Macbeth, and Frankenstein.
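
Looking up a word that never occurs in a plain dictionary raises a `KeyError`, which is the likely cause of the error with 'honor'. Raw counts are also hard to compare across texts of different lengths. The sketch below addresses both points with made-up word lists standing in for `dracula_words` and `frankie_words`. The helper name `per_thousand` is hypothetical.

```python
def per_thousand(word, words):
    ### hypothetical helper: occurrences of `word` per 1000 tokens
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    ### get() returns 0 for a missing word instead of raising a KeyError
    return 1000 * counts.get(word, 0) / len(words)

### made-up texts of different lengths
text_a = ['blood'] * 2 + ['night'] * 8    ### 10 tokens
text_b = ['blood'] * 1 + ['ice'] * 19     ### 20 tokens

for keyword in ['blood', 'honor']:
    print(keyword, per_thousand(keyword, text_a), per_thousand(keyword, text_b))
```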

# 6. REMINDER: Wrapping up and submitting

Create a revision by going to **File > Save and Pin Revision**.

View your revision history at **File > Revision History**.

To submit: go to **Share** in the upper right corner, click **Get Shareable Link**, change the dropdown menu option to **Anyone with the link can edit**, and then **Copy Link**! This is what you'll submit on Canvas.