# Macbeth word frequency à la Python

I don't expect you to understand all this Python code!  

Things to notice:
* Lots of Python code is available online for you to take and use
* Python can be used to interact with web sites
* Python can be used to manipulate and analyze text 

The following is intended merely as an illustration of some things we can do with Python, and we'll use this code later in the quarter when looking into language processing.

## Requests and NLTK

We are going to use some Python code to grab the text of Macbeth from Project Gutenberg, and then use some other Python code to count how often each word appears in the play.

* **requests** will allow us to get content from web sites
  * "**Requests** is an elegant and simple HTTP library for Python, built for human beings."
  * https://requests.readthedocs.io/en/latest/
* **nltk** will allow us to do language processing
  * "**NLTK** is a leading platform for building Python programs to work with human language data."
  * https://www.nltk.org/

## Install and import

Before we can use the above libraries, we need to make sure their code is available (i.e. installed on the system).  We then need to "import" the code to make it available here to use.

This system already has a lot of libraries that you would otherwise have to install yourself.  (We'll talk about Python installation issues and a couple of technical details later in the course.)

In [None]:
# These "import" statements allow us to use the code from requests and nltk
# If requests imports without an error, then it's already installed.
import requests
import nltk
from nltk.tokenize import word_tokenize

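# This downloads the tokenizer data that nltk uses to split text into words.
# Newer NLTK releases may ask for 'punkt_tab' instead; if word_tokenize later
# raises a LookupError, try nltk.download('punkt_tab').
nltk.download('punkt')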
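
If you'd like to check whether a library is installed before importing it, one possible approach is sketched below. It uses `importlib.util.find_spec` from Python's standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

# Report on the two libraries this notebook relies on
for name in ["requests", "nltk"]:
    status = "installed" if is_installed(name) else "NOT installed"
    print(name, "is", status)
```

If a library shows up as not installed, the `import` statement for it would fail with an error.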

## Launch into some Python code

Download the text of Macbeth from [Project Gutenberg](https://www.gutenberg.org/) and save it to a file named 'macbeth.txt'.

In [None]:
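target_url = "https://www.gutenberg.org/files/1533/1533-0.txt"
# Download the page; stop with an error if the request failed
response = requests.get(target_url)
response.raise_for_status()
# Save the downloaded text to a file in the home directory
with open('/home/jovyan/macbeth.txt','w',encoding='utf-8') as f:
    f.write(response.text)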

**Important note**:  
* In this Jupyter environment, you will not be able to save new files in the directory containing this notebook.  If you want to save new files, you must save them in your "home" directory, which here is a directory called "/home/jovyan".
* We won't get into too many filesystem or Linux details in this course, but you should note that your "classwork" directory can only contain notebook files that I explicitly distribute to you, and nothing else.
* You can make new notebooks, upload notebooks, upload new data, save data files, and so on, as long as you do that in some directory other than your "classwork" directory.
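
Rather than typing the home directory out by hand, you can ask Python for it. The sketch below uses the standard-library `pathlib` module; the helper name `home_file` is made up for this illustration, and it only builds the path without writing anything.

```python
from pathlib import Path

def home_file(filename, home=None):
    """Build the full path to a file inside the home directory."""
    if home is None:
        home = Path.home()   # in this environment, this should be /home/jovyan
    return Path(home) / filename

macbeth_path = home_file("macbeth.txt")
print(macbeth_path)
```

A path built this way can be passed to `open()` in place of the hard-coded string.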

## Back to our analysis

Open the text file 'macbeth.txt' and form a list of words from the document. 

(Reading the file back in is an unnecessary step, since we could get the same text directly from `response.text` above, but it's included here to show how to read a file.)

In [None]:
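# Read the saved file back in, using the same encoding it was written with.
# The "with" block closes the file automatically when we're done.
with open('/home/jovyan/macbeth.txt', 'r', encoding='utf-8') as document_text:
    macbeth_text = document_text.read()
print(macbeth_text)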

Notice above that there are some "funny" characters.  Text is a very complex thing to work with on computers, and we'll look more closely at some of these odd characters later.

For now, we "tokenize" the play, that is, break it up into pieces.  Here we break the play text up into a huge list of individual words.

In [None]:
text_string = macbeth_text.lower()
text_tokens = word_tokenize(text_string)
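
To see what tokenizing produces without needing NLTK, here is a rough stand-in that uses Python's built-in `re` module. This is not how `word_tokenize` works; NLTK treats punctuation and contractions more carefully and will give somewhat different tokens. The function name `simple_tokenize` is made up for this illustration.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

sample = "Double, double toil and trouble; Fire burn and cauldron bubble."
tokens = simple_tokenize(sample)
print(tokens)
```

The result is a plain Python list of lowercase words, which is the same kind of object that `text_tokens` holds.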

Build a dictionary that maps each tokenized word to the number of times it occurs.

Example: `frequency['the'] = 2` would indicate 'the' occurs twice.

In [None]:
frequency = {}
for word in text_tokens:
    # Look up the count so far (0 if the word is new), then add one
    count = frequency.get(word,0)
    frequency[word] = count + 1
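
The same counting can also be sketched with `collections.Counter` from the standard library, which handles the look-up-and-add-one bookkeeping for you. The short token list here is made up so the example runs on its own; in this notebook you would pass `text_tokens` instead.

```python
from collections import Counter

sample_tokens = ["double", "double", "toil", "and", "trouble", "and"]

# Manual loop, the same idea as the cell above
manual = {}
for word in sample_tokens:
    manual[word] = manual.get(word, 0) + 1

# Counter builds the same word -> count mapping in one step
counted = Counter(sample_tokens)

print(manual)
print(counted["double"])
print(dict(counted) == manual)
```

Both approaches give the same counts; the manual loop just makes each step visible.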

Print the count for every tokenized word.

To print only the words that match, comment out the first `print` line and uncomment the last two lines.

In [None]:
# Looping over a dictionary visits each of its keys (here, each word)
for word in frequency:
    print(word, frequency[word])
#     if 'code' in word:
#         print(word, frequency[word])
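
Printing every word gets long quickly. One possible follow-up, sketched here with a tiny made-up dictionary standing in for the real `frequency`, is to sort by count or to keep only the words that contain some search text. The helper names `top_words` and `matching_words` are hypothetical.

```python
def top_words(frequency, n):
    """Return the n (word, count) pairs with the highest counts."""
    ranked = sorted(frequency.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

def matching_words(frequency, fragment):
    """Return only the entries whose word contains the given fragment."""
    return {word: count for word, count in frequency.items() if fragment in word}

# Made-up counts, just for illustration
sample_frequency = {"the": 5, "and": 3, "dagger": 1, "daggers": 2}

print(top_words(sample_frequency, 2))
print(matching_words(sample_frequency, "dagger"))
```

In this notebook you could call these with `frequency` in place of `sample_frequency`.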

# Before we end...

## Refreshing the notebook
* If you mess things up in this notebook and want to completely start all over again with an original copy of the notebook, go up to the top and click on "Start Over"
* This will erase your current work and allow you to begin with a fresh copy
  
## Saving your work
* You should **always** save your notebook periodically.
* If you are idle in this environment for too long, the system may time out your session and you'll lose unsaved work.
* If you lose your internet connection, you may also lose your work.
* **Always** periodically save your work.

## Submitting assignments

* Your assignments will consist of notebooks in this environment.
* For assignments, you will click on the "Submit" button.

## Test:
* Find out whether "code" appears in Macbeth
* Edit the markdown code in this cell and:
  * Enter any relevant words here: 
  * Enter the number of occurrences here:
* Save the notebook
* Click on the Submit button now to see what it does.

# End