# Jupyter Notebook & Python Demo

- This Notebook file can be run directly on your laptop:
  - Method 1: Launch Anaconda Navigator, then click the Jupyter Notebook icon.
  - Method 2: In a command-line environment, `cd` (change directory) into the workshop repo, then type `jupyter notebook`.

## Getting around in Jupyter Notebook

- Click `+` to create a new cell and ► to run the current cell (also: `Ctrl+ENTER`)
- Choose the appropriate cell type (Code or Markdown)
- `Alt+ENTER` runs the cell and creates a new cell below
- `Shift+ENTER` runs the cell and moves to the next cell
- More on [this page](https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/)

In [None]:
print('Hello, world!')    # printing a string

In [None]:
'Hello, world!'           # returning a string

In [None]:
len('Hello, world!')

In [None]:
import nltk

In [None]:
nltk.word_tokenize('Hello, world!')

In [None]:
words = nltk.word_tokenize('Hello, world!')
words

In [None]:
len(words)

In [None]:
sent = "You haven't seen Star Wars...?"
nltk.word_tokenize(sent)

In [None]:
from matplotlib import pyplot as plt

In [None]:
# number of speakers, in millions
plt.bar(['Zulu', 'German', 'Polish'], [10, 100, 45])
plt.show()

## Why Jupyter Notebook
Jupyter Notebook (JNB) lets us weave together four essential components into a single document:
  1. Python code
  2. code output (`print`, returned values, etc.)
  3. visualization through in-line plots
  4. narration and documentation as Markdown cells


Additionally, it is good practice to make a point of **showing the "data"**:
  5. Data itself: snippets, examples
  6. The process by which data gets cleaned and transformed

### Benefits
- The resulting Notebook document presents a **complete picture of "code as research"**
- The Notebook document is live code that can be run by anyone: easy **reproducibility**
- **Shareability**: [GitHub](https://www.github.com) and other online platforms not only host but also render Notebook documents in an easily readable and browsable form


## Processing the Gettysburg Address
- Let's process Abraham Lincoln's Gettysburg Address, which is already in the `data` directory

In [None]:
gfile = 'data/1863-Gettysburg Address.txt'
with open(gfile) as f:    # the with-block closes the file automatically
    gtxt = f.read()
gtxt

In [None]:
print(gtxt)

In [None]:
len(gtxt)

In [None]:
# JNB by default has "pretty printing" turned on, which prints list items on separate lines.
# The %pprint magic toggles it off (run it again to turn it back on).
%pprint

### Type vs. token, TTR
- *Tokens* are individual instances of linguistic units. 
- *Types* are unique classes found in the tokens. 
- *TTR* ("type-token ratio") is a measure of vocabulary richness (with a huge caveat)
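
To make the three definitions concrete, here is a minimal sketch on a toy sentence. It uses plain whitespace splitting instead of `nltk.word_tokenize` so that it runs without any extra setup; the sentence and the `toy_` names are made up for illustration. It also hints at one commonly cited caveat of TTR (an assumption about which caveat is meant above): the ratio tends to fall as a text gets longer, because common words keep repeating.

```python
# Toy example: whitespace splitting stands in for nltk.word_tokenize
# (an assumption, to keep the sketch dependency-free).
toy_text = "the cat saw the dog and the dog saw the cat"

toy_tokens = toy_text.split()                  # every running word is a token
toy_types = {w.lower() for w in toy_tokens}    # distinct lowercased word forms
toy_ttr = len(toy_types) / len(toy_tokens)     # type-token ratio

print(len(toy_tokens), len(toy_types), round(toy_ttr, 3))

# Length sensitivity: the first five tokens alone give a higher TTR
# than the full sentence, since fewer words have had a chance to repeat.
first_five = toy_tokens[:5]
short_ttr = len({w.lower() for w in first_five}) / len(first_five)
print(round(short_ttr, 3))
```

Because of this length effect, comparing raw TTR values across texts of very different sizes can be misleading.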

In [None]:
gtoks = nltk.word_tokenize(gtxt)
gtoks

In [None]:
len(gtoks)

In [None]:
# list comprehension: returns a new list where each item is transformed (here, lowercased);
# set() then removes the duplicates
gtypes = set([w.lower() for w in gtoks])
gtypes

In [None]:
len(gtypes)

In [None]:
gttr = len(gtypes)/len(gtoks)
gttr

### Average sentence length
- NLTK has a handy sentence tokenizer: `nltk.sent_tokenize()`
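
As a rough illustration of what "average sentence length" means, the sketch below computes it on a toy text with the standard library only. The regular expression is a naive stand-in for `nltk.sent_tokenize`, not how NLTK works internally, and whitespace splitting stands in for `nltk.word_tokenize`; the text and the `toy_` names are made up for illustration.

```python
import re

toy_text = "Hello, world! I come in peace. Take me to your leader."

# Naive sentence splitter (an assumption, not NLTK's algorithm):
# break wherever ., ! or ? is followed by whitespace.
toy_sents = re.split(r'(?<=[.!?])\s+', toy_text.strip())

# Whitespace tokens; unlike nltk.word_tokenize, punctuation stays attached.
toy_words = toy_text.split()

toy_avg_len = len(toy_words) / len(toy_sents)
print(toy_sents)
print(round(toy_avg_len, 2))
```

A splitter this simple breaks on abbreviations such as "Dr." or "U.S.", which is one reason to prefer a trained sentence tokenizer on real text.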

In [None]:
# sentence tokenization
nltk.sent_tokenize("Hello, world! I come in peace.")

In [None]:
gsents = nltk.sent_tokenize(gtxt)
gsents

In [None]:
gsents[0]

In [None]:
gsents[-1]

In [None]:
len(gsents)

In [None]:
# Average sentence length
gsentlen = len(gtoks)/len(gsents)
gsentlen

### Word frequency 
- `nltk.FreqDist()` builds a frequency distribution. Pass tokenized words. 
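
For intuition, the sketch below mimics the idea with `collections.Counter` from the standard library, which `nltk.FreqDist` is built on. The token list and names are made up for illustration; relative frequency is simply a word's count divided by the total number of tokens, which is what `.freq()` reports.

```python
from collections import Counter

# Hypothetical token list, for illustration only.
toy_tokens = ['the', 'cat', 'saw', 'the', 'dog', 'and', 'the', 'bird']

toy_counts = Counter(toy_tokens)          # word -> raw count
toy_total = sum(toy_counts.values())      # total number of tokens

print(toy_counts.most_common(1))          # the single most frequent word and its count
print(toy_counts['the'])                  # raw count of 'the'
print(toy_counts['the'] / toy_total)      # relative frequency of 'the'
```

Note that counting is case-sensitive: 'The' and 'the' are tallied separately unless the tokens are lowercased first.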

In [None]:
gfd = nltk.FreqDist(gtoks)
gfd.most_common()

In [None]:
gfd.most_common(10)

In [None]:
gfd['the']

In [None]:
# relative frequency
gfd.freq('the')

In [None]:
top_words = ['that', 'the', 'to', 'we', 'here', 'a', 'and', 'nation', 'of']
top_gfreq = [gfd.freq(w) for w in top_words]
top_gfreq

In [None]:
plt.plot(top_words, top_gfreq)
plt.show()

### Summary: the Gettysburg Address 
- 309 word tokens and 141 word types
- TTR: 0.4563
- 10 sentences
- Average sentence length: 30.9 words per sentence
- Top words include 'that', 'the', 'to', 'we', 'here', 'a', etc. 

NB: punctuation and symbols were included in the token count and the average sentence length. 
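
If you would rather count words only, one possible approach (a sketch, not what was done above) is to drop tokens that contain no letter or digit before counting. The token list below is made up and merely shaped like `nltk.word_tokenize` output.

```python
# Hypothetical token list, shaped like nltk.word_tokenize output.
toy_tokens = ['Hello', ',', 'world', '!', 'I', 'come', 'in', 'peace', '.']

# Keep only tokens that contain at least one letter or digit.
toy_word_tokens = [t for t in toy_tokens if any(ch.isalnum() for ch in t)]

print(len(toy_tokens), len(toy_word_tokens))
print(toy_word_tokens)
```

With this filter, contractions split by the tokenizer (such as "n't") are still kept, since they contain letters.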

## Your turn: "I Have A Dream" by Martin Luther King Jr. 
- Longer or shorter than the Gettysburg Address?
- TTR?
- Average sentence length: longer or shorter?
- Top words and their frequencies: difference?

In [None]:
kfile = 'data/1963-I Have a Dream.txt'
with open(kfile) as f:    # the with-block closes the file automatically
    ktxt = f.read()
ktxt[:500]  # first 500 characters

In [None]:
ktxt[-500:]  # last 500 characters  

In [None]:
# (1) build a list of word tokens
# (2) build a set of word types
#   from (1) & (2), compute TTR
# (3) build a list of tokenized sentences
#   from (1) and (3), compute average sentence length
# (4) build a word frequency distribution, from (1) 

In [None]:
# 'b-': blue solid line; 'g-': green solid line
# build top_kfreq first, the same way top_gfreq was built!
plt.plot(top_words, top_gfreq, 'b-', top_words, top_kfreq, 'g-')
plt.title('word types and frequencies')
plt.xlabel('Lincoln (blue) vs. King (green)') 
plt.ylabel('relative frequency')
plt.show()