# In this notebook:

1. **Lexical Dispersion Plot** - where in the corpus a word appears
2. Plotting **Frequency Over Time**
3. **Collocations** of words - when they appear frequently near each other


# 1. Lexical Dispersion Plot - where in the corpus a word appears


#### Questions & Objectives:

- How can I measure how frequently a word appears across the parts of a corpus?
- How can I plot the occurrences of a word and how many words from the beginning of the corpus it appears?
- We will use the US Presidential Inaugural Addresses, which are provided with NLTK.

#### Key Points

- Lexical dispersion is a visualisation that allows us to see where a particular term appears across a document or set of documents
- We use NLTK’s `dispersion_plot`.

In [None]:
# Run this cell now. It contains the usual imports for the text mining libraries.

import nltk
import numpy
import string
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
nltk.download('punkt')

We can plot lexical dispersion of particular tokens.

**Lexical dispersion is a measure of how frequently a word appears across the parts of a corpus**. 

The plot marks each occurrence of a word at its word offset, i.e. the number of words from the beginning of the corpus at which it appears. This is particularly useful for a corpus that covers a long time period, when you want to analyse how specific terms were used more or less frequently over time.
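
To make the idea of word offsets concrete, here is a minimal sketch on a made-up token list (not taken from the inaugural corpus). The helper `word_offsets` is a hypothetical name used only for illustration; it is not part of NLTK.

```python
# Toy token list, invented for illustration only
tokens = ["We", "the", "people", "seek", "peace", ",", "and", "peace", "needs", "work"]

def word_offsets(tokens, target, ignore_case=True):
    """Return the 0-based positions in tokens at which target occurs."""
    if ignore_case:
        target = target.lower()
        return [i for i, tok in enumerate(tokens) if tok.lower() == target]
    return [i for i, tok in enumerate(tokens) if tok == target]

peace_offsets = word_offsets(tokens, "peace")
print(peace_offsets)  # [4, 7]
```

A dispersion plot draws one tick per offset like these, with one row per target word.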

To create a lexical dispersion plot, you will first download and import a different corpus: the inaugural corpus, which contains all US Presidential Inaugural Addresses (1789-present) and is provided with NLTK.

Many libraries you will use (for text mining, visualisation, etc.) come with built-in data sets for you to practise on. They are nice this way.

In [None]:
nltk.download('inaugural')
from nltk.corpus import inaugural
from nltk.text import Text

inaugural_tokens = inaugural.words()
inaugural_texts = Text(inaugural_tokens)

To create the lexical dispersion plot for this corpus you also need to import `dispersion_plot` from the `nltk.draw.dispersion` module.

You can then call the `dispersion_plot` function with a set of parameters: the target words you want to plot across the corpus, whether matching should ignore case, and the title of the plot.

In [None]:
from nltk.draw.dispersion import dispersion_plot

# the following command can be used to increase the size of the plot using width and height specifications
plt.figure(figsize=(12, 9))
targets=['great','good','tax','work','change']

dispersion_plot(inaugural_texts, targets, ignore_case=True, title='Lexical Dispersion Plot')

### 🖇🐛Buddy task: What words might have been used only in some time periods?

- Adjust the above code to include other words. Remember these are inaugural speeches of US presidents from 1789 to the present. What words might have appeared in certain periods and not others? Try words like 'war', 'peace', 'freedom', 'women', 'slavery', 'god'.

Do not spend more than 2 minutes on this. Just try some words and move on. Things will get even more interesting in a minute.

Notice that it is really annoying that we cannot see exactly which year a particular word was heavily used in. We will solve that problem in the next section.

# 2. Plotting Frequency Over Time


#### Questions & Objectives:

- How can I extract and plot the frequency of specific terms over time?
- How can I use NLTK’s ConditionalFreqDist class to extract the frequency of defined words?

#### Key Points

- We extract terms and the years from the files using NLTK’s ConditionalFreqDist class from the nltk.probability package
- We plot these on a graph to visualise how their use changes over time

#### Nested loops: a new challenging python syntax

This is a new Python syntax for loops inside of loops (nested loops), which is VERY CHALLENGING.

So do not worry if you do not get it at first (don't spend more than 2 minutes on this); just move on to the further tasks.

In [None]:
# Run this cell and then read through it. 
    
# Goal: we have a list of fruit names and a list of target letters;
# each time a fruit contains a target letter, return the (fruit, letter) pair
# e.g. because 'pear' contains 'a' and 'p', return [('pear', 'p'), ('pear', 'a')]

fruits = ['pear', "banana", "kiwi", 'apple' ]
targets = ['a', 'p', 'w']

new_words = [(fruit, target)
            for fruit in fruits
            for letter in fruit
            for target in targets
            if letter == target
            ]
print(new_words)

# if this syntax is not clear, ask your buddy 🖇, but even if it is not super clear,
# you'll be fine, just continue
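
If the compact form above is hard to read, it may help to see the same logic written as ordinary nested `for` loops. This is a sketch of the equivalent long form; the variable name `pairs` is ours and only used for illustration.

```python
fruits = ['pear', 'banana', 'kiwi', 'apple']
targets = ['a', 'p', 'w']

# Same logic as the comprehension, written out one loop per line
pairs = []
for fruit in fruits:              # outermost loop: each fruit name
    for letter in fruit:          # then each letter of that fruit
        for target in targets:    # then each target letter
            if letter == target:  # keep the pair only when they match
                pairs.append((fruit, target))

print(pairs)
```

The loops in a comprehension are read top to bottom in the same order as these nested loops, and a letter that occurs several times in a fruit (like `a` in `banana`) produces one pair per occurrence.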

## How to take meta-information from files to understand corpus better

Similarly to lexical dispersion, you can also plot the frequency of terms over time. This is similar to the Google n-gram visualisation for the Google Books corpus, but we will show you how to do something similar for your own corpus.

You first need to import NLTK’s ConditionalFreqDist class from the nltk.probability package. To generate the graph, you have to specify the list of words to be plotted (see targets) and the x-axis labels (in this case the year the inaugural was held which appears at the start of each file: fileid[:4]).

Note: file names are in the format `1789-Washington.txt`, `1801-Jefferson.txt`, so the first 4 characters give the year the speech was given.

The data for the plot needs to be in a format where each word is repeated for each year as many times as it was used that year, e.g. `freedom` was used 4 times in 1801 and twice in 1805:

```
[('freedom', '1801'),
 ('freedom', '1801'),
 ('freedom', '1801'),
 ('freedom', '1801'),
 ('freedom', '1805'),
 ('freedom', '1805'),
 ('freedom', '1809'),
...
```

This dataset is created as follows:

- return a tuple with a word and the year of the speech `(target, fileid[:4])`
- for each **filename** (fileid) from the speeches set: `for fileid in inaugural.fileids()`
- then for each **word** in that file `for word in inaugural.words(fileid)`
- then for each **target** word in our specified target words `for target in targets`
- use that word **only if** the word starts with the target `if word.lower().startswith(target)`
    
```
[(target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if word.lower().startswith(target)]
```
    

The ConditionalFreqDist object (cfd) stores the number of times each of the target words appears in each of the speeches, and the plot() method is used to draw the graph.
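
To see what "stores the number of times" means, here is a rough standard-library analogue of that counting step. It is only a sketch of the idea, not NLTK's implementation, and the pairs are made up to match the `freedom` example above; `counts` is a hypothetical name.

```python
from collections import Counter, defaultdict

# Made-up (word, year) pairs in the same shape as the data described above
pairs = (
    [('freedom', '1801')] * 4
    + [('freedom', '1805')] * 2
    + [('tax', '1801')]
)

# One Counter of years per word: word -> {year: count}
counts = defaultdict(Counter)
for word, year in pairs:
    counts[word][year] += 1

print(counts['freedom']['1801'])  # 4
print(counts['freedom']['1805'])  # 2
```

With the real object, `cfd['freedom']['1801']` is read the same way: the first key is the word (the condition) and the second is the year.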

In [None]:
from nltk.probability import ConditionalFreqDist

# set the default figure size (width, height)
plt.rcParams["figure.figsize"] = (12, 9)

targets=['great','good','tax','work','change']

cfd = ConditionalFreqDist(
    [(target, fileid[:4])
    for fileid in inaugural.fileids()
    for word in inaugural.words(fileid)
    for target in targets
    if word.lower().startswith(target)])

cfd.plot()

### 🐛Minitask: 

- Change the words in the above graph. Use the words you discussed with your buddy above.

- Try to use regular expressions instead of specific words (see hints below).

e.g. if you wanted to compare occurrences of

- words `man & men`
- word `freedom`
- any other words that start with `free` you could use targets:

`targets=['^m[ea]n$', '^freedom$', '^free']`

and instead of 

`if word.lower().startswith(target)])`

use

`if re.search(target, word.lower())])`
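
Before changing the graph code, it can help to check what these patterns match on a handful of words. The word list below is made up for illustration and `matches` is a hypothetical name.

```python
import re

targets = ['^m[ea]n$', '^freedom$', '^free']
words = ['Man', 'men', 'mankind', 'freedom', 'freely', 'carefree']

# Same filter as in the hint, applied to a tiny word list
matches = [(target, word)
           for word in words
           for target in targets
           if re.search(target, word.lower())]

for target, word in matches:
    print(target, '->', word)
```

Note that `freedom` is matched by both `^freedom$` and `^free`, so in the plot it would be counted under both lines. `mankind` and `carefree` match nothing, because `$` anchors the end of the word and `^` anchors the start.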

In [None]:
# copy and paste the graph code into this cell and write your answer here
import re




# 3. Collocations


#### Questions & Objectives:

- How can I see what terms are often used together in a text or corpus?
- We want to see words that collocate, occur together more often than by chance.
- We will see what words co-occur within five words of each other.
- We will then see which words appear more than ten times together.
- We will then look at a measure to score the likelihood of these collocations being unusual.

#### Key Points

- We will use NLTK’s `BigramAssocMeasures()` and `BigramCollocationFinder` to find the words commonly found together in the US Presidential Inaugural Addresses set.
- We will score these collocations using `bigram_measures.likelihood_ratio`

We may want to see what terms are often used together. We can do this by looking for collocations in a text, i.e. two word tokens occurring together in the text more often than would be expected by chance.

For this we need to import two classes from the nltk.collocations module: BigramAssocMeasures and BigramCollocationFinder. We set a window size of 5, so two words count as occurring together when they fall within the same 5-token window.

In [None]:
from nltk.collocations import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(inaugural_tokens, 5)
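
To get a feel for what a window does, here is a small standard-library sketch that counts ordered word pairs falling inside a sliding window. It illustrates the idea only and is not NLTK's code; `window_pairs` is a hypothetical helper, and a window of 3 is used to keep the output short.

```python
from collections import Counter

def window_pairs(tokens, window_size=5):
    """Count (left, right) pairs where right follows left within the window."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # a window of size n covers the word itself plus the next n - 1 words
        for right in tokens[i + 1 : i + window_size]:
            pairs[(left, right)] += 1
    return pairs

tokens = ["fellow", "citizens", "of", "the", "united", "states"]
pairs = window_pairs(tokens, window_size=3)

print(pairs[("fellow", "of")])     # 1: 'of' is within 3 tokens of 'fellow'
print(("fellow", "the") in pairs)  # False: 'the' is too far away
```

With a window of 2 only adjacent words would be paired; a larger window also picks up words that are near each other but separated by other tokens.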

We then look for words that appear together 10 times or more.

In [None]:
finder.apply_freq_filter(10)

A number of measures are available to score collocations or other associations, including `bigram_measures.likelihood_ratio`. We apply this measure below and show the top ten collocated tokens (occurring in a window of 5 tokens with a frequency of 10 or more).

In [None]:
finder.nbest(bigram_measures.likelihood_ratio, 10)

### 🐛Minitask:  Re-do the collocation analysis after removing stopwords, punctuation, etc

Change the code below to display collocations in the inaugural speeches with these extras:

- with all tokens in `inaugural_tokens` lowercased
- after removing stopwords, punctuation and single digits

Refer back to the previous notebook for help.
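
As a hint, the filtering step might look something like the sketch below on a toy token list. The stopword set here is a tiny hand-picked one, assumed for illustration only; in the task you would use the stopword list from the previous notebook instead.

```python
import string

# Tiny hand-picked stopword set, for illustration only (an assumption)
toy_stopwords = {"the", "of", "and", "to", "a"}

# Made-up tokens, not taken from the corpus
tokens = ["The", "Government", "of", "the", "people", ",", "1", "and", "the", "Union", "."]

filtered = [
    tok.lower()
    for tok in tokens
    if tok.lower() not in toy_stopwords                      # drop stopwords
    and not all(ch in string.punctuation for ch in tok)      # drop punctuation tokens
    and not tok.isdigit()                                    # drop digit-only tokens
]

print(filtered)  # ['government', 'people', 'union']
```

Note that `isdigit()` removes every digit-only token, not just single digits, so adjust the condition if you want to keep longer numbers such as years.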

In [None]:
inaugural_tokens = inaugural.words()

# HERE you will want to filter inaugural_tokens to not contain stopwords, punctuation, etc 



bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(inaugural_tokens, 5)
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.likelihood_ratio, 10)