<br>
<img style="float:left" src="http://ipython.org/_static/IPy_header.png" />
<br>

# Working with a corpus of Malcolm Fraser's speeches

In [None]:
import nltk
nltk.download()

In [None]:
from __future__ import print_function, division #just some libraries and packages to get us started.
import sys
from nltk.book import * #some example data that comes with the nltk book
from IPython.display import display, clear_output
sys.path.append("/usr/lib/python2.7/site-packages/")
%matplotlib inline

## A Text Mining Analysis of Academic Libraries' Tweets
_**By Sultan M. Al-Daihani, Alan Abrahams**_ 

This study applied a text mining approach to a dataset of tweets by ten academic libraries.

## Malcolm Fraser and his speeches

So, we are going to be working with a corpus of speeches made by Malcolm Fraser. 

In [None]:
# this code allows us to display images and webpages in our notebook
from IPython.display import display
from IPython.display import display_pretty, display_html, display_jpeg, display_png, display_svg
from IPython.display import Image
from IPython.display import HTML
import nltk

In [None]:
HTML('<iframe src=http://en.wikipedia.org/wiki/Malcolm_Fraser width=950 height=350></iframe>')

In [None]:
HTML('<iframe src=http://www.unimelb.edu.au/malcolmfraser/ width=950 height=350></iframe>')

## Cleaning the corpus

Cleaning the corpus helps us to:

1. Avoid breaking the code with unexpected input
2. Ensure that searches match as many examples as possible
3. Increase readability and the accuracy of taggers, stemmers, parsers, etc.
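
As a rough sketch of what cleaning can involve, the snippet below tidies a made-up line of text so that searches are more likely to match. The function name `normalise` and the sample string are assumptions for illustration only; they are not part of the Fraser corpus or of NLTK.

```python
import re

def normalise(raw):
    """Return a tidied copy of raw: plain quotes, single spaces, no edge whitespace."""
    # replace curly quotes with plain ones so a search for ' or " matches both styles
    tidy = raw.replace('\u2019', "'").replace('\u2018', "'")
    tidy = tidy.replace('\u201c', '"').replace('\u201d', '"')
    # collapse tabs, newlines and repeated spaces into one space
    tidy = re.sub(r'\s+', ' ', tidy)
    return tidy.strip()

# a made-up sample with stray whitespace and a curly apostrophe
sample = '  The  Government\u2019s\tpolicy \n on wool  '
clean = normalise(sample)
print(clean)
# lowercase a copy when searching, so that 'Wool' and 'wool' both match
print('wool' in clean.lower())
```

Which steps are worth applying depends on the corpus, so it is a good idea to inspect a few files before deciding.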

## Exploring the corpus

First of all, let's load in our text.

Download the txt files available at this URL: http://archives.unimelb.edu.au/malcolmfraser/explore/radiotalks

Via file management, upload the zip folder we downloaded from the University of Melbourne website into the same folder that you are running the notebook in.

We can also look at file contents within the Jupyter Notebook itself.

Importing os allows us to open, read, write and do other things with folders and files.

In [None]:
import os

In [None]:
# import tokenizers
from nltk import word_tokenize
from nltk.text import Text

In [None]:
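import zipfile
# extract into the folder the notebook is running in, so that the relative
# path 'UMA_Fraser_Radio_Talks' used below can be found
# (this assumes the archive unpacks to a folder called 'UMA_Fraser_Radio_Talks')
with zipfile.ZipFile('UMA_Fraser_Radio_Talks.zip', 'r') as zip_ref:
    zip_ref.extractall('.')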

In [None]:
# make a list of files in the directory 'UMA_Fraser_Radio_Talks'
files = os.listdir('UMA_Fraser_Radio_Talks')
print(files[0:3]) #note that Python starts counting at 0

In [None]:
corpus_path = 'UMA_Fraser_Radio_Talks'

In [None]:
# print the contents of a file
# change the index 1 to another number to print a different file
# "r" opens the file in read mode; encoding tells Python how to decode the bytes
# we use a new name (f) so that the list called 'files' is not overwritten
with open(os.path.join(corpus_path, files[1]), "r", encoding="latin") as f:
    text = f.read()
print(text)

### Exploring further: splitting up text

In [None]:
# split the file we read in above into two parts
# split on the characters <!--end metadata-->
data = text.split("<!--end metadata-->")

In [None]:
# view the first part. Note that Python starts counting from 0
print(data[0])
# if you want to view the second half of the split, i.e. the data, not the metadata, enter data[1]

In [None]:
# split into lines, add '*' to the start of each line
# \n is a newline character
for line in data[0].split('\n'):
    print('*', line)

In [None]:
# skip empty lines and any line that starts with '<'
for line in data[0].split('\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    print('*', line)

In [None]:
# split the metadata items on ':' so that we can interrogate each one
for line in data[0].split('\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':')
    print('*', element)

What's the problem here?? ^^^^ Look at the 'Collection URI'

In [None]:
# actually, only split on the first colon
for line in data[0].split('\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':', 1)
    print('*', element)
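
To see why the second argument to `split` matters, here is a small sketch using a made-up metadata line. The URL is a placeholder, not a real value from the corpus.

```python
# a made-up metadata line whose value itself contains a colon
line = 'Collection URI: http://example.org/collection'

# splitting on every colon breaks the URL apart
print(line.split(':'))

# a maxsplit of 1 splits only at the first colon, keeping the value whole
key, value = line.split(':', 1)
print(key)
print(value.strip())
```

Splitting only once leaves exactly two parts, a category and its value, however many colons the value contains.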

### **Challenge**: Building a Dictionary

We've already worked with strings (a 'string' of characters) and lists (lists of words, numbers etc...). Another kind of data structure in Python is a *dictionary*.

Here is how a simple dictionary works:

In [None]:
# create a dictionary
commonwords = {'the': 4023, 'of': 3809, 'a': 3098}
# search the dictionary for 'of'
commonwords['of']

In [None]:
type(commonwords)

In [None]:
type(text) #the files we read into python and printed to the screen as a big block of text

In [None]:
type(data) #the variable we used to split the text into metadata and data

Dictionaries are a great way to work with the metadata in our corpus. Let's build a dictionary called *metadata*:

Your first line will look like this:

      metadata = {}

In [None]:
metadata = {} # create an empty dictionary to populate
for line in data[0].split('\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':', 1) # This code is the same as earlier
    metadata[element[0]] = element[-1] #This is new! It associates the metadata category with the value next to it
print(metadata) 

In [None]:
# now our metadata categories are searchable
# look up the Date
print(metadata['Date']) # you can change this to 'Title', etc...

### Building functions

**Challenge**: define a function that creates a dictionary of the metadata for each file and gets rid of the whitespace at the start of each element

**Hint**: to get rid of the whitespace use the *.strip()* method.

In [None]:
# open the first file, read it and then split it into two parts, metadata and body
with open(os.path.join(corpus_path, 'UDS2013680-1-full.txt'), 'r', encoding='latin') as f:
    data = f.read().split("<!--end metadata-->")

In [None]:
def parse_metadata(text): # def is the syntax for creating our own reusable code, called functions
    metadata = {}
    for line in text.split('\n'):
        if not line:
            continue
        if line[0] == '<':
            continue
        element = line.split(':', 1)
        # same as our first dictionary, except that .strip() removes the whitespace around the value
        metadata[element[0]] = element[-1].strip()
    return metadata 

Test it out! 

In [None]:
parse_metadata(data[0])

## Conditional Frequency Distributions

In [None]:
#import conditional frequency distribution
from nltk.probability import ConditionalFreqDist
import matplotlib
%matplotlib inline

In [None]:
cfdist = ConditionalFreqDist()
for filename in os.listdir(corpus_path):
    text = open(os.path.join(corpus_path, filename), 'r', encoding='latin').read()
    #split text of file on 'end metadata'
    text = text.split("<!--end metadata-->")
    #parse metadata using previously defined function "parse_metadata"
    metadata = parse_metadata(text[0])
    #skip all speeches for which there is no exact date
    if metadata['Date'][0] == 'c': # if you change [0] to [1] you get double years
        continue
    #build a frequency distribution graph by year, that is, take the final bit of the 'Date' string after '/'
    cfdist['count'][metadata['Date'].split('/')[-1]] += 1
cfdist.plot()

Now let's build another graph, but this time by the 'Description' field:

In [None]:
cfdist2 = ConditionalFreqDist()
for filename in os.listdir(corpus_path):
    text = open(os.path.join(corpus_path, filename), 'r', encoding='latin').read()
    text = text.split("<!--end metadata-->")
    metadata = parse_metadata(text[0])
    if metadata['Date'][0] == 'c':
        continue
    cfdist2['count'][metadata['Description']] += 1
cfdist2.plot()

We've got messy data! What's the lesson here?
<br>

**Bonus challenge**: Build a frequency distribution graph that includes speeches without an exact date.
Hint: you'll need to tell Python to ignore the 'c' and just take the digits

In [None]:
cfdist3 = ConditionalFreqDist()
for filename in os.listdir(corpus_path):
    text = open(os.path.join(corpus_path, filename), 'r', encoding='latin').read()
    text = text.split("<!--end metadata-->")
    metadata = parse_metadata(text[0])
    date = metadata['Date']
    if date[0] == 'c':
        year = date[1:]
    else:
        year = date.split('/')[-1]
    cfdist3['count'][year] += 1
cfdist3.plot()

### Ordering our data

The way in which you organise your data will affect the ways in which you can interrogate it. Because our data samples span a long stretch of time, it might be interesting to investigate the ways in which Malcolm Fraser's language changes over time. 

#### Regular expressions
Regular expressions are a powerful means of searching for patterns in data. In this case, we're going to construct a regular expression to find the year of each speech. 
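
Before applying a pattern to the whole corpus, it can help to try it on a few strings. The sketch below uses the pattern `19[0-9]{2}` ('19' followed by exactly two digits); the date strings are made up for illustration.

```python
import re

# '19' followed by exactly two digits, e.g. 1975
year_pattern = re.compile(r'19[0-9]{2}')

# hypothetical date strings: one exact, one approximate, one with no year
for date in ['12/03/1975', 'c1981', 'undated']:
    match = year_pattern.search(date)
    # search() returns None when nothing matches, so check before calling group()
    year = match.group() if match else None
    print(date, '->', year)
```

Because `search` looks anywhere in the string, the same pattern handles both exact dates and approximate ones that start with 'c'.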

In [None]:
import re
import os
# a path to our soon-to-be organised corpus
newpath = '../corpora/fraser-year'
# exist_ok=True stops this cell from failing if it is run a second time
os.makedirs(newpath, exist_ok=True)
files = os.listdir(corpus_path)
# define a regex to match the year portion of the date
yearfinder = re.compile('19[0-9]{2}')
for filename in files:
    # split file contents at end of metadata
    with open(os.path.join(corpus_path, filename), 'r', encoding='latin') as text:
        data = text.read().split("<!--end metadata-->")
    # use our metadata parser to get the metadata from data[0]
    metadata = parse_metadata(data[0])
    # look up date field of dict entry
    date = metadata.get('Date')
    # search date for year
    yearmatch = yearfinder.search(str(date))
    # skip any file whose date does not contain a year
    if yearmatch is None:
        continue
    # get the year as a string
    year = yearmatch.group()
    # make a directory with the year name, unless it already exists
    os.makedirs(os.path.join(newpath, year), exist_ok=True)
    # make a new file with the same name as the old one in the new dir
    # and write the content portion, without metadata
    with open(os.path.join(newpath, year, filename), "w") as fo:
        fo.write(data[1])

Did it work? How can we check?

In [None]:
print(os.listdir(newpath))
print(os.listdir(newpath + '/1981'))

## Using NLTK to analyse the Fraser Corpus

Python regards a text file as a single long string of characters. The first thing to do is to start breaking the text up into sentences and words.

In [None]:
from nltk import word_tokenize
speech = open('../corpora/fraser-year/1975/UDS2013680-678-full.txt', "r").read() 
tokens = word_tokenize(speech)
print(tokens[:100])

We can do some more interesting linguistic analysis if we use Part of Speech tagging. 

In [None]:
sentence = "They refuse to permit us the refuse permit"
words = word_tokenize(sentence)
tagged = nltk.pos_tag(words, tagset='universal')
print(tagged)

In [None]:
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged)
tag_fd.most_common()

### Challenge!
Use Part of Speech tagging to tag the speech that we have just tokenised, then do the following:
* Find the most common parts of speech
* Find the most common verbs and create a frequency distribution graph of your result
* Find the 10 most common nouns in the speech

*Hint: to find the most common verbs and nouns, you will need to create a list that contains only the verbs or only the nouns from the speech. Use a for loop to create your list. Then create a frequency distribution*

In [None]:
tagged_speech = nltk.pos_tag(tokens, tagset = 'universal')
speech_fd = nltk.FreqDist(tag for (word, tag) in tagged_speech)
speech_fd.most_common()

In [None]:
verblist = []
for (word, tag) in tagged_speech:
    if tag == 'VERB':
        verblist.append(word)
# Check the length of the list of verbs. 
#If it matches the number of verbs above, you can be fairly sure your loop has worked as expected
print(len(verblist))
verb_fd = nltk.FreqDist(verblist)
print(verb_fd.most_common()[:10])
verb_fd.plot(cumulative = True)

In [None]:
nounlist = []
for (word, tag) in tagged_speech:
    if tag == 'NOUN':
        nounlist.append(word)
print(nounlist[:10])
print(len(nounlist))
noun_fd = nltk.FreqDist(nounlist)
print(noun_fd.most_common()[:10])

**Extension**
There are a few things to note about this result. 'Prime' and 'Minister' have been returned as two different, equally frequent nouns. Because we're humans, not computers, we know it's likely that what we're actually seeing is 'Prime Minister'. It's also unlikely that 'North' and 'South' occur alone: perhaps Mr Fraser was talking about North and South Vietnam? We could test for bigrams (pairs of words that occur side by side) to see if this is the case. In order to perform this test, we must first convert our list of tokens into an NLTK Text. We can then use specific NLTK functions on the text.
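
To make the idea of a bigram concrete, here is a minimal sketch that counts adjacent word pairs by hand using only the standard library. The token list is made up, and this is not how NLTK finds collocations internally; it only illustrates what is being counted.

```python
from collections import Counter

# a made-up list of tokens standing in for a tokenised speech
words = ['the', 'Prime', 'Minister', 'met', 'the', 'Prime', 'Minister', 'of', 'Japan']

# pair each word with the word that follows it
pairs = list(zip(words, words[1:]))

# count how often each pair occurs
pair_counts = Counter(pairs)
print(pair_counts.most_common(2))
```

Pairs that recur, such as ('Prime', 'Minister') here, are candidates for being treated as a single unit.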

In [None]:
print(type(tokens))
speech_text = nltk.Text(tokens)
print(type(speech_text))
speech_text.collocations()

In [None]:
speech_text.concordance("wool")

### Some linguistics...

In the context of Fraser's speech, there are nearly twice as many nouns as verbs, and the verbs are generally quite simple ones (parts of To Be and To Have make up about a quarter). This suggests that Fraser's speech, even when giving a radio talk to his electorate, is more towards the formal end of the spectrum. 

## Recap
So far today we have:
* Imported text into NLTK
* Used functions and loops to investigate metadata and organise our corpus
* Tokenised raw text into words
* Tagged words as parts of speech
* Converted a list into NLTK Text for further analysis

## Stopwords
In most texts you'll notice that a lot of space is taken up by little words like 'and' and 'of' and 'the' which don't add a lot to our understanding of text. These are called 'stop words'. It will help our analysis if we exclude them.
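
The effect of removing stop words can be sketched with a tiny example. The stop list and the word list below are made up for illustration; NLTK's English stop word list is much longer.

```python
from collections import Counter

# a tiny hand-made stop list, for illustration only
stop_words = {'the', 'of', 'and', 'a'}
words = ['the', 'price', 'of', 'wool', 'and', 'the', 'price', 'of', 'wheat']

# keep only the words that are not in the stop list
content_words = [w for w in words if w not in stop_words]

print(Counter(words).most_common(2))
print(Counter(content_words).most_common(2))
```

With the stop words gone, the most frequent items are the words that carry the topic of the text.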

In [None]:
fdist1 = nltk.FreqDist(tokens) #this is a frequency distribution of all the words in the corpus. It is not conditional
fdist1.most_common()[:20]

In [None]:
print(len(speech_text))
print(len(set(speech_text)))

In [None]:
#First let's get rid of the punctuation
speech = [word for word in speech_text if word.isalpha()]
print(len(speech))
#Then get rid of capitals
vocab = [word.lower() for word in speech]
print(len(set(vocab)))

In [None]:
from nltk.corpus import stopwords
#Create a variable that contains all the stopwords in the NLTK corpus
#(a set makes the 'not in' check below much faster than a list)
ignored_words = set(stopwords.words('english'))
unstopped = [word for word in vocab if word not in ignored_words]
fdist2 = nltk.FreqDist(unstopped)
fdist2.most_common()[:20]

*Note: We could have condensed these steps into a single line of code that looked like this:*

        unstopped = [word.lower() for word in speech_text if word.isalpha() and word.lower() not in stopwords.words('english')]

## Collocation
First, let's look for bigrams in the whole list of tokens:

In [None]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
sorted(finder.nbest(bigram_measures.raw_freq, 10))

That doesn't tell us much. Let's try again with 'unstopped', our list of tokens with the punctuation and stopwords removed.

In [None]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(unstopped)
sorted(finder.nbest(bigram_measures.raw_freq, 10))

### N-grams 

In [None]:
print(sent2)

In [None]:
from nltk.util import ngrams
trigrams = ngrams(sent2, 3)
for gram in trigrams:
    print(gram)

There are a lot of trigrams in the sentence, and they don't tell us much. It's when n-grams are repeated that they start to get interesting.

In [None]:
from collections import defaultdict
#this will let us find duplicates in our list of n-grams

In [None]:
#Define a function that will find duplicate items within a list (i.e. duplicate n-grams within a text)
def list_duplicates(seq, threshold = 1): #with threshold = 1, anything that occurs more than once is returned
    tally = defaultdict(list) 
    for i, item in enumerate(seq): #enumerate yields each item together with its position in seq
        tally[item].append(i) #record every position at which the item occurs
    return ((len(locs), key) for key, locs in tally.items() if len(locs) > threshold)

Note that `defaultdict()` means that if a key is not found in the dictionary, then instead of a KeyError being thrown, a new entry is created. The type of this new entry is given by the argument of defaultdict, in this case a `list`.
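
The difference from an ordinary dictionary can be seen in a short sketch. The word list here is made up for illustration.

```python
from collections import defaultdict

plain = {}
try:
    plain['wool'].append(0)  # the key does not exist yet
except KeyError:
    print('a plain dict raises KeyError')

# with defaultdict(list), a missing key starts out as an empty list
positions = defaultdict(list)
for i, word in enumerate(['wool', 'prices', 'wool']):
    positions[word].append(i)

print(dict(positions))
```

Each key ends up mapped to the list of positions at which it occurred, so the length of that list is the number of occurrences.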

In [None]:
#Define a function that will find n-grams that occur more than `threshold` times (more than 4 times by default)
def ngrammer(text, gramsize, threshold = 4): #give multiple arguments to the function
    def list_duplicates(seq): #you can define functions within functions
        tally = defaultdict(list)
        for i, item in enumerate(seq):
            tally[item].append(i)
        return ((len(locs), key) for key, locs in tally.items() if len(locs) > threshold) #keep n-grams seen more than threshold times
    raw_grams = ngrams(text, gramsize) 
    dupes = list_duplicates(raw_grams)
    return sorted(dupes, reverse = True) #show dupes from highest to lowest, hence reverse=True

In [None]:
sense = [word for word in text2 if word.isalpha()] #remove non-alphabetical characters from Sense and Sensibility
ngrammer(sense, 3, threshold = 20) #test our ngrammer function on Sense and Sensibility

In [None]:
ngrammer(tokens, 3, threshold = 2) #Test function on Fraser data
#try changing the size and threshold