## Language analysis using Python and spaCy

We will use the Python module spaCy to analyse the [Malcolm Fraser Electorate Radio Talks](https://archives.unimelb.edu.au/explore/collections/malcolmfraser/explore/radiotalks) archives. On the way we will learn a couple of other tricks to make our lives easier.


```
wget https://archives.unimelb.edu.au/__data/assets/text_file/0006/1717746/UMA_Fraser_Radio_Talks.zip
pip3 install spacy
python -m spacy download en_core_web_sm
```

That gets us ready to do the linguistic work.

But before that, we need to read the data in and clean it up.

In [None]:
# Can we see the files? Yes - here are the titles. 
import os
files = os.listdir('UMA_Fraser_Radio_Talks')
print(files[:3])

These files are txt files - the file type that traditionally contains text. Great, let's look inside them.

In [None]:
# Can we read the files as text? Yes... No.
f = open(os.path.join('UMA_Fraser_Radio_Talks', files[0]), "r")
text = f.read()
print(text)

### UnicodeDecode Errors - character sets, unicode and automation

UnicodeDecode errors are common in Python 3, which changed how text is represented: strings are Unicode, so the bytes in a file must be decoded with a specific character encoding before they become text. While it's annoying - and slightly more work - the solution we implement below is relatively quick and solves the problem *regardless of the language of the documents we are researching*.


In [None]:
# Can we read the files as binary text files? Yes!
f = open(os.path.join('UMA_Fraser_Radio_Talks', files[0]), 'rb')
text = f.read()
print(text)

OK, that looks better. What was that **b** we added to the open mode? That told Python to open the file to (r)ead - but to read the file as a byte stream instead of as text<sup>1</sup>. That's handy, but why, and is it useful? 

<sup>1</sup> Text mode decodes the bytes using a default encoding that depends on your platform (often UTF-8). This is important later.

Well, we can now read the file - and we can see the section near the bottom of the file that caused the issue. That second box shows another example - although in this case it's valid ASCII.


![Hex Chars](imgs/non-text-characters_small.png "Hex Chars")


What we are seeing here is a data quality issue. This is very common, especially with text that has been scanned from PDFs. The image to text transfer will make a best guess, and in this case it's guessed an unusual character. If you take a look at the [original pdf](https://digitised-collections.unimelb.edu.au/bitstream/handle/11343/40335/312821_2007-0023-0372.pdf), you can see that this is because Mr Cain had made hand written notes on the bottom of his talk. 

The character combination **\x** is how Python displays a byte that it cannot show as a printable [ASCII](https://en.wikipedia.org/wiki/ASCII) character - the two hexadecimal digits that follow give the value of the byte. For instance \x20 is the space character, \x3a is the colon character (:) and \x41 is capital A. 

We won't go into character encoding now, but we will show how to solve the problem. The problem starts with **\x84** being an invalid ASCII code - **\x7F** (the "Delete" character) is as high as ASCII goes. If you really want to investigate further, we recommend [this primer on Unicode as a great start](https://wisdom.engineering/awesome-unicode/).
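To make the byte values above concrete, here is a minimal sketch that prints a few hex escapes and then lists every byte above the ASCII limit. The helper name `find_non_ascii` and the sample bytes are made up for illustration; they are not taken from the archive.

```python
# Each \xNN escape is one byte written as two hexadecimal digits.
print(b"\x20\x3a\x41")


def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(offset, value) for offset, value in enumerate(data) if value > 0x7F]


# Made-up sample bytes; \x84 stands in for the byte that broke the first file.
sample = b"Yours sincerely\x84 M. Fraser"

for offset, value in find_non_ascii(sample):
    print(f"byte {offset} is 0x{value:02x}, which is outside ASCII")
```

Running the same helper over the bytes of a real file, read with `'rb'`, would point at the exact positions worth inspecting.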

Why do we want to solve this - why not just import the files as "bytestreams"? The primary reason is that text manipulation in python is powerful and easy to use - but it only works on text, not bytestreams. 
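A short sketch of why text is more convenient than bytes: string arguments cannot be mixed with bytes objects, so the ordinary text tools only become available after decoding. The sample bytes are made up.

```python
raw = b"Date: 12/03/1965"  # made-up bytes, as if read from a file opened with 'rb'

try:
    raw.split(":")  # a str separator cannot be used on a bytes object
except TypeError as error:
    print("bytes and str do not mix:", error)

text = raw.decode("ascii")  # decode once, then use ordinary str methods
field, value = text.split(":", 1)
print(field, "=", value.strip())
```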

Thankfully, those that came before us have solved the problem of guessing the encoding and have written a Python module called Chardet ("character detect") to solve it. How did I know to use this software? I did a search for "python determine character set".

### Character Set Detection

Let's install chardet and use that to get an idea of the probable character encoding. This code is copied and pasted from [the Chardet documentation](https://chardet.readthedocs.io/en/latest/usage.html#example-detecting-encodings-of-multiple-files) and slightly altered.


In [None]:
import chardet
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()

for f in files[20:25]:
    filename=os.path.join('UMA_Fraser_Radio_Talks',f)
    detector.reset()
    for line in open(filename, 'rb'):
        detector.feed(line)
        if detector.done: break
    detector.close()
    print(f, ": ", detector.result)

Ideally all text files for what are known to be English-language texts would share a single character encoding, but as you can see the data isn't as clean as we would like. In fact, there are at least three detected character encodings - **Windows-1252**, **ASCII** and **ISO-8859-1**. 

We will need to deal with this on a file by file basis - but thankfully we don't actually need to know what each of those encodings represents. 
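For comparison only, here is a sketch of a different file-by-file approach that uses just the standard library instead of chardet: try a short list of candidate encodings and keep the first one that decodes cleanly. The function name `read_with_fallback` and the candidate list are assumptions for illustration, not part of this notebook's method.

```python
import os
import tempfile


def read_with_fallback(path, encodings=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo on a throwaway file that contains one Windows-1252 byte (0xE9).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"caf\xe9 talk")

text, encoding = read_with_fallback(tmp.name)
os.remove(tmp.name)
print(encoding, text)
```

The trade-off is that the order of the candidates decides the answer, whereas chardet estimates the encoding from the content itself.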

The reality is that - like with our first file - the problem is a dirty data problem. A single misread character by the Optical Character Recognition software and everything is a mess.

Let's turn the above into a function that just returns the encoding so we can use it repeatedly.

In [None]:
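def get_charset(filename):
    # imported here so the function still works if the earlier cell was skipped
    from chardet.universaldetector import UniversalDetector

    detector = UniversalDetector()
    # read the file as bytes; 'with' closes it even if detection stops early
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()

    # result is a dictionary; we only need the name of the encoding
    return detector.result['encoding']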

In [None]:
filename = os.path.join('UMA_Fraser_Radio_Talks', files[0])
encoding = get_charset(filename)
f = open(filename, 'r', encoding=encoding)
text = f.read()
print(text)

### Metadata

Now that looks much better. 

Let's keep cleaning it up. Looks like we can remove the metadata from the top. We could put it into a database or a dictionary for later use if we want. For the moment, let's just split it out.

In [None]:
data = text.split("<!--end metadata-->")
data[0]

Ok. Let's see if we can't start making some sense of what we have.

In [None]:
# split into lines, add '*' to the start of each line
# \n is a newline character
for line in data[0].split('\n'):
    print('*', line)

In [None]:
# skip empty lines and any line that starts with '<'
for line in data[0].split('\n'):
    if not line:
        continue
    if line.startswith('<'):
        continue
    print('*', line)

In [None]:
# split the metadata items on ':' so that we can interrogate each one
for line in data[0].split('\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':')
    print('*', element)


In [None]:
# actually, only split on the first colon
for line in data[0].split('\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':', 1)
    print('*', element)

Good. Let's put it into a dictionary so we can use it later. 

The point of dictionaries is to store a key (here, the name of a metadata field) and a value (the contents of that field). When you ask for the key, you get its value.

Notice that you use curly braces for dictionaries, but square brackets for lists.
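A small illustration of the difference, using made-up field names and values rather than anything read from the archive:

```python
# A list is ordered and looked up by position.
fields = ["Title", "Date", "Place"]
print(fields[1])

# A dictionary is looked up by key. These keys and values are made up.
record = {"Title": "Electorate talk", "Date": "12/03/1965"}
record["Place"] = "Hamilton"  # assigning to a new key adds it
print(record["Date"])

# .get returns a default instead of raising KeyError for a missing key
print(record.get("Author", "unknown"))
```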

Dictionaries are a great way to work with the metadata in our corpus. Let's build a dictionary called metadata:

Your first line will look like this:

  metadata = {}

In [None]:
metadata = {}
for line in data[0].split('\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':', 1)
    metadata[element[0]] = element[-1]
print(metadata)

In [None]:
print(metadata['Date'])

Let's turn that into a function as well - we will be coming back to it for each file.

In [None]:
def parse_metadata(text):
    metadata = {}
    for line in text.split('\n'):
        if not line:
            continue
        if line[0] == '<':
            continue
        element = line.split(':', 1)
        metadata[element[0]] = element[-1].strip(' ')
    return metadata

In [None]:
md = parse_metadata(data[0])
print(md)
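To see why splitting on only the first colon matters, here is the same parsing logic run on a made-up header. The function is restated so the sketch stands alone, and the sample text is an assumption about what a header might look like, not a copy of a real file.

```python
def parse_metadata(text):
    """Restatement of the function above so this example is self-contained."""
    metadata = {}
    for line in text.split("\n"):
        if not line or line[0] == "<":
            continue
        element = line.split(":", 1)
        metadata[element[0]] = element[-1].strip(" ")
    return metadata


# Made-up header: note the second colon inside the title.
sample = "<!--metadata-->\nTitle: Wool: prices and markets\nDate: 12/03/1965\n"

print("Title: Wool: prices and markets".split(":"))  # splitting on every colon breaks the title apart
print(parse_metadata(sample))
```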

### Metadata collection
Now that we have all the tools we need to collect each file's metadata, let's put it into a data structure so we can do some analysis.

In [None]:
fraser_talks_metadata = {}

for file in os.listdir('UMA_Fraser_Radio_Talks'):
    # If anything goes wrong, we will know which files to look at.
    try:
        filename = os.path.join('UMA_Fraser_Radio_Talks', file)
        encoding = get_charset(filename)
        text = open(filename, 'r', encoding=encoding).read()
    except Exception:
        print("file is", filename, "and its chardet data is", get_charset(filename))
        continue
    
    
    #split text of file on 'end metadata'
    data = text.split("<!--end metadata-->")
    
    #parse metadata using previously defined function "parse_metadata"
    metadata = parse_metadata(data[0])
    talk_data = data[1]
      
    fraser_talks_metadata[file]={'metadata':metadata, 'talk_data':talk_data}
    


## Every error will be fixed. 

Something is wrong with that file. We will need to open it up to take a look. When we [open the file in jupyter notebook](UMA_Fraser_Radio_Talks/UDS2013680-152-full.txt) we can see at the very bottom two odd looking characters.

![Unusual characters](imgs/weird_characters.png "Unusual characters") 
             
What we have found is a document with conflicting information about what character encoding it uses - those two characters are unusual. On inspection, they are 0x0C (the Form Feed character) and 0xDC (unknown). 

Since it's only one file and the process for discovery is nerdy and tedious, we will re-run the above with what professionals call [a *hack*](workingprocess.ipynb). I will put in a one time exception for this single file. You will not pass a PhD with hacks.

In [None]:
fraser_talks_metadata = {}

for file in os.listdir('UMA_Fraser_Radio_Talks'):
    # If anything goes wrong, we will know which files to look at.
    try:
        filename = os.path.join('UMA_Fraser_Radio_Talks', file)
        encoding = get_charset(filename)
        text = open(filename, 'r', encoding=encoding).read()
    except:
        # Special case: open the file as binary and only read the first 2650 bytes;
        # once they are read, decode the bytes as ASCII
        file_handle = open(filename, 'rb')
        text = file_handle.read(2650).decode('ascii')
            
    #split text of file on 'end metadata'
    data = text.split("<!--end metadata-->")
    
    #parse metadata using previously defined function "parse_metadata"
    metadata = parse_metadata(data[0])
    talk_data = data[1]
      
    fraser_talks_metadata[file]={'metadata':metadata, 'talk_data':talk_data}

fraser_talks_metadata['UDS2013680-152-full.txt']['metadata']
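As an alternative to cutting the file off at a fixed byte count, Python's `errors="replace"` option swaps each undecodable byte for the Unicode replacement character. This is a sketch on made-up bytes that end with the two characters described above, not the approach used in this notebook.

```python
# Made-up bytes ending in 0x0C (form feed, valid ASCII) and 0xDC (not ASCII).
raw = b"...end of talk.\x0c\xdc"

strict_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError as error:
    strict_failed = True
    print("strict decoding fails at byte offset", error.start)

# errors="replace" keeps everything else and marks the bad byte with U+FFFD
text = raw.decode("ascii", errors="replace")
print(repr(text))
```

The whole talk survives this way, at the cost of a visible marker wherever a byte could not be interpreted.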

In [None]:
from collections import Counter

#fraser_talks_metadata.keys()

dates = []

for file_id in fraser_talks_metadata:
    date = fraser_talks_metadata[file_id]['metadata']['Date']
    if date.startswith('c'):
        # date format cyyyy
        year = date[1:]
    elif len(date) == 10:
        # date format dd/mm/yyyy
        year = date[6:]
    elif len(date) == 9:
        # date format d/mm/yyyy
        year = date[5:]
    if len(year) == 5:
        # pesky space in 1969
        year = year.lstrip()
    # Let's add the year to the existing metadata for later, keeping the other fields
    fraser_talks_metadata[file_id]['metadata']['year'] = year
    dates.append(year)
        
Counter(dates)
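The length checks above work for the date formats seen so far. As a hedged alternative, a regular expression can pull out the last four-digit run regardless of what surrounds it. The helper name `extract_year` and the sample dates are made up for illustration.

```python
import re


def extract_year(date):
    """Return the last four-digit run in a date string, or None if there is none."""
    matches = re.findall(r"\d{4}", date)
    return matches[-1] if matches else None


# Made-up samples covering the formats mentioned in the comments above.
for sample in ["c1960", "12/03/1965", "2/03/1965", "1/01/ 1969", "undated"]:
    print(repr(sample), "->", extract_year(sample))
```

Returning `None` for an unparseable date makes the gaps explicit instead of silently reusing the previous file's year.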

Can we order those years? Yes. At the moment they are strings, but four-digit year strings sort in the same order as the integers they represent, so a plain sort is enough here.

In [None]:
from collections import OrderedDict

data_summary = OrderedDict(sorted(Counter(dates).items(), key=lambda t: t))
data_summary
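If the years were not all four digits long, sorting them as strings would go wrong. A sketch of sorting explicitly by integer value, on a made-up list of years:

```python
from collections import Counter

dates = ["1966", "1957", "1966", "1983", "1957", "1966"]  # made-up sample

counts = Counter(dates)

# Convert each key to an integer only for the comparison; the keys stay strings.
by_year = dict(sorted(counts.items(), key=lambda item: int(item[0])))
print(by_year)
```

A plain `dict` keeps insertion order in Python 3.7 and later, so an `OrderedDict` is optional here.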


In [None]:
### TODO describe slicing arrays and strings
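A brief illustration of slicing, the `[start:stop]` notation used throughout this notebook on both strings and lists. The values are made up.

```python
date = "12/03/1965"
print(date[6:])   # from index 6 to the end
print(date[:2])   # from the start up to, but not including, index 2
print(date[-4:])  # the last four characters, whatever the total length

names = ["a.txt", "b.txt", "c.txt", "d.txt"]  # made-up file names
print(names[:3])   # the first three items
print(names[1:3])  # the items at index 1 and 2
```

Counting from the end with a negative index would handle both `d/mm/yyyy` and `dd/mm/yyyy` without checking the length first.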

In [None]:
#### TODO plot that distribution

In [None]:
import matplotlib.pyplot as plt

#data_summary = Counter(dates)

# we make a bar graph, the y axis will be the values in the data_summary
plt.bar(range(len(data_summary)), list(data_summary.values()),align='center')

# we add in the x axis details - and turn the years around so they fit
plt.xticks(range(len(data_summary)), list(data_summary.keys()),rotation=90)

plt.show()

Ok. That looks great. Right away we can see that something momentous is happening in 1966 and 1967 - the lowest and highest number of talks respectively. If we look at his Wikipedia page, we can see that '66 is the year that he becomes a Government minister for the first time - and it's the Ministry for the Army as we enter the Vietnam War. 

Those records will be interesting to look at more closely later.

### Linguistic analysis

There are many things that can now be done. What that might be depends on how much linguistic gymnastics you want to do. 

I'm going to start with some simple "shape of the data" analyses. Let's see the length of each talk, averages across the years, number of sentences - easy stuff to do with what we've learnt. 

Then we will utilise some of spaCy's real power to count how many of each of the [parts of speech](https://spacy.io/api/annotation#pos-tagging) exist.

In [None]:
import spacy
# load the small English model that we downloaded at the start
nlp = spacy.load('en_core_web_sm')

fraser_talks_data_shape = {}

for file_id in fraser_talks_metadata:
    year = fraser_talks_metadata[file_id]['metadata']['year']
    doc = fraser_talks_metadata[file_id]['talk_data']
    nlp_doc = nlp(doc)
    
    word_count = len(nlp_doc)
    
    sentences = list(nlp_doc.sents)
    sentence_count = len(sentences)
      
    fraser_talks_data_shape[file_id] = {'word_count':word_count,'sentences':sentence_count, 'year':year}

    
    # Now let's count the parts of speech.
    # count_by returns a dictionary that maps integer part-of-speech IDs to counts
    counts_dict = nlp_doc.count_by(spacy.attrs.IDS['POS'])

    # Store each count under its human readable part of speech tag
    for pos, count in counts_dict.items():
        human_readable_tag = nlp_doc.vocab[pos].text
        fraser_talks_data_shape[file_id][human_readable_tag] = count


In [None]:
for talk in list(fraser_talks_data_shape)[0:3]:
    print(talk,":",fraser_talks_data_shape[talk],"\n")
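The averages across the years mentioned earlier can be computed from a structure shaped like `fraser_talks_data_shape`. This is a sketch on made-up numbers; the file names and counts below are placeholders, not results from the archive.

```python
from collections import defaultdict
from statistics import mean

# Made-up stand-in with the same keys as fraser_talks_data_shape.
shape = {
    "talk_a.txt": {"word_count": 900, "sentences": 40, "year": "1966"},
    "talk_b.txt": {"word_count": 1100, "sentences": 52, "year": "1966"},
    "talk_c.txt": {"word_count": 800, "sentences": 35, "year": "1967"},
}

# Group the word counts by year, then average each group.
per_year = defaultdict(list)
for stats in shape.values():
    per_year[stats["year"]].append(stats["word_count"])

averages = {year: mean(counts) for year, counts in sorted(per_year.items())}
print(averages)
```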

### Let's get some common word counts

In [None]:
#from collections import Counter
#nlp = spacy.load('en')

decade_stats = {}

for file_id in list(fraser_talks_metadata)[0:3]:
    year = fraser_talks_metadata[file_id]['metadata']['year']
    doc = fraser_talks_metadata[file_id]['talk_data']
    nlp_doc = nlp(doc)

    # all tokens that aren't stop words, punctuation or white space.
    # spaCy prides itself on its token list being an exact replica of the source.
    # We don't want "line end" to be the most common word
    words = [token.text for token in nlp_doc if not token.is_stop and not token.is_punct and not token.is_space]

    # noun tokens that aren't stop words, punctuation or white space
    nouns = [token.text for token in nlp_doc if not token.is_stop and not token.is_punct and not token.is_space and token.pos_ == "NOUN"]

    # five most common tokens
    word_freq = Counter(words)
    common_words = word_freq.most_common(5)

    # five most common noun tokens
    noun_freq = Counter(nouns)
    common_nouns = noun_freq.most_common(5)
        
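    if year in decade_stats:
        # extend the stored lists in place; a bare `a + b` builds a new list and throws it away
        decade_stats[year]['words'].extend(words)
        decade_stats[year]['nouns'].extend(nouns)
    else:
        decade_stats[year] = {'words':words, 'nouns':nouns}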

    print(file_id, '\n','common words', common_words,'\n' 'common nouns', common_nouns, '\n')
        
for year in decade_stats.keys():
    # five most common tokens
    word_freq = Counter(decade_stats[year]['words'])
    common_words = word_freq.most_common(5)

    # five most common noun tokens
    noun_freq = Counter(decade_stats[year]['nouns'])
    common_nouns = noun_freq.most_common(5)

    print(year, common_words, common_nouns)

In [None]:
#from collections import Counter
#nlp = spacy.load('en')

decade_stats = {}

for file_id in list(fraser_talks_metadata):
    year = fraser_talks_metadata[file_id]['metadata']['year']
    doc = fraser_talks_metadata[file_id]['talk_data']
    nlp_doc = nlp(doc.lower())

    # all tokens that aren't stop words, punctuation or white space.
    # spaCy prides itself on its token list being an exact replica of the source.
    # We don't want "line end" to be the most common word
    words = [token.text for token in nlp_doc if not token.is_stop and not token.is_punct and not token.is_space]

    # noun tokens that aren't stop words, punctuation or white space
    nouns = [token.text for token in nlp_doc if not token.is_stop and not token.is_punct and not token.is_space and token.pos_ == "NOUN"]

    # five most common tokens
    # and lets lowercase everything
    word_freq = Counter(words)
    common_words = word_freq.most_common(5)

    # five most common noun tokens
    noun_freq = Counter(nouns)
    common_nouns = noun_freq.most_common(5)
    
    fraser_talks_data_shape[file_id]['common words'] = common_words
    fraser_talks_data_shape[file_id]['common nouns'] = common_nouns
    
    decade = year[2]
    
    # per year information
    if year in decade_stats:
        # extend the stored lists in place; a bare `a + b` builds a new list and throws it away
        decade_stats[year]['words'].extend(words)
        decade_stats[year]['nouns'].extend(nouns)
    else:
        # store copies so the year and decade entries never share one list
        decade_stats[year] = {'words':list(words), 'nouns':list(nouns)}

    # per decade information
    if decade in decade_stats:
        decade_stats[decade]['words'].extend(words)
        decade_stats[decade]['nouns'].extend(nouns)
    else:
        decade_stats[decade] = {'words':list(words), 'nouns':list(nouns)}


for year in data_summary.keys():
    # five most common tokens
    word_freq = Counter(decade_stats[year]['words'])
    common_words = word_freq.most_common(5)
    decade_stats[year]['common_words'] = common_words
    
    # five most common noun tokens
    noun_freq = Counter(decade_stats[year]['nouns'])
    common_nouns = noun_freq.most_common(5)
    decade_stats[year]['common_nouns'] = common_nouns
    
    #print(year, common_words, common_nouns)
    print(year, common_words[0], common_nouns[0])



In [None]:
for decade in '5','6','7','8':
    # five most common tokens
    word_freq = Counter(decade_stats[decade]['words'])
    common_words = word_freq.most_common(5)
    decade_stats[decade]['common_words'] = common_words
    
    # five most common noun tokens
    noun_freq = Counter(decade_stats[decade]['nouns'])
    common_nouns = noun_freq.most_common(5)
    decade_stats[decade]['common_nouns'] = common_nouns
    
    #print(year, common_words, common_nouns)
    print(decade, common_words, common_nouns)
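The grouping by year and by decade can also be written with `defaultdict`, which removes the need for the `if ... in ... else` branches. This is a sketch on made-up word lists; the decade labels such as `1950s` are a choice made for this example and differ from the single-digit keys used above.

```python
from collections import Counter, defaultdict

# Made-up (year, words) pairs standing in for the per-talk token lists.
talks = [
    ("1957", ["wool", "prices", "wool"]),
    ("1957", ["shipping", "wool"]),
    ("1966", ["army", "vietnam", "army"]),
]

words_by_year = defaultdict(list)
words_by_decade = defaultdict(list)

for year, words in talks:
    # extend adds to the stored list in place; a missing key starts as an empty list
    words_by_year[year].extend(words)
    words_by_decade[year[:3] + "0s"].extend(words)

for decade, words in words_by_decade.items():
    print(decade, Counter(words).most_common(2))
```

Keeping years and decades in separate dictionaries also avoids mixing two kinds of key in one structure.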


In [None]:
import json
with open('data.json', 'w') as fp:
    json.dump([fraser_talks_data_shape,fraser_talks_metadata,data_summary,decade_stats], fp)
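One thing to keep in mind when reading the file back: JSON has no tuple type, so the `(word, count)` pairs produced by `most_common` come back as two-item lists. A minimal sketch on made-up data:

```python
import json
from collections import Counter

# Made-up summary in the same shape as the common word lists above.
summary = {"1966": Counter(["army", "army", "vietnam"]).most_common(2)}

encoded = json.dumps(summary)
decoded = json.loads(encoded)

print(summary)  # pairs are tuples before saving
print(decoded)  # pairs are lists after loading
```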
    

In [None]:
# Lemmatization is the linguistic process of grouping together similar words by reducing them to their common 
# form

""" 
From the [Stanford NLP book](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related 
forms of a word to a common base form. For instance:

    am, are, is -> be
    car, cars, car's, cars' -> car 

The result of this mapping of text will be something like:

    the boy's cars are different colors ->
    the boy car be differ color 
"""
    
for file_id in list(fraser_talks_metadata):
    year = fraser_talks_metadata[file_id]['metadata']['year']
    doc = fraser_talks_metadata[file_id]['talk_data']
    nlp_doc = nlp(doc.lower())

    # lemmas of all tokens that aren't stop words, punctuation or white space.
    # spaCy prides itself on its token list being an exact replica of the source.
    # We don't want "line end" to be the most common word
    lemmas = [token.lemma_ for token in nlp_doc if not token.is_stop and not token.is_punct and not token.is_space]

    # five most common lemmas
    # (the text was lowercased before it was passed to spaCy)
    lemmas_freq = Counter(lemmas)
    common_lemmas = lemmas_freq.most_common(5)

    fraser_talks_data_shape[file_id]['common lemmas'] = common_lemmas
       
    decade = year[2]
    
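    # per year information: accumulate the lemmas and keep the word and noun lists already stored
    decade_stats.setdefault(year, {}).setdefault('lemmas', []).extend(lemmas)
    
    # per decade information
    decade_stats.setdefault(decade, {}).setdefault('lemmas', []).extend(lemmas)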


for year in data_summary.keys():
    # five most common lemmas
    lemmas_freq = Counter(decade_stats[year]['lemmas'])
    common_lemmas = lemmas_freq.most_common(5)
    decade_stats[year]['common_lemmas'] = common_lemmas
    
    #print(year, common_words, common_nouns)
    print(year, common_lemmas)
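To show what counting lemmas rather than raw words buys us, here is a toy version that needs no language model. The lookup table is built by hand from the Stanford examples quoted above; spaCy uses its own rules and data, so this only illustrates the idea and is not how `token.lemma_` is computed.

```python
from collections import Counter

# Hand-made lookup table, for illustration only.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}


def toy_lemma(word):
    """Lowercase the word and map it to its base form if we know one."""
    word = word.lower()
    return LEMMAS.get(word, word)


tokens = "The cars are red and the car is fast".split()
lemma_counts = Counter(toy_lemma(token) for token in tokens)
print(lemma_counts.most_common(3))
```

Without the mapping, `cars` and `car` would be counted separately, as would `are` and `is`.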
