
# Session 2: Loading text, tokenisation, tagging, dictionaries and ngrams

In [144]:
from __future__ import print_function, division

import sys
import nltk
from IPython.display import display, clear_output
sys.path.append("/usr/lib/python2.7/site-packages/")
%matplotlib inline

In [145]:
from nltk.book import *

**Welcome back!**

So, what did we learn yesterday? A brief recap:

* The **IPython** Notebook
* **Python**: syntax, variables, functions, etc.

Today's focus will be on **developing more advanced NLTK skills** and using these skills to **investigate our own data**. 

*Any questions or anything before we dive in?*

## Dirty data

Now that we're going beyond the NLTK example data, we're bound to run into dirty data.

A common part of corpus building is corpus cleaning. Reasons for cleaning include:

1. Avoiding code that breaks on unexpected input
2. Ensuring that searches match as many examples as possible
3. Improving readability and the accuracy of taggers, stemmers, parsers, etc.

The level and kind of cleaning depends on your data, the aims of your project and where you are in your research. In the case of very clean data (lucky you!), there may be little that needs to be done. With messy data (online conversation, very old books), you may need to go as far as correcting variant spellings.

If you need help with data cleaning, we offer training in [OpenRefine](https://github.com/yuandra/2016-02-01-data-acquisition-cleaning/blob/gh-pages/open-refine-01-intro.md).
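As a rough illustration of what light cleaning can look like in Python, here is a minimal sketch. The `clean` function and the `VARIANTS` mapping are hypothetical names made up for this example; a real project would build its own list of variant spellings from its own data.

```python
import re

# Hypothetical mapping of variant spellings to a standard form.
# These two entries are illustrative only.
VARIANTS = {"publick": "public", "shew": "show"}

def clean(raw):
    """Return a lightly normalised copy of raw text."""
    text = raw.replace("\ufeff", "")          # drop a stray byte order mark
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    # Replace known variant spellings, leaving every other word untouched
    words = [VARIANTS.get(w.lower(), w) for w in text.split(" ")]
    return " ".join(words)

sample = "\ufeffmaking them  beneficial\nto the publick"
print(clean(sample))  # making them beneficial to the public
```

Whether to normalise spelling at all is a research decision: it helps searches match more examples, but it also erases information about the original text.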

### Discussion

*What are the characteristics of clean and messy data? Any personal experiences? Discuss with your neighbours.* 

It will be important to bear these characteristics in mind once you start building your own datasets and corpora. 

## Uploading text files

First of all, let's load in our text.

Search for Project Gutenberg and download a book as a plain text file.

I chose [A Modest Proposal](https://www.gutenberg.org/ebooks/1080)

We can also look at file contents within the IPython Notebook itself:

In [146]:
import os

In [147]:
# import tokenizers
from nltk import word_tokenize
from nltk.text import Text

In [148]:
text_path = '/home/researcher/modest_proposal.txt'

In [149]:
# Open the file inside a "with" block so that it is closed automatically
with open(text_path, "r", encoding='UTF-8') as file:
    text = file.read()
print(text)

﻿The Project Gutenberg EBook of A Modest Proposal, by Jonathan Swift

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: A Modest Proposal
       For preventing the children of poor people in Ireland,
       from being a burden on their parents or country, and for
       making them beneficial to the publick - 1729

Author: Jonathan Swift

Posting Date: July 27, 2008 [EBook #1080]
Release Date: October 1997

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK A MODEST PROPOSAL ***




Produced by An Anonymous Volunteer





A MODEST PROPOSAL

For preventing the children of poor people in Ireland, from being a
burden on their parents or country, and for making them beneficial to
the publick.

by Dr. Jonathan Swift


1729



It is a melancholy object to those, who walk throu

The books we were working with yesterday had already had some processing done on them so that we could use NLTK to find features of the language. Remember that Python regards a text file as a single long string of characters. The first thing to do is to start breaking the text up into sentences and words.

In [151]:
from nltk import word_tokenize
text = open('/home/researcher/modest_proposal.txt', "r", encoding='UTF-8').read() 
tokens = word_tokenize(text)
print(tokens[:100])

['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'A', 'Modest', 'Proposal', ',', 'by', 'Jonathan', 'Swift', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org', 'Title', ':', 'A', 'Modest', 'Proposal', 'For', 'preventing', 'the', 'children', 'of', 'poor', 'people', 'in', 'Ireland', ',', 'from', 'being', 'a', 'burden', 'on', 'their', 'parents', 'or', 'country', ',', 'and', 'for', 'making', 'them', 'beneficial', 'to', 'the', 'publick', '-', '1729', 'Author', ':', 'Jonathan', 'Swift', 'Posting', 'Date', ':']
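Notice that the first token is `'\ufeffThe'`: the file starts with a byte order mark, and the Project Gutenberg header has been tokenised along with the book. The sketch below shows one possible way to trim both before tokenising. The `strip_gutenberg` helper is a hypothetical name, the `START` marker is copied from the file printed above, and the wording of the `END` marker is an assumption that you should check against your own file.

```python
START = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END = "*** END OF THIS PROJECT GUTENBERG EBOOK"  # assumed wording of the closing marker

def strip_gutenberg(raw):
    """Return the text between the START and END marker lines, if present."""
    text = raw.lstrip("\ufeff")  # remove a leading byte order mark
    start = text.find(START)
    if start != -1:
        # Skip everything up to the end of the marker line
        newline = text.find("\n", start)
        text = text[newline + 1:] if newline != -1 else ""
    end = text.find(END)
    if end != -1:
        text = text[:end]
    return text.strip()

sample = (
    "\ufeffThe Project Gutenberg EBook of A Modest Proposal\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK A MODEST PROPOSAL ***\n"
    "A MODEST PROPOSAL\n"
    "It is a melancholy object to those, who walk\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK A MODEST PROPOSAL ***\n"
    "Licence text follows.\n"
)
body = strip_gutenberg(sample)
print(body)
```

Another option for the byte order mark alone is to open the file with `encoding='utf-8-sig'`, which removes it while reading.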


**Challenge!**

1. Find a .txt file from Project Gutenberg or elsewhere and upload it to the Jupyter Notebook.
2. Use `word_tokenize` to break up the text data.
3. Print the first 100 tokens.

Breaking a text into tokens lets us do the sort of word counting that we were doing yesterday. We can do some more interesting linguistic analysis if we use Part of Speech tagging. NLTK supports a number of different Part of Speech tagsets, but the simplest one is called 'Universal', and we'll use that here.

In [31]:
sentence = "They refuse to permit us the refuse permit"
words = word_tokenize(sentence)
tagged = nltk.pos_tag(words, tagset='universal')
print(tagged)

[('They', u'PRON'), ('refuse', u'VERB'), ('to', u'PRT'), ('permit', u'VERB'), ('us', u'PRON'), ('the', u'DET'), ('refuse', u'NOUN'), ('permit', u'NOUN')]


Part of Speech tagging pairs each word with its tag. Each pair is a Python tuple, shown above in round brackets: the word comes first and the tag second.

In [32]:
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged)
tag_fd.most_common()

[(u'PRON', 2), (u'VERB', 2), (u'NOUN', 2), (u'DET', 1), (u'PRT', 1)]
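To make the structure of the tagged output concrete, here is a small sketch that works on the tagged sentence copied in by hand, so it runs without NLTK. It uses `collections.Counter` from the standard library for the counting; NLTK's `FreqDist` is built on `Counter`, so the two behave very similarly.

```python
from collections import Counter

# The tagged sentence from above, typed in by hand
tagged = [('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'),
          ('us', 'PRON'), ('the', 'DET'), ('refuse', 'NOUN'), ('permit', 'NOUN')]

# Each item is a (word, tag) tuple that can be unpacked into two variables
word, tag = tagged[1]
print(word, tag)  # refuse VERB

# Count how often each tag occurs
tag_counts = Counter(tag for (word, tag) in tagged)
print(tag_counts['NOUN'])  # 2
```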

**Challenge!**

Use Part of Speech tagging to tag the text that we have just tokenised, then do the following:
* Find the most common parts of speech
* Find the most common verbs and create a frequency distribution graph of your result
* Find the 10 most common nouns in the text

*Hint: to find the most common verbs and nouns, you will need to create a list that contains only the verbs or only the nouns from the text. Use a for loop to create your list. Then create a frequency distribution.*

In [94]:
tagged_text = nltk.pos_tag(tokens, tagset = 'universal')
text_fd = nltk.FreqDist(tag for (word, tag) in tagged_text)
text_fd.most_common()

[('NOUN', 1871),
 ('VERB', 1082),
 ('ADP', 905),
 ('.', 846),
 ('DET', 751),
 ('ADJ', 503),
 ('PRON', 399),
 ('CONJ', 314),
 ('ADV', 296),
 ('PRT', 208),
 ('NUM', 129)]

In [98]:
verblist = []
for (word, tag) in tagged_text:
    if tag == 'VERB':
        verblist.append(word)
# Check the length of the list of verbs. 
#If it matches the number of verbs above, you can be fairly sure your loop has worked as expected
print(len(verblist))
verb_fd = nltk.FreqDist(verblist)
print(verb_fd.most_common()[:10])

1082
[('be', 67), ('is', 47), ('are', 41), ('will', 41), ('have', 33), ('can', 29), ('may', 26), ('would', 18), ('being', 17), ('do', 16)]


In [101]:
nounlist = []
for (word, tag) in tagged_text:
    if tag == 'NOUN':
        nounlist.append(word)
print(nounlist[:10])
print(len(nounlist))
noun_fd = nltk.FreqDist(nounlist)
print(noun_fd.most_common()[:10])

['Project', 'Gutenberg', 'EBook', 'A', 'Modest', 'Proposal', 'Jonathan', 'Swift', 'eBook', 'use']
1871
[('Project', 80), ('Gutenberg-tm', 55), ('work', 49), ('works', 30), ('Gutenberg', 27), ('Foundation', 24), ('children', 20), ('terms', 19), ('agreement', 17), ('country', 16)]
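The verb loop and the noun loop above differ only in the tag they test for, so they could be folded into one small function. This is just a sketch: `words_with_tag` is a hypothetical helper name, and the sample data is typed in by hand so the example runs on its own.

```python
def words_with_tag(tagged_pairs, wanted_tag):
    """Return the words whose tag equals wanted_tag, in their original order."""
    return [word for (word, tag) in tagged_pairs if tag == wanted_tag]

sample = [('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'),
          ('us', 'PRON'), ('the', 'DET'), ('refuse', 'NOUN'), ('permit', 'NOUN')]

print(words_with_tag(sample, 'VERB'))  # ['refuse', 'permit']
print(words_with_tag(sample, 'DET'))   # ['the']
```

With the real data you would call it as `words_with_tag(tagged_text, 'NOUN')` and pass the result to `nltk.FreqDist`, exactly as in the cells above.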


**Extension**
There are a few things to note about this result. Project and Gutenberg have been returned as two different, very frequent nouns. Because we're humans, not computers, we know it's likely that they often occur together. We could test for bigrams (words that typically occur side by side) to see if this is the case.

In order to perform this test, we must first convert our list of tokens into an NLTK text. We can then use specific NLTK functions on the text.

In [103]:
print(type(tokens))
nltk_text = nltk.Text(tokens)
print(type(nltk_text))
nltk_text.collocations()

<class 'list'>
<class 'nltk.text.Text'>
Project Gutenberg-tm; Project Gutenberg; Literary Archive; Archive
Foundation; Gutenberg Literary; United States; Gutenberg-tm
electronic; electronic works; set forth; public domain; electronic
work; Gutenberg-tm License; Jonathan Swift; per annum; copyright
holder; MODEST PROPOSAL; Modest Proposal; PROJECT GUTENBERG; twenty
thousand; year old


### Some linguistics...

*Functional linguistics* is a research area concerned with how *realised language* (lexis and grammar) works to achieve meaningful social functions.

One functional linguistic theory is *Systemic Functional Linguistics*, developed by Michael Halliday (Prof. Emeritus at University of Sydney).

Central to the theory is a division between **experiential meanings** and **interpersonal meanings**.

* Experiential meanings communicate what happened to whom, under what circumstances.
* Interpersonal meanings negotiate identities and role relationships between speakers 

Halliday argues that these two kinds of meaning are realised **simultaneously** through different parts of English grammar.

* Experiential meanings are made through **transitivity choices**.
* Interpersonal meanings are made through **mood choices**


Transitivity choices include fitting together configurations of:

* Participants (*a man, green bikes*)
* Processes (*sleep, has always been, is considering*)
* Circumstances (*on the weekend*, *in Australia*)

Mood features of a language include:

* Mood types (*declarative, interrogative, imperative*)
* Modality (*would, can, might*)
* Lexical density (the number of words per clause, the ratio of content to non-content words, etc.)

Lexical density is usually a good indicator of the general tone of texts. The language of academia, for example, often has a high ratio of nouns to verbs. We can approximate an academic tone simply by making nominally dense clauses: 

      The consideration of interest is the potential for a participant of a certain demographic to be in Group A or Group B.

Notice how not only are there many nouns (*consideration*, *interest*, *potential*, etc.), but that the verbs are very simple (*is*, *to be*).

In comparison, informal speech is characterised by smaller clauses, and thus more verbs.

      A: Did you feel like dropping by?
      B: I thought I did, but now I don't think I want to

Here, we have only a few, simple nouns (*you*, *I*), with more expressive verbs (*feel*, *dropping by*, *think*, *want*)

> **Note**: SFL argues that through *grammatical metaphor*, one linguistic feature can stand in for another. *Would you please shut the door?* is an interrogative, but it functions as a command. *invitation* is a nominalisation of a process, *invite*. We don't have time to deal with these kinds of realisations, unfortunately.

In the context of Fraser's speech, there are nearly twice as many nouns as verbs, and the verbs are generally quite simple ones (parts of To Be and To Have make up about a quarter). This suggests that Fraser's speech, even when giving a radio talk to his electorate, is more towards the formal end of the spectrum. 
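The same kind of noun-to-verb comparison can be made for any tagged text. As a sketch, the code below uses the tag counts printed earlier for *A Modest Proposal* (which still include the Project Gutenberg licence text). Treating nouns, verbs, adjectives and adverbs as the "content" tags is an assumption made for this example, not a fixed definition.

```python
# Tag counts copied from the output of text_fd.most_common() above
tag_counts = {'NOUN': 1871, 'VERB': 1082, 'ADP': 905, '.': 846, 'DET': 751,
              'ADJ': 503, 'PRON': 399, 'CONJ': 314, 'ADV': 296, 'PRT': 208,
              'NUM': 129}

# How many nouns are there for every verb?
noun_verb_ratio = tag_counts['NOUN'] / tag_counts['VERB']

# Assumption: these four tags count as content words
CONTENT_TAGS = {'NOUN', 'VERB', 'ADJ', 'ADV'}
words_total = sum(n for tag, n in tag_counts.items() if tag != '.')  # ignore punctuation
content_total = sum(tag_counts[tag] for tag in CONTENT_TAGS)
content_share = content_total / words_total

print(round(noun_verb_ratio, 2))  # 1.73
print(round(content_share, 3))    # 0.581
```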

## Recap
So far today we have:
* Imported text into NLTK
* Tokenised raw text into words
* Tagged words as parts of speech
* Converted a list into NLTK Text for further analysis

## Stopwords
Yesterday, when we did our frequency counts of the books in the NLTK Library, we noticed that a lot of space was taken up by little words like 'and' and 'of' and 'the' which don't add a lot to our understanding of text. These are called 'stop words'. It will help our analysis if we exclude them.

In [104]:
fdist1 = nltk.FreqDist(tokens)
fdist1.most_common()[:20]

[(',', 510),
 ('the', 327),
 ('of', 236),
 ('.', 183),
 ('to', 182),
 ('and', 177),
 ('a', 141),
 ('in', 125),
 ('or', 106),
 ('Project', 83),
 ('be', 67),
 ('for', 63),
 ('this', 60),
 ('with', 58),
 ('by', 55),
 ('Gutenberg-tm', 55),
 ('I', 55),
 ('you', 52),
 ('that', 51),
 ('work', 50)]

In [105]:
print(len(nltk_text))
print(len(set(nltk_text)))

7304
1795


In [106]:
# First, let's get rid of the punctuation
text = [word for word in nltk_text if word.isalpha()]
print(len(text))
# Then get rid of capitals
vocab = [word.lower() for word in text]
print(len(set(vocab)))

6273
1551


In [107]:
from nltk.corpus import stopwords
# Store the English stopwords from the NLTK corpus once, as a set, so each lookup is fast
ignored_words = set(stopwords.words('english'))
unstopped = [word for word in vocab if word not in ignored_words]
fdist2 = nltk.FreqDist(unstopped)
fdist2.most_common()[:20]

[('project', 88),
 ('work', 51),
 ('works', 32),
 ('gutenberg', 30),
 ('electronic', 27),
 ('may', 26),
 ('foundation', 25),
 ('terms', 21),
 ('children', 20),
 ('agreement', 18),
 ('would', 18),
 ('one', 17),
 ('country', 16),
 ('thousand', 15),
 ('kingdom', 15),
 ('donations', 15),
 ('upon', 15),
 ('license', 15),
 ('states', 14),
 ('number', 14)]

The list we have now is probably more interesting if we want to get a sense of the key issues in the text. Note that we're working with a very small sample here. This sort of analysis is much more useful over really big corpora.

*Note: We could have condensed the cleaning steps (removing punctuation, lowercasing and removing stopwords) into a single line of code that looks like this:*

        unstopped = [word.lower() for word in tokens if word.isalpha() and word.lower() not in ignored_words]
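To see the filtering logic in isolation, here is a tiny stand-alone version. The `STOP` set and the token list are hand-made for illustration only; the lesson itself uses NLTK's much longer English stopword list.

```python
# Hand-made stopword set, for illustration only
STOP = {'the', 'of', 'and', 'a', 'to', 'in'}
raw_tokens = ['The', 'children', 'of', 'the', 'poor', ',', 'and', 'the', 'Kingdom', '.']

# Keep a token only if it is alphabetic and its lowercase form is not a stopword
kept = [w.lower() for w in raw_tokens if w.isalpha() and w.lower() not in STOP]
print(kept)  # ['children', 'poor', 'kingdom']
```

Using a set rather than a list for the stopwords is a small design choice: checking membership in a set stays fast however many stopwords there are.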

## Collocation
We've just used collocation to test a hypothesis about the most common nouns in the text we were investigating. Collocation can be quite a powerful tool for finding features of language.

First, let's look for bigrams in the whole list of tokens:

In [108]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
sorted(finder.nbest(bigram_measures.raw_freq, 10))

[(',', 'and'),
 (',', 'or'),
 (',', 'that'),
 (',', 'the'),
 ('Project', 'Gutenberg'),
 ('Project', 'Gutenberg-tm'),
 ('in', 'the'),
 ('of', 'the'),
 ('the', 'Project'),
 ('to', 'the')]

That doesn't tell us much. Let's try again with `unstopped`, our list of tokens with the punctuation and stopwords removed.

In [109]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(unstopped)
sorted(finder.nbest(bigram_measures.raw_freq, 10))

[('archive', 'foundation'),
 ('electronic', 'work'),
 ('electronic', 'works'),
 ('gutenberg', 'literary'),
 ('literary', 'archive'),
 ('project', 'electronic'),
 ('project', 'gutenberg'),
 ('project', 'license'),
 ('terms', 'agreement'),
 ('united', 'states')]

As well as identifying collocations (words that appear near each other), we can also look for n-grams or clusters, which appear immediately adjacent to each other. Repeated N-grams are a good way to get a sense of what a text is about. First, let's see how n-grams are created:

In [131]:
print(sent2)

['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']


In [129]:
from nltk.util import ngrams
trigrams = ngrams(sent2, 3)
for gram in trigrams:
    print(gram)

('The', 'family', 'of')
('family', 'of', 'Dashwood')
('of', 'Dashwood', 'had')
('Dashwood', 'had', 'long')
('had', 'long', 'been')
('long', 'been', 'settled')
('been', 'settled', 'in')
('settled', 'in', 'Sussex')
('in', 'Sussex', '.')


There are a lot of trigrams in the sentence, and they don't tell us much. It's when n-grams are repeated that they start to get interesting, but before we write the code for that we need to have some knowledge of dictionaries...
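If you are curious how the sliding window behind n-grams works, it can be sketched in a couple of lines of plain Python. `make_ngrams` is a hypothetical helper written for this example and is not how NLTK implements `ngrams`; it only produces the same tuples for a list input.

```python
def make_ngrams(sequence, n):
    """Return all runs of n adjacent items as a list of tuples."""
    # Line up the sequence with copies of itself shifted by 1, 2, ... n-1 places
    return list(zip(*(sequence[i:] for i in range(n))))

sent = ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled',
        'in', 'Sussex', '.']
trigrams = make_ngrams(sent, 3)
print(len(trigrams))  # 9
print(trigrams[0])    # ('The', 'family', 'of')
print(trigrams[-1])   # ('in', 'Sussex', '.')
```

A sequence of length N always yields N - n + 1 n-grams, which is why the 11-token sentence gives 9 trigrams.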

### Building dictionaries

We've already worked with strings and lists. Another kind of data structure in Python is a dictionary.
Here is how a simple dictionary works:

In [141]:
# create a dictionary
commonwords = {'the': 4023, 'of': 3809, 'a': 3098}
# search the dictionary for 'of'
commonwords['of']

3809

In [142]:
type(commonwords)

dict

The point of dictionaries is to store a key (the word) and a value (the count). When you ask for the key, you get its value.

Notice that you use curly braces for dictionaries, but square brackets for lists.
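Dictionaries are also handy for building counts as you go. The following sketch counts words in a short made-up list, then shows a few common operations: a safe lookup with `.get`, a membership test with `in`, and looping over the key and value pairs.

```python
# Start with an empty dictionary and add to it word by word
counts = {}
for word in ['the', 'cat', 'saw', 'the', 'dog']:
    # .get returns the current count, or 0 if the word is not there yet
    counts[word] = counts.get(word, 0) + 1

print(counts)                 # {'the': 2, 'cat': 1, 'saw': 1, 'dog': 1}
print(counts.get('bird', 0))  # 0, because 'bird' is not a key
print('cat' in counts)        # True

# Loop over the keys and values together
for word, n in counts.items():
    print(word, n)
```

Asking for a missing key with square brackets, such as `counts['bird']`, raises a `KeyError`, which is why `.get` with a default is often more convenient.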

### Finding duplicate ngrams

In [126]:
import operator
from collections import Counter
threshold = 2
ng = 3
testtext = tokens

#Create our ngrams, convert them to a list, then
#run a counter to count the number of entries for each unique list element
raw_grams = ngrams(testtext, ng)
listgrams = list(raw_grams)
counts = Counter(listgrams)
print(len(listgrams), len(counts))
#Create a regular dictionary, keeping only the n-grams that occur more than threshold times
D = {}
for k,v in counts.items():
    if v > threshold:
        D[k] = v
#Here is a way to sort a dictionary, based on the value (key=operator.itemgetter(1))
sorted_x = sorted(D.items(), key=operator.itemgetter(1), reverse=True)

7302 6540


In [127]:
sorted_x

[(('Project', 'Gutenberg-tm', 'electronic'), 18),
 (('the', 'Project', 'Gutenberg'), 15),
 (('Project', 'Gutenberg', 'Literary'), 13),
 (('Gutenberg', 'Literary', 'Archive'), 13),
 (('Literary', 'Archive', 'Foundation'), 13),
 (('Gutenberg-tm', 'electronic', 'works'), 12),
 (('the', 'Project', 'Gutenberg-tm'), 12),
 (('the', 'terms', 'of'), 12),
 (('terms', 'of', 'this'), 10),
 (('of', 'this', 'agreement'), 10),
 (('the', 'United', 'States'), 9),
 (('set', 'forth', 'in'), 8),
 (('.', 'If', 'you'), 8),
 (('of', 'Project', 'Gutenberg-tm'), 8),
 (('Project', 'Gutenberg-tm', 'License'), 8),
 (('of', 'the', 'Project'), 8),
 (('to', 'the', 'Project'), 7),
 (('Gutenberg-tm', 'electronic', 'work'), 6),
 (('this', 'agreement', ','), 6),
 (('terms', 'of', 'the'), 5),
 ((',', 'you', 'must'), 5),
 (('.', 'You', 'may'), 5),
 (('Project', 'Gutenberg', "''"), 5),
 (('``', 'Project', 'Gutenberg'), 5),
 (('.', 'I', 'have'), 5),
 (('full', 'Project', 'Gutenberg-tm'), 5),
 (('in', 'the', 'United'), 5),
 

This last bit of code is more advanced. Don't worry if you forget what every line means. If you are interested in getting more comfortable with Python, come to our [Python](https://github.com/resbaz/2015-12-14-Python-for-Researchers) course.
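For reference, the same idea can be written more compactly, because `Counter` already knows how to sort its entries by frequency. This is only a sketch: `repeated_ngrams` is a hypothetical function name, and the sample sentence is invented so that the example runs without NLTK.

```python
from collections import Counter

def repeated_ngrams(tokens, n, threshold):
    """Return (ngram, count) pairs seen more than threshold times, most frequent first."""
    grams = zip(*(tokens[i:] for i in range(n)))  # sliding window of length n
    counts = Counter(grams)
    return [(gram, c) for gram, c in counts.most_common() if c > threshold]

sample = ('the terms of this agreement and the terms of this licence '
          'and the terms of use').split()
result = repeated_ngrams(sample, 3, 1)
for gram, count in result:
    print(gram, count)
```

With the book's tokens you would call `repeated_ngrams(tokens, 3, 2)` to match the settings used above.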

# Web scraping using Beautiful Soup

The most important skill for using NLTK in your life as a researcher is going to be working with your own texts. First, let's look at reading in text directly from the web.

Of course, a lot of the text you're going to want to work with won't be in handy text files already. That's where a Python library called Beautiful Soup comes in.

*Note*: the ! is a way of accessing command line functions from the notebook. We could also do this in the terminal (without the !). 

In [156]:
!sudo pip3 install BeautifulSoup4
from urllib.request import urlopen

Cleaning up...


In [153]:
from bs4 import BeautifulSoup

In [154]:
url = "http://en.wikipedia.org/wiki/Smog"

In [160]:
raw = urlopen(url).read()
print(type(raw))
print(raw[100:200])

<class 'bytes'>
b'>Smog - Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className = docume'


Beautiful Soup breaks the single long string into its constituent parts, creating a `BeautifulSoup` object.

In [161]:
soup = BeautifulSoup(raw, 'html.parser')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [162]:
texts = []
for para in soup.find_all('p'):
    text = para.text
    texts.append(text)
print(texts[:10])

['Smog is a type of air pollutant. The word "smog" was coined in the early 20th century as a portmanteau of the words smoke and fog to refer to smoky fog.[1] The word was then intended to refer to what was sometimes known as pea soup fog, a familiar and serious problem in London from the 19th century to the mid 20th century. This kind of visible air pollution is composed of nitrogen oxides, sulfur oxides, ozone, smoke or particulates among others (less visible pollutants include carbon monoxide, CFCs and radioactive sources). Man-made smog is derived from coal emissions, vehicular emissions, industrial emissions, forest and agricultural fires and photochemical reactions of these emissions.', 'Modern smog, as found for example in Los Angeles, is a type of air pollution derived from vehicular emission from internal combustion engines and industrial fumes that react in the atmosphere with sunlight to form secondary pollutants that also combine with the primary emissions to form photochemi

In [164]:
import re
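# Raw string (r'...') so the backslashes reach the regex engine unchanged.
# The pattern matches citation markers such as [1] or [23].
regex = re.compile(r'\[[0-9]*\]')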
joined_texts = '\n'.join(texts)
joined_texts = re.sub(regex, '', joined_texts)
print(type(joined_texts))
print(joined_texts[:100])

<class 'str'>
Smog is a type of air pollutant. The word "smog" was coined in the early 20th century as a portmante
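It can help to try a regular expression on a short string before running it over a whole page. The sketch below compares the pattern used above with a slightly stricter variant; the sample sentence is made up for the test. The only difference is `*` (zero or more digits) versus `+` (one or more digits).

```python
import re

sample = 'smoky fog.[1] The word was then used widely.[23] See [note] and [].'

# Pattern from the cell above: zero or more digits between square brackets
loose = re.sub(r'\[[0-9]*\]', '', sample)
print(loose)   # smoky fog. The word was then used widely. See [note] and .

# Stricter variant: at least one digit is required, so '[]' is left alone
strict = re.sub(r'\[[0-9]+\]', '', sample)
print(strict)  # smoky fog. The word was then used widely. See [note] and [].
```

Neither pattern touches `[note]`, because letters are not in the `[0-9]` character class.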


In order to work on the text, the first step is to tokenise it into words.

In [165]:
import nltk
wordlist = nltk.word_tokenize(joined_texts)
wordlist[:8]

['Smog', 'is', 'a', 'type', 'of', 'air', 'pollutant', '.']

For some other types of analysis, we'll need to create an NLTK text object

In [166]:
good_text = nltk.Text(wordlist)
good_text.concordance('smog')

Displaying 25 of 39 matches:
                                     Smog is a type of air pollutant . The wor
                                     smog '' was coined in the early 20th cent
and radioactive sources ) . Man-made smog is derived from coal emissions , veh
eactions of these emissions . Modern smog , as found for example in Los Angele
mary emissions to form photochemical smog . In certain other cities , such as 
rtain other cities , such as Delhi , smog severity is often aggravated by stub
fe or death . Coinage of the term `` smog '' is generally attributed to Dr. He
 clouds of smoke that contributes to smog . Air pollution from this source has
 , as witnessed by the 2013 autumnal smog in Harbin , China , which closed roa
 major ingredient in the creation of smog in some large cities . The major cul
 ozone , and particles that comprise smog . Photochemical smog is the chemical
s that comprise smog . Photochemical smog is the chemical reaction of sunlight
active and oxidizing . 

And once we've done all that work creating clean text, it's a good idea to save it for later.

In [167]:
%cd
! mkdir smog
%cd smog

/home/researcher
mkdir: cannot create directory 'smog': File exists
/home/researcher/smog


In [168]:
NLTK_file = open("NLTK-Smog.txt", "w", encoding='UTF-8')
NLTK_file.write(str(wordlist))
NLTK_file.close()

In [169]:
text_file = open("Smog-text.txt", "w", encoding='UTF-8')
text_file.write(joined_texts)
text_file.close()

In [171]:
joined_texts[2450:2471]

'of this type is still'

Now have a look at the two files you've created in the file management system. Open them. How is the nltk file different from the .txt file?
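Writing `str(wordlist)` stores the list exactly as Python would print it, brackets and quotes included, so it takes extra work to turn the file back into a list later. If you want to reload your tokens, one alternative is to save them as JSON. The sketch below uses only the standard library; the file name and the temporary folder are placeholders chosen for the demo.

```python
import json
import os
import tempfile

wordlist = ['Smog', 'is', 'a', 'type', 'of', 'air', 'pollutant', '.']

folder = tempfile.mkdtemp()                       # throwaway folder for the demo
path = os.path.join(folder, 'smog-tokens.json')   # hypothetical file name

# "with" closes the file automatically, even if an error occurs
with open(path, 'w', encoding='utf-8') as handle:
    json.dump(wordlist, handle)

# Read the tokens back in as a real Python list
with open(path, 'r', encoding='utf-8') as handle:
    reloaded = json.load(handle)

print(reloaded == wordlist)  # True
```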

**Challenge!**
* Find a webpage of interest to your studies and use Beautiful Soup to extract the text
* Tokenise the text
* Find the most common words in your text (Extension: remove the stop words)
* Find trigrams in your text 
* Save your text to a text file

*Hint*: feel free to collude with your neighbours and please copy and paste our previous code! Copying and pasting are essential skills of developers, as well as googling error messages (seriously!). If you don't believe me, ask a computer scientist. 