<br>
<img style="float:left" src="http://ipython.org/_static/IPy_header.png" />
<br>

# Session 2: Loading text, tokenisation, tagging, dictionaries and ngrams

In [None]:
from __future__ import print_function, division

import sys
import nltk
from IPython.display import display, clear_output
sys.path.append("/usr/lib/python2.7/site-packages/")
%matplotlib inline

In [None]:
from nltk.book import *

**Welcome back!**

So, what did we learn yesterday? A brief recap:

* The **IPython** Notebook
* **Python**: syntax, variables, functions, etc.

Today's focus will be on **developing more advanced NLTK skills** and using these skills to **investigate our own data**. 

*Any questions or anything before we dive in?*

## Dirty data

Now that we're going beyond NLTK's example data, we're bound to run into dirty data.

A common part of corpus building is corpus cleaning. Reasons for cleaning include:

1. Avoiding breaking the code with unexpected input
2. Ensuring that searches match as many examples as possible
3. Increasing readability and the accuracy of taggers, stemmers, parsers, etc.

The level and kind of cleaning depends on your data, the aims of your project and where you are in your research. In the case of very clean data (lucky you!), there may be little that needs to be done. With messy data, you may need to go as far as correcting variant spellings (as in online conversation or very old books).

If you need help with data cleaning, we offer training in [OpenRefine](https://github.com/yuandra/2016-02-01-data-acquisition-cleaning/blob/gh-pages/open-refine-01-intro.md).

### Discussion

*What are the characteristics of clean and messy data? Any personal experiences? Discuss with your neighbours.* 

It will be important to bear these characteristics in mind once you start building your own datasets and corpora. 

## Uploading text files

First of all, let's load in our text.

Search for Project Gutenberg and download a book as a plain text file.

I chose [A Modest Proposal](https://www.gutenberg.org/ebooks/1080).

We can also look at file contents within the IPython Notebook itself:

The books we were working with yesterday had already had some processing done on them so that we could use NLTK to find features of the language. Remember that Python regards a text file as a single long string of characters. The first thing to do is to start breaking the text up into sentences and words.
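
As a minimal sketch of that point, assuming Python 3: the snippet below writes a short placeholder text to a temporary file and reads it back. The file name `sample_book.txt` and the sample text are made up for illustration; in practice you would open the file you downloaded.

```python
import os
import tempfile

# Placeholder text standing in for a downloaded book (an assumption for this sketch).
sample = "This is a placeholder for the text of your book.\nIt has two lines."

# Write the sample to a temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample_book.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Reading the file back gives one long string, line breaks included.
with open(path, encoding="utf-8") as f:
    raw = f.read()

print(type(raw))      # the whole file is a single str object
print(len(raw))       # number of characters, not words
print(repr(raw[:60])) # repr() makes the newline character visible
```

Because `raw` is a single string, `len(raw)` counts characters. Counting words only becomes possible after tokenising.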

**Challenge!**

1. Find a .txt file from Project Gutenberg or elsewhere and upload it to the Jupyter Notebook. 
2. Use `word_tokenize` to break the text up into tokens. 
3. Print the first 100 tokens.
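
To get a feel for what tokens look like before you try the challenge, here is a rough sketch that uses a regular expression instead of NLTK. It is only an approximation: NLTK's `word_tokenize` makes different decisions (for example, it splits contractions such as *don't* into two tokens), so expect your results to differ. The sample sentence is made up.

```python
import re

# A made-up sample; with your own file this would be the string you read in.
raw = "Don't panic: the text is just one long string, isn't it?"

# Match either a word (optionally with an internal apostrophe)
# or a single punctuation character.
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", raw)

# Slicing with [:100] is safe even when there are fewer than 100 tokens.
print(tokens[:100])
print(len(tokens))
```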

Breaking a speech into tokens lets us do the sort of word counting that we were doing yesterday on the speeches. We can do some more interesting linguistic analysis if we use Part of Speech tagging. NLTK has a number of different Part of Speech tagsets that we could use, but the simplest one is called 'Universal', and we'll use that here.

Part of Speech tagging creates pairs: it associates each word with its tag in a two-item tuple, which appears in brackets in the tagger's output.
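
The shape of that output can be sketched with a small hand-tagged list. The words and tags below are typed in by hand as a stand-in for real tagger output, using tag names from the Universal tagset.

```python
# Hand-tagged toy data (an assumption), shaped like tagger output:
# a list of (word, tag) tuples.
tagged = [
    ("The", "DET"),
    ("children", "NOUN"),
    ("eat", "VERB"),
    ("bread", "NOUN"),
    (".", "."),
]

# Each item is a two-item tuple, so it can be unpacked into two names.
first_word, first_tag = tagged[0]
print(first_word, first_tag)

# The same unpacking works inside a for loop.
for word, tag in tagged:
    print(word, "->", tag)
```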

**Challenge!**

Use Part of Speech tagging to tag the text that we have just tokenised, then do the following:
* Find the most common parts of speech
* Find the most common verbs and create a frequency distribution graph of your result
* Find the 10 most common nouns in the text

*Hint: to find the most common verbs and nouns, you will need to create a list that contains only the verbs or only the nouns from the speech. Use a for loop to create your list. Then create a frequency distribution*
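
The pattern in the hint can be sketched on toy data. The tagged list below is typed in by hand, and `collections.Counter` from the standard library stands in for a frequency distribution (NLTK's `FreqDist` also provides `most_common`, plus a `plot` method for the graph).

```python
from collections import Counter

# Hand-tagged toy data (an assumption), standing in for your tagged text.
tagged = [
    ("The", "DET"), ("poor", "ADJ"), ("children", "NOUN"), ("eat", "VERB"),
    ("bread", "NOUN"), ("and", "CONJ"), ("the", "DET"), ("children", "NOUN"),
    ("sleep", "VERB"), (".", "."),
]

# Most common parts of speech: count the tags and ignore the words.
tag_counts = Counter(tag for word, tag in tagged)

# Build a list containing only the nouns, using a for loop as in the hint.
nouns = []
for word, tag in tagged:
    if tag == "NOUN":
        nouns.append(word.lower())

# Then count how often each noun occurs.
noun_counts = Counter(nouns)

print(tag_counts.most_common(3))
print(noun_counts.most_common(10))
```

Swapping `"NOUN"` for `"VERB"` gives the verb list in the same way.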

**Extension**
There are a few things to note about this result: Project and Gutenberg have been returned as two different, very frequent nouns. Because we're humans, not computers, we know it's likely that they often occur together. We could test for bigrams (pairs of words that occur side by side) to see if this is the case. 

In order to perform this test, we must first convert our list of tokens into an NLTK text. We can then use specific NLTK functions on the text.
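
Setting the NLTK text object aside, the underlying idea can be sketched in plain Python: pair each token with the one that follows it, then count how often one particular pair occurs. The token list here is made up for illustration.

```python
# Made-up tokens (an assumption), standing in for the tokenised book.
tokens = ("Project Gutenberg offers free books and "
          "Project Gutenberg relies on volunteers").split()

# zip() pairs each token with its right-hand neighbour, giving all bigrams.
pairs = list(zip(tokens, tokens[1:]))

# How often is "Project" immediately followed by "Gutenberg"?
followed = sum(1 for a, b in pairs if a == "Project" and b == "Gutenberg")
total = tokens.count("Project")

print(followed, "of", total, "occurrences of 'Project' are followed by 'Gutenberg'")
```

If nearly every occurrence of the first word is followed by the second, the two "frequent nouns" are really one name.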

### Some linguistics...

*Functional linguistics* is a research area concerned with how *realised language* (lexis and grammar) works to achieve meaningful social functions.

One functional linguistic theory is *Systemic Functional Linguistics*, developed by Michael Halliday (Prof. Emeritus at University of Sydney).

Central to the theory is a division between **experiential meanings** and **interpersonal meanings**.

* Experiential meanings communicate what happened to whom, under what circumstances.
* Interpersonal meanings negotiate identities and role relationships between speakers.

Halliday argues that these two kinds of meaning are realised **simultaneously** through different parts of English grammar.

* Experiential meanings are made through **transitivity choices**.
* Interpersonal meanings are made through **mood choices**.


Transitivity choices include fitting together configurations of:

* Participants (*a man, green bikes*)
* Processes (*sleep, has always been, is considering*)
* Circumstances (*on the weekend*, *in Australia*)

Mood features of a language include:

* Mood types (*declarative, interrogative, imperative*)
* Modality (*would, can, might*)
* Lexical density: the number of words per clause, the ratio of content to non-content words, etc.

Lexical density is usually a good indicator of the general tone of texts. The language of academia, for example, often has a high ratio of nouns to verbs. We can approximate an academic tone simply by making nominally dense clauses: 

      The consideration of interest is the potential for a participant of a certain demographic to be in Group A or Group B.

Notice how not only are there many nouns (*consideration*, *interest*, *potential*, etc.), but that the verbs are very simple (*is*, *to be*).

In comparison, informal speech is characterised by smaller clauses, and thus more verbs.

      A: Did you feel like dropping by?
      B: I thought I did, but now I don't think I want to

Here, we have only a few, simple nouns (*you*, *I*), with more expressive verbs (*feel*, *dropping by*, *think*, *want*).
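
One rough way to put a number on this contrast is a noun-to-verb ratio computed from tagged text. The sketch below assumes Python 3 and uses hand-tagged fragments of the two examples; the tags are typed in by hand, not produced by a tagger. Because the Universal tagset labels *I* and *you* as `PRON` while the discussion here treats them as nouns, the hypothetical helper `noun_verb_ratio` counts both tags as nominal. That is a choice made for this illustration.

```python
def noun_verb_ratio(tagged, noun_tags=("NOUN", "PRON"), verb_tags=("VERB",)):
    """Return nominal words per verb in a list of (word, tag) pairs."""
    nouns = sum(1 for _, tag in tagged if tag in noun_tags)
    verbs = sum(1 for _, tag in tagged if tag in verb_tags)
    # Avoid dividing by zero when a fragment has no verbs at all.
    return nouns / verbs if verbs else float("inf")

# Hand-tagged opening of the academic-sounding sentence (an assumption).
formal = [
    ("The", "DET"), ("consideration", "NOUN"), ("of", "ADP"), ("interest", "NOUN"),
    ("is", "VERB"), ("the", "DET"), ("potential", "NOUN"), ("for", "ADP"),
    ("a", "DET"), ("participant", "NOUN"),
]

# Hand-tagged version of speaker B's reply (an assumption).
informal = [
    ("I", "PRON"), ("thought", "VERB"), ("I", "PRON"), ("did", "VERB"),
    ("but", "CONJ"), ("now", "ADV"), ("I", "PRON"), ("do", "VERB"),
    ("n't", "ADV"), ("think", "VERB"), ("I", "PRON"), ("want", "VERB"),
    ("to", "PRT"),
]

print(noun_verb_ratio(formal))    # more nominal words per verb
print(noun_verb_ratio(informal))  # fewer nominal words per verb
```

A ratio like this is only a crude proxy for lexical density, but it is easy to compute over a whole tagged corpus.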

> **Note**: SFL argues that through *grammatical metaphor*, one linguistic feature can stand in for another. *Would you please shut the door?* is an interrogative, but it functions as a command. *Invitation* is a nominalisation of a process, *invite*. We don't have time to deal with these kinds of realisations, unfortunately.

In the context of Fraser's speech, there are nearly twice as many nouns as verbs, and the verbs are generally quite simple ones (forms of *to be* and *to have* make up about a quarter). This suggests that Fraser's speech, even when giving a radio talk to his electorate, is more towards the formal end of the spectrum. 

## Recap
So far today we have:
* Imported text into NLTK
* Tokenised raw text into words
* Tagged words as parts of speech
* Converted a list into NLTK Text for further analysis

## Stopwords
Yesterday, when we did our frequency counts of the books in the NLTK Library, we noticed that a lot of space was taken up by little words like 'and', 'of' and 'the', which don't add a lot to our understanding of the text. These are called 'stop words'. It will help our analysis if we exclude them.

The list we have now is probably more interesting if we want to get a sense of the key issues in the text. Note that we're working with a very small sample here. This sort of analysis is much more useful over really big corpora.

*Note: We could have condensed the first two steps into a single line of code that looked like this:*

        unstopped = [word for word in speech if word.lower() not in stopwords.words('english') and word.isalpha()]
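
Here is a self-contained sketch of the same filter on toy data. The stop word set is a tiny hand-written stand-in for NLTK's English list, and the tokens are made up. Building the set once, outside the comprehension, also avoids repeating the `stopwords.words('english')` call for every token, which is what the one-liner above does.

```python
# Tiny hand-written stop word set (an assumption); NLTK's list is much longer.
stop_words = {"the", "of", "and", "a", "to", "in", "is", "are", "not"}

# Made-up tokens standing in for the tokenised speech.
speech = ["The", "cost", "of", "living", "is", "rising", ",",
          "and", "wages", "are", "not", "."]

# Keep a token only if it is alphabetic and its lowercase form is not a stop word.
unstopped = [word for word in speech
             if word.lower() not in stop_words and word.isalpha()]

print(unstopped)
```

Using a set rather than a list makes each membership test fast, which matters once the text runs to many thousands of tokens.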

## Collocation
We've just used collocation to test a hypothesis about the most common nouns in the speech we were investigating. Collocation can be quite a powerful tool for finding features of language.

First, let's look for bigrams in the whole list of tokens:

That doesn't tell us much. Let's try again with 'unstopped', our list of tokens with the punctuation and stopwords removed:

As well as identifying collocations (words that appear near each other), we can also look for n-grams or clusters, which appear immediately adjacent to each other. Repeated N-grams are a good way to get a sense of what a text is about. First, let's see how n-grams are created:
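
A minimal sketch in plain Python: slide a window of `n` tokens along the list, one position at a time. The helper name `make_ngrams` and the sentence are made up for illustration; `nltk.ngrams(tokens, n)` yields the same kind of tuples.

```python
def make_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    # The last window starts at index len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A made-up example sentence, split on whitespace.
sentence = "The quick brown fox jumps over the lazy dog".split()

trigrams = make_ngrams(sentence, 3)

for gram in trigrams:
    print(gram)
```

A sentence of nine tokens gives seven trigrams, since each window overlaps the previous one by two tokens.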

There are a lot of trigrams in the sentence, and they don't tell us much. It's when n-grams are repeated that they start to get interesting, but before we write the code for that we need to have some knowledge of dictionaries...

### Building dictionaries

We've already worked with strings and lists. Another kind of data structure in Python is a dictionary.
Here is how a simple dictionary works:

The point of dictionaries is to store a key (the word) and a value (the count). When you ask for the key, you get its value.

Notice that you use curly braces for dictionaries, but square brackets for lists.
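
A short sketch of those points, with made-up words and counts:

```python
# Curly braces create a dictionary: each key (a word) maps to a value (a count).
counts = {"the": 3, "fox": 1}

# Square brackets look up the value stored under a key.
print(counts["the"])

# Assigning to a new key adds it; assigning to an existing key updates it.
counts["dog"] = 2
counts["fox"] += 1

# .get() returns a default instead of raising an error for a missing key.
print(counts.get("cat", 0))

# For comparison, a list uses square brackets and is indexed by position.
words = ["the", "fox", "dog"]
print(words[0])

print(counts)
```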

### Finding duplicate ngrams

This last bit of code is more advanced. Don't worry if you forget what every line means. If you are interested in getting more comfortable with Python, come to our [Python](https://github.com/resbaz/2015-12-14-Python-for-Researchers) course.
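
One possible way to do this, sketched on a made-up token list, is to use each trigram as a dictionary key and its count as the value, then keep only the trigrams seen more than once. This is an illustration of the idea, not necessarily the exact code used in the session.

```python
# Made-up tokens (an assumption), standing in for your own text.
tokens = "it is a truth that it is a fact and it is a pity".split()

counts = {}
for i in range(len(tokens) - 2):
    # Tuples can be dictionary keys; lists cannot.
    gram = tuple(tokens[i:i + 3])
    # .get() supplies 0 the first time a trigram is seen.
    counts[gram] = counts.get(gram, 0) + 1

# Keep only the trigrams that occur more than once.
repeated = {gram: n for gram, n in counts.items() if n > 1}

print(repeated)
```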

# Web scraping using Beautiful Soup

The most important skill for using NLTK in your life as a researcher is going to be working with your own texts. First, let's look at reading in text files directly from the web.

Of course, a lot of the text you're going to want to work with won't be in handy text files already. That's where a Python library called Beautiful Soup comes in.

*Note*: the ! is a way of accessing command line functions from the notebook. We could also do this in the terminal (without the !). 

Beautiful Soup breaks the single long string into its constituent parts, creating a `BeautifulSoup` object.

In order to work on the text, the first step is to tokenise it into words.

For some other types of analysis, we'll need to create an NLTK text object

And once we've done all that work creating clean text, it's a good idea to save it for later.
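
A minimal sketch of saving and reloading, assuming Python 3. The file name `clean_text.txt` and the tokens are placeholders, and a temporary directory is used so the example does not touch your own files.

```python
import os
import tempfile

# Made-up cleaned tokens (an assumption), standing in for your own text.
tokens = ["Clean", "text", "is", "worth", "keeping", "."]

# Join the tokens with spaces and write them out as plain text.
out_path = os.path.join(tempfile.mkdtemp(), "clean_text.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(" ".join(tokens))

# Later, read the file back and split on whitespace to recover the tokens.
with open(out_path, encoding="utf-8") as f:
    reloaded = f.read().split()

print(reloaded)
```

Splitting on whitespace only recovers the original list when no token contains a space, which holds for the word tokens used here.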

Now have a look at the two files you've created in the file management system. Open them. How is the nltk file different from the .txt file?

**Challenge!**
* Find a webpage of interest to your studies and use Beautiful Soup to extract the text
* Tokenise the text
* Find the most common words in your text (Extension: remove the stop words)
* Find trigrams in your text 
* Save your text to a text file

*Hint*: feel free to collude with your neighbours and please copy and paste our previous code! Copying and pasting are essential skills of developers, as well as googling error messages (seriously!). If you don't believe me, ask a computer scientist. 