# Introduction to NLTK

Use computers to identify patterns in language and textual data

## Orientation: Where am I?
<img src='res/launch_1.jpg'>


<i>Credits: Kerbal space program: Falcon 9 Space X</i>

## Our command module: Interactive python (IPython) with Jupyter Notebooks


Let's play around with our environment:
- Set up and get ready with [Anaconda](https://www.continuum.io/downloads). It's free.
- What is a notebook?
- What can I do in the cells?
- Most used features

### Challenge: Explore Jupyter notebooks
- Open the notebook [material](http://a.com) or make your own copy in the cloud
- Add 5 cells to the notebook
- Delete 3 cells
- Type "Hello world" inside the last empty cell and run it (we will learn how to put more things inside next!)
- Move the cell to be the first


Now that we know where we stand, a few comments:

1. The main strength of IPython is that you can run bits of code individually, so you don't have to keep repeating things. For example, if you scroll up to the last function and replace the 50 with 2, you can re-run that code and get the new answer. 
2. IPython allows you to display images alongside code, and to save the input and output together.
3. IPython makes learning a bit easier, as mistakes are easier to find and do not break an entire workflow.

### Wait.. what is python?

<img src='res/missionpython_cover.png'>

Python is an easy-to-use programming language that comes with handy, efficient tools for manipulating linguistic data. We will learn just the basics needed to perform reproducible research.

## What is the Natural Language Toolkit?
<img src='res/NLTK.png'>

NLTK is a Python Library for working with written language data. 


NLTK is free and extensively documented [here](http://www.nltk.org/).
> Note: NLTK provides tools for tasks ranging from very simple (counting words in a text) to very complex (writing and training parsers, etc.)

We will start by importing NLTK, setting a path to NLTK resources, and downloading some additional data.

In [1]:
import nltk

We need some data to experiment with... 

In [None]:
from nltk.book import *  
# asterisk means 'everything'

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail


In [None]:
texts()

## Variables
We use variables to name temporary data in the computer.
Think of it as a nickname so we can use our data in other parts of the notebook.
Try to assign a variable to our text "hello world!".
>Hint: Don't forget quotes when writing text.

```python
greetings='Hello world!'
```


In [None]:
greetings='Hello world!!'

Now, call your variable by the nickname:

In [None]:
# call the variable greetings
greetings

In [None]:
# call the variable text5
text5

We can also assign numbers to variables:

In [None]:
age=30

It is good to choose meaningful variable names to remind you (and anyone else who reads your Python code) what your code is meant to do. Python does not try to make sense of the names; it blindly follows your instructions, and does not object if you do something confusing, such as ```one = 'two'``` or ```two = 3```. The only restriction is that a variable name cannot be any of Python's reserved words, such as def, if, not, and import. If you use a reserved word, Python will produce a syntax error. An ordinary name such as word works fine:

In [None]:
word="Hello world!"
word
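
If you are unsure whether a name is reserved, Python's standard `keyword` module can tell you. The following is a minimal sketch; the candidate names in the list are made up for illustration.

```python
import keyword

# Candidate variable names (hypothetical, chosen for this example)
candidates = ['greetings', 'age', 'not', 'import', 'word']

# keyword.iskeyword() is True for names that Python reserves for itself
reserved = [name for name in candidates if keyword.iskeyword(name)]
usable = [name for name in candidates if not keyword.iskeyword(name)]

print('Reserved:', reserved)
print('Usable:', usable)
```

Only the names in the second list can be used on the left-hand side of an assignment.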

Variables can give a name to a lot of things. We call these things **objects**, Python’s abstraction for data. All data in a Python program is represented by objects or by relations between objects.
We will work on these objects with operations that we call **methods** and **functions**.

## Python as a calculator
We can use the IPython environment as a calculator; try doing some basic mathematics with Python. *Hint*: use * and / like on your smartphone.

In [None]:
5+10

In [None]:
age*5

You can combine objects using **operators**.
Operators are important, and we can do more with them than multiply. We can ask IPython whether two things are equal (**==**) or not equal (**!=**), among other comparisons. Try it!

In [None]:
5==5

In [None]:
'Hello'=='Hello'

## Methods and functions

<img src='res/methods_functions.jpg'>

<i>Credits: Kerbal space program: Falcon 9 Space X</i>

The syntax we'll use the most involves two types of commands, **functions** and **methods**: functions look like this ```len()``` and methods look like this ```.count()```

Both need an object (text data in our case) to work on; for example, ```len(text1)``` or ```text1.count("Whale")```.
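
To see the difference without loading a whole book, here is a small sketch that uses an ordinary Python list as a stand-in for an NLTK text. The tokens are made up for illustration.

```python
# A tiny stand-in for an NLTK text (hypothetical tokens)
tokens = ['the', 'whale', 'saw', 'the', 'other', 'whale', '.']

# A function takes the object as an argument, inside the parentheses
total = len(tokens)

# A method is attached to the object with a dot
whales = tokens.count('whale')

print(total, whales)
```

The same two commands work on `text1` once the NLTK book texts are loaded.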

## Python basics: A summary

- Syntax = the set of rules that define how Python is written
- Python is designed to be highly readable
- It uses English keywords, which are easy to understand
- Besides "objects", we will start using "functions" and "methods": pieces of code already written that we can reuse



## Quick start: Let computers do the reading and count


### Exploring vocabulary:  Useful functions

NLTK makes it really easy to get basic information about the size of a text and the complexity of its vocabulary using **python functions**.
*Please* note that all these commands use the same *syntax*; this is the first python syntax we'll learn.

```len(text1)``` gives the number of symbols or 'tokens' in your text. This is the total number of words and items of punctuation.

In [None]:
len(text3)

```set(text3)``` gives you a collection of all the tokens in the text, without the duplicates. Hence, ```len(set(text3))``` will give you the total number of unique tokens. Remember this still includes punctuation.

In [None]:
len(set(text3))

```sorted(text4)``` places items in the list into alphabetical order, with punctuation symbols and capitalised words first.

In [None]:
sorted(set(text3))[140:150]

#### Challenge: Lexical richness

We can investigate the *lexical richness* of a text. For example, by dividing the total number of words by the number of unique words, we can see the average number of times each word is used. 

For this challenge you will have to combine your knowledge of the syntax we've learnt so far and IPython's mathematical abilities.

Have a go at calculating the lexical richness of text3.

In [None]:
len(text3)/len(set(text3))
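
If you want to reuse this calculation, one option is to wrap it in a small function. This is only a sketch: the function name `lexical_richness` and the sample tokens are our own choices, not part of NLTK.

```python
def lexical_richness(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

# Six tokens, four of them distinct (made-up sample)
sample = ['to', 'be', 'or', 'not', 'to', 'be']
richness = lexical_richness(sample)
print(richness)
```

Calling `lexical_richness(text3)` should give the same number as the cell above.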

We can use methods of the **text** object to count words:

In [None]:
text3.count("God")

More generally, ```vocab()``` counts every token in the text for us:

In [None]:
text3.vocab()

#### Challenge: Percentage taken by a word in a text
Store in a variable the number of times the word "sea" appears in text3. Then calculate the percentage of the whole text taken up by this word.

In [None]:
sea_count=text3.count("sea")
100*sea_count/len(text3)

### Exploring text - useful methods to search inside text
NLTK has useful methods that help us search inside a text.


**Concordance** shows you a word in context and is useful if you want to be able to discuss the ways in which a word is used in a text. 

In [None]:
text1.concordance("monstrous")

In [None]:
text4.concordance('health')
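
To get a feel for what a concordance does, here is a rough sketch in plain Python. It is not NLTK's implementation: the function `simple_concordance`, its `window` parameter and the sample sentence are all assumptions made for illustration, and it uses loops and conditions that are explained later in this notebook.

```python
def simple_concordance(tokens, word, window=3):
    """Return each occurrence of word with `window` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        # Compare in lowercase so that 'Monstrous' and 'monstrous' both match
        if token.lower() == word.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(' '.join(left + [token] + right))
    return lines

# Made-up sample text, already split into tokens
tokens = 'The whale was a monstrous size , a most monstrous creature indeed'.split()
for line in simple_concordance(tokens, 'monstrous'):
    print(line)
```

Each printed line shows the search word surrounded by its neighbours, which is the idea behind the concordance view.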

**Similar** will find words used in similar contexts; it is not looking for synonyms, although the results may include synonyms

In [None]:
text2.similar('man')

In [None]:
text4.similar('people')

**Common contexts** allows us to examine just the contexts that are shared by two or more words, such as monstrous and very. We have to enclose these words by square brackets as well as parentheses, and separate them with a comma:

In [None]:
text2.common_contexts(["monstrous", "very"])  # this method takes one argument: a list of words

In [None]:
text4.common_contexts(['health','war'])


We can also find words that typically occur together, which tend to be very specific to a text or genre of texts. A **collocation** is a sequence of words that occur together unusually often

In [None]:
text2.collocations()

In [None]:
text3.collocations()
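
A simple way to build intuition for collocations is to count pairs of neighbouring words (bigrams). The sketch below does only that, with a made-up sentence; NLTK's ```collocations()``` goes further and also takes into account how common each word is on its own, so treat this as an approximation of the idea rather than the same method.

```python
from collections import Counter

# Made-up sample text, already split into tokens
tokens = 'the LORD God said unto the LORD God of the earth'.split()

# Pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))

# Count how often each pair occurs
pair_counts = Counter(bigrams)
print(pair_counts.most_common(2))
```

Pairs that appear more than once, such as `('LORD', 'God')` here, are candidates for collocations.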

#### Challenge: Text exploration

1. Find the collocations in the Inaugural Address text. 

2. Choose one of the words to concordance. 

3. Investigate how the word is used. What words are used similarly? 

4. And what are the common contexts of these words? 

5. Report your findings to the person next to you. 

6. Do the same with the chat forum.

In [None]:
text4.collocations()

In [None]:
text4.concordance('citizens')

In [None]:
text4.similar('citizens')

In [None]:
text4.common_contexts(['citizens','government'])

### Exploring text: Plotting dispersion of words
If we can find words in a text, we can also take note of their position within the text. Python lets you create graphs to analyze textual data.
We can then generate a **dispersion plot** that shows where given words occur in a text.

In [None]:
text1.dispersion_plot(words=['sea','whale'])

In [None]:
text1.count('whale')

In [None]:
text1.count('sea')

In [None]:
# different roles played by the male and female protagonists in Sense and Sensibility
text2.dispersion_plot(['Elinor', 'Marianne', 'Edward', 'Willoughby'])

In [None]:
text5.dispersion_plot([":)", ":("])
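
Behind a dispersion plot there is just a list of positions for each word. The following sketch computes those positions in plain Python; the function name `offsets` and the sample sentence are our own assumptions, and the plotting step is left out.

```python
def offsets(tokens, word):
    """Positions (counting from 0) at which word occurs in tokens."""
    return [i for i, token in enumerate(tokens) if token == word]

# Made-up sample text, already split into tokens
tokens = 'the sea was calm until the whale rose from the sea'.split()
sea_positions = offsets(tokens, 'sea')
whale_positions = offsets(tokens, 'whale')
print(sea_positions)
print(whale_positions)
```

Each position becomes one mark on the horizontal axis of the plot, with one row per word.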

#### Challenge: Dispersion of words in a text

Create a dispersion plot for the terms "democracy", "freedom", "America", "Government", "peace", "war", "happiness" and "fear" in the Inaugural Address corpus.
What do you think it tells you? 

In [None]:
text4.dispersion_plot(["democracy", "freedom", "America","Government","peace","war","happiness",'fear'])

## Data structures: Texts as Lists of Words
Python treats a text as a long list of words. First, we'll make some lists of our own, to give you an idea of how a list behaves.


In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']
# Note we use Square brackets here to define our list

In [None]:
sent1

In [None]:
sent1.count('me')

A few things to consider...
<br>
<img src='res/lists.jpg'>
*Credit: Head First Python by Paul Barry*

You can add lists together, creating a new list containing all the items from both lists. You can do this by typing out the two lists or you can add two or more pre-defined lists. This is called concatenation.

In [None]:
sent2=['jack','is','playing','with','the','ball']

In [None]:
sent1+sent2

You can think of text as a concatenation of sentences, and sentences as a concatenation of words

What if we want to add a single item to a list? This is known as appending. When we append() to a list, the list itself is updated as a result of the operation.

In [None]:
sent2.append('tomorrow')
sent2
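
The difference between concatenating and appending is easy to miss, so here is a short sketch with two made-up lists: `+` builds a new list and leaves the originals alone, while `append()` changes the list it is called on.

```python
first = ['Call', 'me']
second = ['Ishmael', '.']

# Concatenation builds a new list; first and second are unchanged
combined = first + second

# append() changes the list in place
first.append('maybe')

print(combined)
print(first)
```

Note that `combined` does not pick up the word appended to `first` afterwards, because it is a separate list.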

#### Indexing Lists
We can navigate this list with the help of indexes. Just as we can find out the number of times a word occurs in a text, we can also find where a word first occurs. We can navigate to different points in a text without restriction, so long as we can describe where we want to be.

In [None]:
print(text4.index('awaken')) # print is a Python function we can use to show the result of an operation

This works in reverse as well. We can ask Python for the item at index 158 in our list (note that we use square brackets here, not parentheses). Because counting starts at zero, this is the 159th item.

In [None]:
print(text4[158])

As well as pulling out individual items from a list, indexes can be used to pull out selections of text from a large corpus to inspect. We call this **slicing**.

In [None]:
print(text5[16715:16735])

If we're asking for the beginning or end of a text, we can leave out the first or second number. For instance, [:5] will give us the first five items in a list, while [8:] will give us everything from index 8 (the ninth item) to the end.

In [None]:
print(text2[:10])
print(text4[145700:])

To help you understand how indexes work, let's create a small list of our own.
We start by choosing a name for the list and then add the items. You probably won't build lists by hand in your own work, but you may want to manipulate them in other ways. Pay attention to the quote marks and commas when you create your test sentence.

In [None]:
sent = ['The', 'quick', 'brown', 'fox']
print(sent[0])
print(sent[2])

Note that the first element in the list has index zero. This is because we are telling Python to go zero steps forward in the list. If we use an index that is too large (that is, we ask for something that doesn't exist), we'll get an error.
We can modify elements in a list by assigning new data to one of its index values. We can also replace a slice with new material.

In [None]:
sent[2] = 'furry'
sent[3] = 'child'
print(sent)
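
Replacing a slice works the same way as replacing a single item, except that the new material is itself a list and may be longer or shorter than the slice it replaces. A minimal sketch, with replacement words chosen only for illustration:

```python
sent = ['The', 'quick', 'brown', 'fox']

# Replace the two middle words with three new ones; the list grows by one
sent[1:3] = ['very', 'sleepy', 'red']
print(sent)

# Asking for an index past the end raises an IndexError
try:
    sent[10]
except IndexError as error:
    print('Error:', error)
```

The `try` / `except` part is only there so that the error is shown without stopping the cell.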

#### Challenge: Lists (homework)

Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used earlier

In [None]:
sent = ['The', 'quick', 'brown', 'fox']
sent

## Strings: A useful object to store texts

A string is a sequence of characters; you can think of it as a list of characters. For example, we can assign a string to a variable, index a string, and slice a string.

In [None]:
name='Julia'

In [None]:
name[0]

We can also apply operators to strings:

In [None]:
name*5

We can join the words of a list to make a single string, or split a string into a list, as follows:

In [None]:
' '.join(['I','love', 'NLTK'])

In [None]:
'I love python'.split()

It is often helpful to normalize your text, for example by converting it to lowercase or uppercase:

In [None]:
full_name='Daniel Gil'
full_name.lower()

## Let computers do the repetitive work: Python Loops

We can use Python to automate tasks, such as performing a function on all items in a list. For instance, we could ask it to tell us the size of all the files in a directory. To do that, we'll have to teach the computer how to repeat things. We do this by creating something called a *loop*.

An example task that we might want to repeat is printing each character in a word on a line of its own. One way to do this would be to use a series of print statements:

In [None]:
word = 'Python'
print(word[0])
print(word[1])
print(word[2])
print(word[3])
print(word[4])
print(word[5])

What do you think are the downsides of doing this?

In [None]:
word = 'Monty'
print(word[0])
print(word[1])
print(word[2])
print(word[3])
print(word[4])
print(word[5])
# notice we are going to get an error: "string index out of range"

In [None]:
word = 'Python'
for char in word:
    print(char)

<img src='res/loop.jpg'>
<br>
<i>Credit: Head First Python by Paul Barry</i>

This is called a 'for loop'. Try it with other words.

The general form of a loop is:
```python
for variable in collection:
    # do this
    # do that
```
We can call the loop variable whatever we want, but there must be a colon at the end of the line starting the loop, and we must indent anything we want to run inside the loop. This is used to signify when the loop ends, instead of using a symbol to end the loop.

This is a loop that repeatedly updates a variable:

In [None]:
length = 0
for char in 'Python':
    length = length + 1
print('There are', length, 'letters in this word')

Let's go through this line by line.
The variable length is set to 0, so that Python starts counting from 0.
The second line opens a for loop that loops over the characters in 'Python'.
The third line tells Python to add one to the count. Since there are six characters in 'Python', the statement on line 3 will be executed six times.
The first time around, length is zero (the value assigned to it on line 1). The statement adds 1 to the old value of length, producing 1, and updates length to refer to that new value. The next time around, char is 'y' and length is 1, so length is updated to be 2. After four more updates, length is 6.
Since there is nothing left in 'Python', the loop finishes and the print statement on line 4 shows our final answer in between two strings.

Of course, we already know that we could use ```len()``` to find the length of a string, but it's just an example... Let's compare results:

In [None]:
len("Python")

Now for another example using a list! Fruit salad, yummy, yummy.

In [None]:
fruits = ['banana', 'apple', 'mango']
for fruit in fruits:
    print('Current fruit :', fruit)
print('Done!')  # not indented, so it runs once, after the loop has finished

#### Challenge: Loops (homework)
Define a list called Library with the 9 NLTK books in it. Write a for loop that prints the lexical diversity score of each book.

In [None]:
Library = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
for book in Library:
    score = len(book)/len(set(book))
    print(book, score)

## Frequency Distributions: Counting for analysis
We can use Python's ability to perform statistical analysis of data to do further exploration of vocabulary. For instance, we might want to be able to find the most common or least common words in a text. We'll start by looking at frequency distribution.

In [None]:
from nltk.probability import FreqDist
fdist1 = FreqDist(text1)

In [None]:
fdist1.most_common(10)

In [None]:
fdist1['like'] # number of occurrences of this word

In [None]:
fdist1.max() # the most frequent token (most_common(1) returns it together with its count)

In [None]:
fdist1.freq('a') # frequency of a word in the text

We can plot the frequency distribution to show results:

In [None]:
fdist1.plot(50,cumulative=False)
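
NLTK's ```FreqDist``` builds on Python's own `collections.Counter`, so you can try out the same ideas on a small list without loading a book. The sketch below uses `Counter` directly as a stand-in; the sample sentence is made up, and the relative frequency is computed by hand rather than with ```freq()```.

```python
from collections import Counter

# Made-up sample text, already split into tokens
tokens = 'a whale is a whale and a ship is a ship'.split()
counts = Counter(tokens)

total = sum(counts.values())      # total number of tokens counted
whale_count = counts['whale']     # raw count of one token
a_share = counts['a'] / total     # relative frequency of 'a'

print(total, whale_count, a_share)
print(counts.most_common(1))
```

Dividing a raw count by the total is what turns "how many times" into "what share of the text".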

#### Challenge: Frequency distributions
Use a loop to compare the 10 most common words of the texts in the NLTK book.

In [None]:
for book in Library:
    fdist = FreqDist(book)  # a separate name, so fdist1 (Moby Dick) is not overwritten
    words = fdist.most_common(10)
    print(book, words)

## Exploring vocabulary (cont. - Try this at home)
As well as counting individual words, we can count other features of vocabulary, such as how often words of different lengths occur. We do this by putting together a number of the commands we've already learned.

We could start like this: 

```[len(word) for word in text1]```

... but this would print the length of every word in the whole book, so let's skip that bit!

In [None]:
fdist2= FreqDist(len(word) for word in text1)

In [None]:
fdist2.max()

In [None]:
fdist2.freq(3)

These last two commands tell us that the most common word length is 3, and that these 3-letter words account for about 20% of the book. We can see this just by visually inspecting the list produced by ```fdist2.most_common()```, but if this list were too long to inspect readily, or we didn't want to print it, there are other ways to explore it.

There are a number of functions defined for NLTK's frequency distributions:

 | Function | Purpose  |
 |--------------|------------|
 | fdist = FreqDist(samples) | create a frequency distribution containing the given samples |
 | fdist[sample] += 1 | increment the count for this sample |
 | fdist['monstrous']  | count of the number of times a given sample occurred |
 | fdist.freq('monstrous') | frequency of a given sample |
 | fdist.N()  |  total number of samples |
 | fdist.most_common(n)   |  the n most common samples and their frequencies |
 | for sample in fdist:   |  iterate over the items in fdist, when in the loop, we refer to each item as sample |
 | fdist.max() | sample with the greatest count |
 | fdist.tabulate()   |  tabulate the frequency distribution |
 | fdist.plot()  |   graphical plot of the frequency distribution |
 | fdist.plot(cumulative=True) | cumulative plot of the frequency distribution |
 | fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2 |

It is possible to select the longest words in a text, which may tell you something about its vocabulary and style

In [None]:
vocab = set(text4)
long_words=[]
for word in vocab:
    if len(word)>15:
        long_words.append(word)
        
sorted(long_words)

We can use a different template to do exactly the same thing in just one line. We are not going to go into depth on this, but if you are interested, it's called a **list comprehension**.

In [None]:
vocab = set(text4)
long_words = [word for word in vocab if len(word) > 15]
sorted(long_words)

We can also use numerical operators to refine the types of searches we ask Python to run. We can use the following relational operators:


### Common relationals
 |  Relational | Meaning |
 |--------------:|:------------|
 | <    |  less than |
 | <=   |   less than or equal to |
 | ==  |    equal to (note this is two "=" signs, not one) |
 | !=   |   not equal to |
 | \>   |   greater than |
 | \>= |   greater than or equal to |

#### Challenge: Use operators to explore text

Using one of the pre-defined sentences in the NLTK corpus, use the relational operators above to find:
- Words longer than four characters
- Words of four or more characters
- Words of exactly four characters

In [None]:
longer = [word for word in sent2 if len(word) > 4]
more = [word for word in sent2 if len(word) >= 4]
exact = [word for word in sent2 if len(word) == 4]
print(longer, more, exact)

In [None]:
for word in sent2:
    if len(word) > 4:
        print(word)

We can fine-tune our selection even further by adding other conditions. For instance, we might want to find long words that occur frequently (or rarely).  

#### Challenge: Search with conditions (homework)

Can you find all the words in a text that are more than seven letters long and occur more than seven times?

In [None]:
fdist5 = FreqDist(text5)
sorted(word for word in set(text5) if len(word) > 7 and fdist5[word] > 7)

### Common methods for strings

 | Operator  | Purpose  |
 |--------------|------------|
 | s.startswith(t) | test if s starts with t |
 | s.endswith(t)  |  test if s ends with t | 
 | t in s         |  test if t is a substring of s | 
 | s.islower()    |  test if s contains cased characters and all are lowercase | 
 | s.isupper()    |  test if s contains cased characters and all are uppercase | 
 | s.isalpha()    |  test if s is non-empty and all characters in s are alphabetic | 
 | s.isalnum()    |  test if s is non-empty and all characters in s are alphanumeric | 
 | s.isdigit()    |  test if s is non-empty and all characters in s are digits | 
 | s.istitle()    |  test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals) | 

In [None]:
sorted(w for w in set(text1) if w.endswith('ableness'))

In [None]:
sorted(n for n in sent7 if n.isdigit())
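
The string methods in the table are easiest to understand by trying them on a handful of tokens. Here is a small sketch with a made-up token list; each line keeps only the tokens that pass one test.

```python
# Made-up tokens covering words, a number, an acronym and punctuation
tokens = ['Whale', 'sea', '1851', 'NLTK', ',', 'agreeableness']

words = [t for t in tokens if t.isalpha()]        # letters only
numbers = [t for t in tokens if t.isdigit()]      # digits only
titlecased = [t for t in tokens if t.istitle()]   # initial capital, rest lowercase
ableness = [t for t in tokens if t.endswith('ableness')]

print(words)
print(numbers)
print(titlecased)
print(ableness)
```

Note that 'NLTK' counts as alphabetic but not as titlecased, because all of its letters are uppercase.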

#### Challenge (homework)

You'll remember right at the beginning we started looking at the size of the vocabulary of a text, but there were two problems with the results we got from using:

```len(set(text1))```

This count includes items of punctuation and treats capitalised and non-capitalised words as different things (*This* vs *this*). We can now fix these problems. We start by converting every word to lowercase, then we get rid of the punctuation and numbers.

In [None]:
len(set(text1))

In [None]:
normalized_text=[word.lower() for word in set(text1)]
len(set(normalized_text))
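
The second step, dropping punctuation and numbers, can be done with ```isalpha()``` from the string methods table. The sketch below shows both steps on a made-up token list so that the effect of each one is visible; it is one possible approach, and it would also discard tokens such as hyphenated words that mix letters with other characters.

```python
# Made-up tokens with mixed case, punctuation and a number
tokens = ['This', 'is', 'it', ',', 'this', 'is', 'Moby', 'Dick', '1851', '!']

# Step 1: lowercase everything so 'This' and 'this' count as one word
lowered = [t.lower() for t in tokens]

# Step 2: keep only tokens made entirely of letters
words_only = [t for t in lowered if t.isalpha()]

print(len(set(tokens)), len(set(lowered)), len(set(words_only)))
```

The three numbers shrink at each step, which is the same effect you should see when you apply these steps to `text1`.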

## Explore your own text: Accessing a corpus   

**Corpus** - Structured set of texts. 
*Corpora* is the plural of this. Example: A collection of medical journals.

### Access a text file from disk

In [None]:
import os

text_path = os.path.join('books', 'pg1080.txt')
print(text_path)
# "with" closes the file for us once the text has been read
with open(text_path, "r", encoding='UTF-8') as file:
    text = file.read()

### Access a text file from the web

In [None]:
from urllib import request
url = "http://www.gutenberg.org/cache/epub/1080/pg1080.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')


We can inspect the type of object we just got in Python:

In [None]:
type(raw)

We have one big string, that is, a sequence of characters. We already know how to work with this. For example, how long is this string?

In [None]:
len(raw)

In [None]:
raw[:5]

We are interested in words and sentences. In our first lessons, we analyzed texts that were already split into words and sentences. We can split our own text in the same way with an operation called "tokenization".

**Tokenization = cut the text into pieces like sentences or words**
<img src="res/token.jpg"/>

In [None]:
from nltk import word_tokenize,sent_tokenize,wordpunct_tokenize

In [None]:
tokens=word_tokenize(raw) # try different tokenizers
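
To see what a very simple tokenizer does, here is a sketch that uses only Python's `re` module. The function name `simple_tokenize` is our own, and the pattern splits text into runs of word characters and runs of punctuation; NLTK's tokenizers handle many more cases (contractions, abbreviations and so on), so this is an illustration of the idea rather than a replacement.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

sample_tokens = simple_tokenize("Call me Ishmael. Some years ago...")
print(sample_tokens)
```

Whitespace is dropped, while each word and each run of punctuation becomes its own token.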


In [None]:
my_text=nltk.Text(tokens) # the Text object lets us explore the text as we did in the first session

In [None]:
my_text.collocations()

### Access a different format. E.g. a python object created previously - BONUS!!

In [None]:
import os
import pickle
import nltk
from nltk import word_tokenize,sent_tokenize

dir_name='/Users/danielgil/Projects/LawSchool/NSWLEC RAW SENTENCES'
filenames=os.listdir(dir_name)
file_sent_tokens=list()  # one (first element, remaining sentences) pair per file
sent_tokens=list()       # the sentences from all files together

for filename in filenames:
    # Only unpickle files you trust: loading a pickle can run arbitrary code
    with open(os.path.join(dir_name, filename), "rb") as pkl_file:
        pkl = pickle.load(pkl_file)
    file_sent_tokens.append((pkl[0],pkl[1:]))
    sent_tokens.extend(pkl[1:])

In [None]:
len(file_sent_tokens)

In [None]:
len(sent_tokens)

In [None]:
sent_tokens[:3]

In [101]:
# let's tokenize by words
word_tokens=list()
for sent in sent_tokens[:5000]:
    word_tokens+=word_tokenize(sent)
    


In [None]:
len(word_tokens)

In [None]:
nltk_text=nltk.Text(word_tokens)
nltk_text.concordance('harm',lines=30,width=2000)

### Can you find a pattern??
<img src='res/pat.jpg'>

We can use regular expressions:

In [None]:
token_searcher = nltk.text.TokenSearcher(sent_tokens) 
# TokenSearcher patterns mark each token with angle brackets: <...>
possible_fines = token_searcher.findall(regexp=r'<.*\$[0-9]+.*>')
possible_fines[:10]
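
The same kind of search can be tried out with Python's built-in `re` module on a few sentences. The sketch below is not part of the original analysis: the sentences are invented and the pattern simply looks for a dollar sign followed by digits.

```python
import re

# Invented example sentences
sentences = [
    'The defendant was fined $5000 for the offence.',
    'No penalty was imposed.',
    'Costs of $1200 were awarded.',
]

# A dollar sign followed by one or more digits
amount_pattern = re.compile(r'\$[0-9]+')

# Keep the sentences that mention an amount, then pull the amounts out
with_amounts = [s for s in sentences if amount_pattern.search(s)]
amounts = [amount_pattern.findall(s) for s in with_amounts]

print(with_amounts)
print(amounts)
```

A pattern this simple would stop at the comma in an amount written like $5,000, so real data may need a more permissive expression.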

#### Challenge: Explore your vocabulary with your text

Use a text from Project Gutenberg

1. Explore the lexical richness
2. Calculate the percentage taken by a word
3. Find the collocations. 
4. Choose one of the words to concordance. 
5. Investigate how the word is used
6. Create a dispersion plot
7. Create a frequency distribution
8. Get the top 50 words
