# Day Two

Before we start, make sure you import nltk into your new notebook, as well as numpy and the example texts.

In [13]:
import nltk
import numpy
%matplotlib inline
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Tokenization
In general, Python regards a text file as a single long string of characters. Tokenization breaks text into words that the computer can understand as discrete units. Here is an example of one of NLTK's tokenizers at work:

First, we import the special Twitter tokenizer from NLTK.

In [3]:
from nltk.tokenize.casual import (TweetTokenizer, casual_tokenize)

Next, save a tweet as a variable. In computer programming, a variable is a piece of data (e.g. “PM @TurnbullMalcolm: Under changes agreed...”) paired with a symbolic name or identifier (e.g. 'tweet' in the code below). We'll learn more about these later, but here's how you assign data to a variable. I've given my tweet the variable name 'tweet'.

In [4]:
tweet = "PM @TurnbullMalcolm: Under changes agreed to today, it's 'inconceivable' Brighton terrorist would have got parole. @theheraldsun #auspol" 

'Call' (programmer speak) your variable by typing its name to check that the tweet is saved.

In [5]:
tweet

"PM @TurnbullMalcolm: Under changes agreed to today, it's 'inconceivable' Brighton terrorist would have got parole. @theheraldsun #auspol"

Now we use our special tweet tokenizer to tell the computer to recognise my tweet as a list of words, not a long string of characters. To do this, we create and save another variable. This is common practice.

In [6]:
tweet_tokens = casual_tokenize(tweet)

Now call your new variable and observe how the output differs from when you called the first variable, 'tweet'.

In [7]:
tweet_tokens

['PM',
 '@TurnbullMalcolm',
 ':',
 'Under',
 'changes',
 'agreed',
 'to',
 'today',
 ',',
 "it's",
 "'",
 'inconceivable',
 "'",
 'Brighton',
 'terrorist',
 'would',
 'have',
 'got',
 'parole',
 '.',
 '@theheraldsun',
 '#auspol']

_Compare the output of the variable `tweet` and the variable `tweet_tokens`. Notice that in the latter, the words are represented as a list._
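To see why a dedicated tokenizer is worth using, here is a minimal sketch that splits the same tweet on whitespace only. It uses Python's built-in `str.split()` rather than NLTK, and the variable name `naive_tokens` is just an illustrative choice.

```python
# The same tweet used above.
tweet = ("PM @TurnbullMalcolm: Under changes agreed to today, it's "
         "'inconceivable' Brighton terrorist would have got parole. "
         "@theheraldsun #auspol")

# Naive approach: split on whitespace only (plain Python, no NLTK).
naive_tokens = tweet.split()

print(naive_tokens[1])    # the colon stays attached to the handle
print(naive_tokens[6])    # the comma stays attached to 'today'
print(len(naive_tokens))  # fewer tokens than casual_tokenize() produced
```

Whitespace splitting leaves punctuation glued to the neighbouring word, whereas the output of `casual_tokenize()` above keeps punctuation marks as separate tokens.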

### Challenge!
Try running `tweet[1]` and then `tweet_tokens[1]` in separate cells. Observe what happens. What unit does each variable count in? Try changing the numbers in the square brackets. Have you noticed that Python starts counting at 0?
Using the `casual_tokenize()` function of nltk has changed our sentence into a list of words that can be searched, rather than a string of characters. We saved our initial sentence as ‘tweet’ and the list of tokenised words as ‘tweet_tokens’. Putting a number in square brackets after either name asks the computer what value (a character or a word) is at that position. This is called indexing. A list in computer programming is an abstract data type that represents a countable number of ordered values. You can learn more about list functions and data structures in Python [here](https://docs.python.org/3/tutorial/datastructures.html).

In [8]:
tweet[1]

'M'

In [9]:
tweet_tokens[1]

'@TurnbullMalcolm'

In [11]:
tweet_tokens[0]  # notice that Python starts counting at 0

'PM'

## Lists
Once a text has been tokenized, NLTK lets us treat it as a long list of words. First, we'll work with some short lists, to give you an idea of how a list behaves.

In [14]:
sent1
len(sent1)

4

The opening sentences of each of our texts have been pre-defined for you. You can inspect them by typing in `sent2` etc.
You can add lists together, creating a new list containing all the items from both lists. You can do this by typing out the two lists or you can add two or more pre-defined lists. This is called concatenation.

In [15]:
sent4 + sent1

['Fellow',
 '-',
 'Citizens',
 'of',
 'the',
 'Senate',
 'and',
 'of',
 'the',
 'House',
 'of',
 'Representatives',
 ':',
 'Call',
 'me',
 'Ishmael',
 '.']

We can also add an item to the end of a list by appending. When we ``append()``, the list itself is updated.

In [16]:
sent1.append('Please')
sent1

['Call', 'me', 'Ishmael', '.', 'Please']
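The difference between concatenating and appending can be sketched with two short hand-typed lists. The names `first`, `second`, `combined` and `result` are illustrative only; the lists are typed out so the cell runs without loading the NLTK texts.

```python
# Hand-typed lists, so this runs without NLTK.
first = ['Call', 'me', 'Ishmael', '.']
second = ['Fellow', '-', 'Citizens']

# Concatenation builds a new list; the originals are left alone.
combined = second + first
print(len(combined))  # 7
print(len(first))     # still 4

# append() changes the list in place and returns None.
result = first.append('Please')
print(first)   # ['Call', 'me', 'Ishmael', '.', 'Please']
print(result)  # None
```

Because `append()` returns `None`, writing `first = first.append('Please')` would replace your list with `None`, which is a common beginner mistake.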

### Indexing Lists
We can navigate this list with the help of indexes. Just as we can find out the number of times a word occurs in a text, we can also find where a word first occurs. We can navigate to different points in a text without restriction, so long as we can describe where we want to be.

In [17]:
print(text4.index('awaken'))

173


This works in reverse as well. We can ask Python for the item at index 173 in our list (note that we use square brackets here, not parentheses).

In [18]:
print(text4[173])

awaken
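The same two lookups work on any Python list. Below is a small sketch on a hand-typed list (`words` is an illustrative name), which also shows what happens when the word is not there at all.

```python
words = ['the', 'quick', 'fox', 'and', 'the', 'lazy', 'dog']

# How many times does a word occur?
print(words.count('the'))  # 2

# Where does it first occur? Later occurrences are ignored.
position = words.index('the')
print(position)         # 0
print(words[position])  # 'the'

# Asking for the index of a missing word raises a ValueError.
try:
    words.index('cat')
    found = True
except ValueError:
    found = False
    print("'cat' is not in the list")
```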


As well as pulling out individual items from a list, indexes can be used to pull out selections of text from a large corpus to inspect. We call this slicing.

In [19]:
print(text5[16715:16735])

['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']


If we're asking for the beginning or end of a text, we can leave out the first or second number. For instance, `[:5]` will give us the first five items in a list, while `[8:]` will give us everything from index 8 (the ninth item) to the end.

In [21]:
print(text2[:10])
print(text4[145700:])

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER']
['upon', 'us', ',', 'we', 'carried', 'forth', 'that', 'great', 'gift', 'of', 'freedom', 'and', 'delivered', 'it', 'safely', 'to', 'future', 'generations', '.', 'Thank', 'you', '.', 'God', 'bless', 'you', '.', 'And', 'God', 'bless', 'the', 'United', 'States', 'of', 'America', '.']
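Slices are easier to check by eye on a short list. In this sketch the list `letters` is made up for illustration. Note that the item at the stop index is not included in the result.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

first_five = letters[:5]  # indexes 0 to 4
from_eight = letters[8:]  # index 8 to the end
middle = letters[2:4]     # indexes 2 and 3; index 4 is excluded

print(first_five)  # ['a', 'b', 'c', 'd', 'e']
print(from_eight)  # ['i', 'j']
print(middle)      # ['c', 'd']
```

A slice `[start:stop]` therefore always contains `stop - start` items, as long as both numbers fall inside the list.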


To help you understand how indexes work, let's create a list of our own.
We start by defining the name of our list and then add the items. You probably won't type out lists by hand in your own work, but you may want to manipulate a list in other ways. Pay attention to the quote marks and commas when you create your test sentence.

In [22]:
sent = ['The', 'quick', 'brown', 'fox']
print(sent[0])
print(sent[2])

The
brown


Note that the first element in the list has index zero. This is because we are telling Python to go zero steps forward in the list. If we use an index that is too large (that is, we ask for something that doesn't exist), we'll get an error.
We can modify elements in a list by assigning new data to one of its index values. We can also replace a slice with new material.

In [23]:
sent[2] = 'furry'
sent[3] = 'child'
print(sent)

['The', 'quick', 'furry', 'child']
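The two remaining cases mentioned above, an index that is too large and replacing a slice, can be sketched as follows. The replacement words are arbitrary, and `out_of_range` is just an illustrative flag.

```python
sent = ['The', 'quick', 'brown', 'fox']

# An index past the end of the list raises an IndexError.
try:
    sent[10]
    out_of_range = False
except IndexError as error:
    out_of_range = True
    print('IndexError:', error)

# Replace a slice: two items go out, three items come in.
sent[1:3] = ['slow', 'grey', 'old']
print(sent)       # ['The', 'slow', 'grey', 'old', 'fox']
print(len(sent))  # 5
```

The replacement does not have to be the same length as the slice it replaces, so the list can grow or shrink.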
