## Sequence Labeling
<ul><li>Tokenize
<ul><li>Functions</li>
<li>For-Loops</li></ul></li>
<li>Part of Speech Tags
<ul><li>Conditional Statements</li></ul></li>
<li>Named Entity Recognition</li>
<li>Geographic Imagination</li></ul>

# 0. Preparation

Python has many basic, out-of-the-box functions that we use all the time for programming. When we want to extend our reach beyond the basics, new functions are made available through <i>packages</i> like NLTK. 

Packages typically need to be downloaded individually. However, if you are using a platform like Anaconda (https://www.continuum.io/downloads), then many common packages are already on your computer.
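Whether a package is already on your computer can be checked from Python itself. The sketch below is one way to do so with the standard library's `importlib.util.find_spec`; the helper name `is_installed` is our own, introduced only for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

# 'collections' ships with Python; 'nltk' and 'numpy' depend on your setup
for name in ["collections", "nltk", "numpy"]:
    print(name, "installed" if is_installed(name) else "missing")
```

Any package reported as missing would need to be installed before it can be imported.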

In order to access the new functions contained within a package, we have to <i>import</i> it into our programming environment.
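There are two common ways to write an import, and both appear in this notebook. As a small illustration, the sketch below uses the built-in `math` package instead of NLTK, so that it runs without any downloads.

```python
# Style 1: import the whole package, then reach inside it with a dot
import math
print(math.sqrt(16))

# Style 2: import one function by name, then call it directly
from math import sqrt
print(sqrt(16))
```

Both lines print the same value. The first style keeps it visible where a function comes from; the second saves typing when a function is used often.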

In [None]:
# Install modules (using pip)
!pip install nltk
!pip install numpy

# Import modules
import nltk

In [None]:
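# Check that NLTK has access to appropriate models for our project
# Note: resource names vary between NLTK releases. If a later cell raises a LookupError,
# download the resource named in the error message (for example, "punkt_tab").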

modules = ["averaged_perceptron_tagger", "maxent_ne_chunker", "punkt", "words"]

for module in modules:
    nltk.download(module)

In [None]:
# We'll use the opening paragraph of Renfrew & Bahn's "Archaeology: Theories, Methods and Practice" 
# throughout this notebook for our exercises

paragraph = 'Archaeology is partly the discovery of the treasures of the \
past, partly the work of the scientific analyst, partly the \
exercise of the creative imagination. It is toiling in the sun \
on an excavation in the deserts of Central Asia, it is working \
with living Inuit in the snows of Alaska. It is diving down to \
Spanish wrecks off the coast of Florida, and it is investigating \
the sewers of Roman York. But it is also the analysis \
of materials in the laboratory, and the interpretation of \
what these things mean for the human story. Finally, \
archaeology is also the conservation of the world’s cultural \
heritage, the understanding of stakeholders of the past, \
and the protection of the past from looting and careless destruction.'

# 1. Tokenize

Natural Language Processing is the field and set of methods dedicated to converting human language into something that the computer can read. It's important to keep in mind that a computer does not even know what a <i>word</i> is without receiving direct instructions from a human.

Fortunately, NLTK has an easy-to-implement set of instructions encoded in its function <i>word_tokenize()</i>. We put a string of human-language text between its parentheses, and it returns a list of the individual tokens (words and punctuation marks) from that text. NLTK has a similar function, <i>sent_tokenize()</i>, that does the same thing but returns a list of individual sentences.

Very often we want to tokenize our texts by word, while retaining information about the boundaries between sentences. In order to do this, we will first use <i>sent_tokenize()</i> and then iterate through our list of sentences with <i>word_tokenize()</i>.
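To see why tokenizing needs explicit instructions, it can help to try it by hand first. The sketch below uses only Python's built-in tools; the functions `naive_word_tokenize` and `naive_sent_tokenize` are hypothetical helpers written for this illustration and are much cruder than what NLTK does.

```python
import re

question = "What is the purpose of Stonehenge?"

# Attempt 1: split on whitespace; the question mark stays glued to the last word
print(question.split())

# Attempt 2: a regular expression that separates words from punctuation
def naive_word_tokenize(string):
    return re.findall(r"\w+|[^\w\s]", string)

print(naive_word_tokenize(question))

# A naive sentence splitter: break after '.', '!' or '?' followed by whitespace
def naive_sent_tokenize(string):
    return re.split(r"(?<=[.!?])\s+", string)

# The abbreviation 'Dr.' is wrongly treated as the end of a sentence
print(naive_sent_tokenize("Dr. Jones dug here. He found a coin."))
```

The last line shows where simple rules break down, which is one reason to rely on a tokenizer that has been prepared for such cases.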

### Functions

In [None]:
# Import the functions we will use directly

from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
word_tokenize("What is the purpose of Stonehenge?")

In [None]:
# Assign our sentence to a variable

ylvis = "What is the purpose of Stonehenge?"

In [None]:
# Inspect our new variable

ylvis

In [None]:
# Feed our new variable into the function

word_tokenize(ylvis)

In [None]:
# We can also assign the output of a function to a variable

ylvis_list = word_tokenize(ylvis)

In [None]:
# Inspect the output variable

ylvis_list

In [None]:
# And we can input that new variable into other functions and so on

len(ylvis_list)

In [None]:
# Assign three sentences of dialogue to a new variable

three_sentences = "What is the purpose of Stonehenge? A giant granite birthday cake? Or a prison far too easy to escape?"

In [None]:
# Inspect the newest variable

three_sentences

In [None]:
# Tokenize text by word

word_tokenize(three_sentences)

In [None]:
# Tokenize text by sentence

sent_tokenize(three_sentences)

In [None]:
## EXERCISE. Use the function word_tokenize() in order to get a list of words
##           from the paragraph from Renfrew & Bahn at the top of this notebook. 
##           How many tokens does the paragraph contain?

## EXERCISE. Use the function sent_tokenize() in order to get a list of sentences
##           from the Renfrew & Bahn paragraph.

## Bonus: What is the average number of words per sentence in this paragraph?

### For-Loops

We iterate through the elements in a list using the "for" and "in" syntax. You can tell those words do something special because they appear in green!
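As a minimal illustration of the syntax, the sketch below loops over a short, made-up list of site names. The variable names are our own choices; any valid name would work in place of `site`.

```python
sites = ["Stonehenge", "Pompeii", "Troy"]
lengths = []

for site in sites:
    # Indented lines run once per element, with 'site' set to that element
    print(site, len(site))
    lengths.append(len(site))

# This line is not indented, so it runs once, after the loop has finished
print("Done:", lengths)
```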

In [None]:
# Combine sentence- and word-level tokenization

# The line after the 'for' statement is indented, so that Python knows
# what to do to each element in the list when it comes up

sentence_list = sent_tokenize(three_sentences) # save sentences in variable

for sentence in sentence_list: # tell Python to go through all sentences in the sentence_list and...
    print(word_tokenize(sentence)) # ... print the list of tokens in each sentence

In [None]:
## EXERCISE. For the paragraph from Renfrew & Bahn from earlier, use a for-loop to get
##           a list of words from each sentence individually.

# Detour: Word Frequency

For those not yet familiar with Natural Language Processing, it often comes as a surprise how powerful word frequencies are. Simply creating a list of the unique words in a text and tallying the number of times each one appears encodes information about authorship, genre, time period and author nationality, among other features. Frankly, this is mind boggling!

It is exceptionally easy to create this kind of tally in Python. There is a simple out-of-the-box function that we can use to count the number of times a token appears in a list. Yesterday, we looked at a function from NLTK called <i>FreqDist</i> that is a special version of the one we will look at today, <i>Counter</i>.
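A <i>Counter</i> behaves much like a dictionary whose keys are the unique tokens and whose values are their counts. The sketch below uses a short, hand-typed token list (not the output of a tokenizer) to show how individual counts can be looked up.

```python
from collections import Counter

tokens = ["the", "past", "is", "the", "key", "to", "the", "past"]
tally = Counter(tokens)

print(tally["the"])          # look up the count of a single token
print(tally["future"])       # tokens that never appear count as 0, with no error
print(tally.most_common(2))  # the two most frequent tokens with their counts
```

Note that counting is case-sensitive: "The" and "the" would be tallied separately unless the tokens are lower-cased first.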

In [None]:
# Import a handy counting function
# Reports number of times each unique element appears in a list

from collections import Counter

In [None]:
# Create a list of tokens
indiana_jones_quote = "If you want to be a good archeologist, you have got to get out of the library!"
jones_tokens = word_tokenize(indiana_jones_quote)

In [None]:
# Inspect token list
jones_tokens

In [None]:
# Tally the appearances of each unique token
Counter(jones_tokens)

In [None]:
# Assign the tally to a new variable
tokens_counted = Counter(jones_tokens)

In [None]:
# Return unique tokens, sorted by number of appearances in list
tokens_counted.most_common()

In [None]:
## EXERCISE. What is the most common word in the Renfrew & Bahn paragraph from earlier?
##           How often does 'past' appear?

# 2. Part of Speech

As trained readers, we know that language partly operates according to (or sometimes against!) abstract, underlying structures, such as grammar. Identifying a word's part of speech, or tagging it, is an extremely sophisticated task that remains an open problem in the Natural Language Processing world. At this point, state-of-the-art taggers have somewhere in the neighborhood of 98% accuracy.

NLTK's default tagger, <i>pos_tag()</i>, has an accuracy just shy of that with the trade-off that it is comparatively fast. Simply place a list of tokens between its parentheses and it returns a new list where each item is the original word alongside its predicted part of speech.

The tags themselves come from the Penn Treebank and a full list of them can be found here: <a href="http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html</a>
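Each item in the returned list is a pair (a Python <i>tuple</i>) holding the token and its tag. The sketch below shows that shape using pairs typed by hand; they are illustrative only and were not produced by a tagger.

```python
# Hand-written (token, tag) pairs in the shape that a POS tagger returns
tagged = [("archaeology", "NN"), ("is", "VBZ"), ("the", "DT"), ("study", "NN")]

first_pair = tagged[0]   # each entry in the list is a tuple
token, tag = first_pair  # a tuple can be unpacked into two names
print(token, tag)

# Index 0 of a pair is the token, index 1 is the tag
print([pair[1] for pair in tagged])
```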

### Common POS taggers
<table align='left'>
    <tr>
        <td>from nltk.tag.perceptron</td>
        <td>import PerceptronTagger</td>
    </tr>
    <tr>
        <td>from nltk.tag.brill</td>
        <td>import BrillTagger</td>
    </tr>
    <tr>
        <td>from nltk.tag.stanford</td>
        <td>import StanfordTagger, StanfordPOSTagger, StanfordNERTagger</td>
    </tr>
</table>

Note: NLTK simply offers a wrapper for the Stanford taggers, which allows you to use them in Python, rather than their native Java. Stanford models can be downloaded from here: http://nlp.stanford.edu/software/

In [None]:
# NLTK's current default POS tagger is the 'averaged perceptron' as described here:
# https://spacy.io/blog/part-of-speech-POS-tagger-in-python

from nltk import pos_tag

# Create variable for new sentence
new_sentence = "Broadly, archaeology is the study of the human past through its material remains."

# Create list of word tokens
new_tokens = word_tokenize(new_sentence)

# Assign a POS tag to each token
pos_tag(new_tokens)

In [None]:
# Let's refresh ourselves on the functions and for-loops from earlier

# An old variable revisited!
three_sentences = "What is the purpose of Stonehenge? A giant granite birthday cake? Or a prison far too easy to escape?"

# Re-make the list of sentences from the text
sentence_list = sent_tokenize(three_sentences)

# Re-tokenize each sentence by word
tokenized_sentences = [word_tokenize(sent) for sent in sentence_list]

In [None]:
# Inspect the list of lists of tokens
tokenized_sentences

In [None]:
# Now iterate through the tokenized sentences and POS tag them
for sentence in tokenized_sentences:
    print(pos_tag(sentence))

In [None]:
# Collect the tagged sentences in a list
tagged_sentences = [pos_tag(sentence) for sentence in tokenized_sentences]

In [None]:
## EXERCISE. Get POS tags for the very new sentence below.

very_new_sentence = "We do not follow maps to buried treasure, and X never, ever marks the spot."

## EXERCISE. Get POS tags for the paragraph from Renfrew and Bahn.


### Conditional Statements
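A conditional statement runs its indented block only when a test is true. The Penn Treebank tagset has several noun tags (NN, NNS, NNP and NNPS), so testing for exactly 'NN' finds singular common nouns only. The sketch below compares that test with `startswith`, one possible way to catch all four tags. The tagged pairs are typed by hand for illustration and are not actual tagger output.

```python
# Hand-written (token, tag) pairs; not produced by a tagger
tagged = [("Spanish", "JJ"), ("wrecks", "NNS"), ("off", "IN"),
          ("the", "DT"), ("coast", "NN"), ("of", "IN"), ("Florida", "NNP")]

exact_nn = []
any_noun = []

for token, tag in tagged:
    if tag == "NN":           # True only for singular common nouns
        exact_nn.append(token)
    if tag.startswith("NN"):  # True for NN, NNS, NNP and NNPS
        any_noun.append(token)

print(exact_nn)
print(any_noun)
```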

In [None]:
# The entries in each tagged sentence consist of a token-tag pair.
# Sometimes we just want one of those values.

# When the entries in a list are paired like the (token,tag) format above,
# we can label the elements separately while we iterate through

for sentence in tagged_sentences:
    for token, tag in sentence:
        print(token)

In [None]:
# Of course, we can access either value in the pair

for sentence in tagged_sentences:
    for token, tag in sentence:
        print(tag)

In [None]:
# We can also add a condition: IF the condition is TRUE,
# then the script continues with the next indented line.
# Otherwise, it gets skipped!

# Calling the noun tag for our IF statement

for sentence in tagged_sentences:
    for token, tag in sentence:
        if tag=='NN':
            print(token)

In [None]:
# Calling the adjective tag for our IF statement

for sentence in tagged_sentences:
    for token, tag in sentence:
        if tag=='JJ':
            print(token)

In [None]:
# The double equals sign is a test of equality NOT a variable assignment

5 == 3

In [None]:
## EXERCISE. Return the nouns from the opening paragraph of Renfrew & Bahn.

# 3. Named Entity Recognition

Among parts of speech, names and proper nouns are of particular significance, since they are the more-or-less unique keywords that identify phenomena of social relevance (including people, places, time periods, artefacts, etc). After all, there is just one <i>Germany</i>, and in an excavation report, a word like <i>Neolithic</i> typically acts as a more-or-less stable referent over the course of the text. (Or perhaps we are interested in thinking about the degree of instability with which it is used!)

The identification of these kinds of names is referred to as Named Entity Recognition, or NER. The challenge is twofold. First, it has to be determined whether a name spans multiple tokens. (These multi-token grammatical units are referred to as <i>chunks</i>; the process, <i>chunking</i>.) Second, we would ideally distinguish among categories of entity. Is <i>Neolithic</i> a geographic location? Just who is this <i>Germany</i> I hear so much about?

To this end, the function ne_chunk() receives a list of tokens including their parts of speech and returns a nested list where named entities' tokens are chunked together, along with their category as predicted by the computer.

Unfortunately, standard NER methods (such as this one) only find standard named entities, such as persons, locations and organisations, and not the entities we are interested in as archaeologists (artefacts, species, etc). Standard named entities are used here as an example.
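The nested structure is easier to picture with a small mock-up. The sketch below does not use NLTK at all: `Chunk` is a hypothetical stand-in for NLTK's tree type, and the labels and tags are assigned by hand, so real output may differ. It only illustrates the idea that some items in the list are plain (token, tag) pairs while others are labeled groups of pairs.

```python
from collections import namedtuple

# Hypothetical stand-in for a chunk: a label plus the (token, tag) pairs it groups
Chunk = namedtuple("Chunk", ["label", "leaves"])

# Hand-built structure; a real chunker may label these tokens differently
chunked = [
    Chunk("PERSON", [("King", "NNP"), ("Arthur", "NNP")]),
    ("is", "VBZ"),
    ("sovereign", "JJ"),
    ("over", "IN"),
    Chunk("GPE", [("Britain", "NNP")]),
]

entities = []
for item in chunked:
    if isinstance(item, Chunk):  # plain (token, tag) pairs are skipped
        name = " ".join(token for token, tag in item.leaves)
        entities.append((item.label, name))

print(entities)
```

The multi-token name "King Arthur" comes out as a single entity because its two tokens were grouped into one chunk.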

In [None]:
# Let's start with a fresh sentence containing several proper names

ner_sentence = 'King Arthur is the sovereign over Britain and lord of the Round Table.'
ner_tokens = word_tokenize(ner_sentence)
pos_tags = pos_tag(ner_tokens)

In [None]:
# Inspect the POS tags
pos_tags

In [None]:
# Import the NER function

from nltk import ne_chunk

chunks = ne_chunk(pos_tags)

In [None]:
# NLTK is finicky here, so we need to use 'print' to inspect
print(chunks)

In [None]:
# We'll iterate through our list of chunks. Named Entities are grouped
# together into 'nltk.tree.Tree'. (This is an under-the-hood data type.)

for chunk in chunks:
    if isinstance(chunk, nltk.tree.Tree):
        print(chunk)

In [None]:
# Let's select just the ones with the 'GPE' (Geo-Political Entity) designation, and only print the entity

for chunk in chunks:
    if isinstance(chunk, nltk.tree.Tree):
        if chunk.label()=='GPE':
            # Loop over the (token, tag) pairs in the entity and join the tokens together
            print(" ".join([token[0] for token in chunk.leaves()]))

In [None]:
# When we have multiple conditions -- i.e. multiple 'if' statements --
# we can put them together on a line using 'and'.

for chunk in chunks:
    if isinstance(chunk, nltk.tree.Tree) and chunk.label()=='GPE':
        print(" ".join([token[0] for token in chunk.leaves()]))

In [None]:
## EXERCISE. Retrieve the place names (excluding POS tags) from the sentence below.

swallow_skeptic = "Oh yeah, an African swallow, maybe, but not a European swallow."

# 4. Optional Assignment: Geography in Renfrew & Bahn

An excerpt of a chapter of Renfrew & Bahn has been added in the folder. This bit of text describes how to survey and excavate, using examples from around the world.

Using the techniques from this lesson, count the number of times each place name appears in this text. Return a list of the most common place names.
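The counting step follows a common pattern: start with an empty list, append each place name as it is found, then tally the list. The sketch below shows only that pattern, using made-up entity mentions that are not taken from the chapter; the names `mentions` and `place_names` are our own.

```python
from collections import Counter

# Made-up (label, name) pairs standing in for NER results
mentions = [("GPE", "Egypt"), ("PERSON", "Howard Carter"), ("GPE", "Peru"),
            ("GPE", "Egypt"), ("ORGANIZATION", "UNESCO")]

place_names = []                  # start with an empty list
for label, name in mentions:
    if label == "GPE":
        place_names.append(name)  # add each place name as it is found

print(Counter(place_names).most_common())
```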


Questions:

- Does the list of the most common names make sense? 
- Are there things that don't? 
- Do you see any errors?
- Do you think there is any geographical bias in the examples used by Renfrew & Bahn?


In [None]:
# Read sample text of Renfrew & Bahn chapter from file
# Creates variable 'rb_text' with whole text in single string

with open('r_and_b_sample.txt', encoding="utf-8") as rb_file:
    rb_text = rb_file.read()


# Hint: you will need to create a new, empty list with "place_names = []", and then use "place_names.append(item)" to add all place names to this list
# (Avoid naming the variable "list", which would hide Python's built-in list type.)

The solution can be found in the file "r_and_b_ner_solution.py"