# LELA32052 Computational Linguistics Week 9

This week we are going to take a look at part of speech tagging.

## Tagged corpora
To understand part-of-speech tagging, it is useful to start by looking at some human-tagged (rather than machine-tagged) data. NLTK contains a number of tagged corpora. We can download and import a few of these as follows:

In [None]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
nltk.download('sinica_treebank')
nltk.download('indian')
nltk.download('conll2002')
nltk.download('cess_cat')

In [None]:
brown.tagged_words()[1:25]

In [None]:
nltk.download('universal_tagset')

In [None]:
brown.tagged_words(tagset="universal")[1:25]

In [None]:
nltk.corpus.sinica_treebank.tagged_words() # Chinese

In [None]:
nltk.corpus.indian.tagged_words() # Bangla, Hindi, Marathi, and Telugu language data

In [None]:
nltk.corpus.conll2002.tagged_words() # Spanish

In [None]:
nltk.corpus.cess_cat.tagged_words() # Catalan

## Inspecting tagged corpora

Inspecting human-tagged corpora can be useful both for linguistic research and for building taggers. We can use NLTK to do this.

Most straightforwardly we can look at the frequency with which particular words are given a tag (we will return to this later when we come to build a tagger).

In [None]:
sent = [("the","DET"),("man","NOUN"),("walked","VERB"),("the","DET"),("dog","NOUN")]

In [None]:
cfd1 = nltk.ConditionalFreqDist(sent)
cfd1['the']
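For intuition, a conditional frequency distribution can be thought of as a dictionary that maps each word to a counter over its tags. The following is a minimal plain-Python sketch of that idea using the same toy sentence; it is not NLTK's implementation, and the name `tag_counts` is just an illustrative choice.

```python
from collections import Counter, defaultdict

# Toy tagged sentence, same shape as the (word, tag) pairs used above.
sent = [("the", "DET"), ("man", "NOUN"), ("walked", "VERB"),
        ("the", "DET"), ("dog", "NOUN")]

# Map each word (the "condition") to a Counter over the tags it received.
tag_counts = defaultdict(Counter)
for word, tag in sent:
    tag_counts[word][tag] += 1

print(tag_counts["the"])                 # tag counts for "the"
print(tag_counts["man"].most_common(1))  # most frequent tag for "man"
```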

This becomes much more useful when we apply it to a whole corpus:

In [None]:
brown_tagged = brown.tagged_words(tagset='universal')
cfd1 = nltk.ConditionalFreqDist(brown_tagged)
cfd1['the']

In [None]:
cfd1['run']

And if we additionally use a couple of other NLTK tools (which we don't have time to cover in detail; I just want to give you a sense of what is possible), we can look at the frequency with which particular word classes follow a particular word (here, 'car'):

In [None]:
brown_tagged = brown.tagged_words(tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_tagged) if a[0] == 'car']
fd = nltk.FreqDist(tags)
fd.tabulate()
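To see what the cell above is doing, here is a small plain-Python sketch on made-up data. It assumes that `nltk.bigrams` simply pairs each item with its successor, which `zip` can imitate; the toy word list is hypothetical and only there so the counts are easy to check by hand.

```python
from collections import Counter

# Hypothetical toy data; the cell above uses the full Brown corpus.
tagged = [("the", "DET"), ("car", "NOUN"), ("stopped", "VERB"),
          ("a", "DET"), ("car", "NOUN"), ("park", "NOUN"),
          ("the", "DET"), ("car", "NOUN"), ("is", "VERB")]

# Pair every token with the token that comes right after it.
pairs = list(zip(tagged, tagged[1:]))

# Keep the tag of the second item whenever the first item's word is 'car'.
following_tags = Counter(b[1] for a, b in pairs if a[0] == "car")
print(following_tags.most_common())
```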

Or the frequency with which particular word classes precede other word classes:

In [None]:
brown_tagged = brown.tagged_words(tagset='universal')
word_tag_pairs = nltk.bigrams(brown_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
noun_preceders_fd = nltk.FreqDist(noun_preceders)
noun_preceders_fd.most_common()  # (tag, count) pairs, most frequent first

And you can even search for particular constructional patterns, such as two verbs joined by "and":

In [None]:
for tagged_sent in brown.tagged_sents(categories="news")[1:75]:
    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(tagged_sent):
        if (t1.startswith('V') and w2 == 'and' and t3.startswith('V')):
            print(w1, w2, w3)

## Building an automatic tagger

A very simple approach to automated tagging that actually works quite well is to find the most common tag for each word in a training corpus (as we did above) and just tag all occurrences of each word with its most common tag:

In [None]:
brown_tagged_sents = brown.tagged_sents(tagset='universal')

In [None]:
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

In [None]:
unigram_tagger.tag(["the","cat","sat","on","the","mat"])
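The idea behind this kind of tagger fits in a few lines of plain Python. The sketch below is only an illustration on hypothetical toy data (the names `best_tag` and `tag_unigram` are mine, not NLTK's): it counts tags per word, keeps the most frequent one, and returns `None` for words it has never seen.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: a list of tagged sentences.
train = [
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
    [("they", "PRON"), ("run", "VERB"), ("daily", "ADV")],
    [("we", "PRON"), ("run", "VERB")],
]

# Count how often each word receives each tag.
counts = defaultdict(Counter)
for sentence in train:
    for word, tag in sentence:
        counts[word][tag] += 1

# Keep only the single most frequent tag for each word.
best_tag = {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_unigram(words):
    """Tag each word with its most frequent training tag, or None if unseen."""
    return [(w, best_tag.get(w)) for w in words]

print(tag_unigram(["the", "run", "started"]))
```

Note that "run" gets VERB because it was tagged VERB twice and NOUN once in the toy data, and the unseen word "started" gets no tag at all.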

We can formally evaluate this by splitting our data into a training set and a testing set. We obtain the by-word tag frequencies from the training set and evaluate by tagging the test set and comparing our predicted tags to the human tags.

In [None]:
training_set_size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:training_set_size]
test_sents = brown_tagged_sents[training_set_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.accuracy(test_sents)
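The accuracy score is just the proportion of test tokens whose predicted tag matches the human tag. Here is a minimal sketch of that calculation, assuming a tagger is any function from a list of words to a list of (word, tag) pairs; `tagging_accuracy` and `always_noun` are hypothetical helpers written for this illustration.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the human (gold) tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [w for w, _ in gold]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += (gold_tag == pred_tag)
            total += 1
    return correct / total

# Hypothetical stand-in tagger that calls everything a NOUN.
def always_noun(words):
    return [(w, "NOUN") for w in words]

# Hypothetical held-out sentences with human tags.
gold = [[("the", "DET"), ("dog", "NOUN")],
        [("dogs", "NOUN"), ("bark", "VERB")]]

print(tagging_accuracy(always_noun, gold))
```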

### Regular expression based tagging

As a next step we want to use a more intelligent way to deal with words we haven't seen before, by making use of their orthography and/or morphology. Write regular expressions to classify words in this way and see if you can improve performance. I've added one example rule to get you started.

In [None]:
patterns = [
    (r'.*ing$', 'VERB'),  # words ending in -ing; add your own rules below
]

In [None]:
t0 = nltk.DefaultTagger('NOUN')
t1 = nltk.RegexpTagger(patterns, backoff=t0)
t2 = nltk.UnigramTagger(train_sents, backoff=t1)
t2.accuracy(test_sents)  # same method as used above; evaluate() is deprecated in recent NLTK releases
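If it helps to see the mechanism, here is a rough plain-Python sketch of a pattern-based tagger with a default fallback: the first pattern that matches wins, and anything unmatched gets the default tag. The `.*ly$` rule is a hypothetical extra, included only to show that rules are tried in order; it is not a claim about which rules work best on Brown.

```python
import re

# The first pattern is the starter rule from above; the second is a
# hypothetical extra rule added only to show that rules are tried in order.
patterns = [
    (r'.*ing$', 'VERB'),
    (r'.*ly$', 'ADV'),
]

def regexp_tag(word, default='NOUN'):
    """Return the tag of the first matching pattern, else the default tag."""
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag
    return default

print([(w, regexp_tag(w)) for w in ["running", "quickly", "table"]])
```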

As with other classification tasks we can generate a confusion matrix to see where things are going right or wrong.

In [None]:
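from sklearn.metrics import confusion_matrix
import pandas as pd
predicted = [tag for sent in brown.sents(categories='editorial') for (word, tag) in t2.tag(sent)]
true = [tag for (word, tag) in brown.tagged_words(categories='editorial', tagset="universal")]
# Fix the label order explicitly so the row and column names line up with the counts.
labels = sorted(set(true) | set(predicted))
# scikit-learn expects (y_true, y_pred): rows are the human tags, columns the predicted tags.
cm = pd.DataFrame(confusion_matrix(true, predicted, labels=labels), index=labels, columns=labels)
cm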
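A confusion matrix is nothing more than a table of counts of (true tag, predicted tag) pairs. The sketch below builds one by hand on hypothetical toy data, assuming the convention that rows are the true tags and columns the predicted tags.

```python
from collections import Counter

# Hypothetical human and predicted tags for six tokens.
true_tags = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]
pred_tags = ["NOUN", "NOUN", "NOUN", "NOUN", "VERB", "VERB"]

# One fixed, sorted label order shared by rows and columns.
labels = sorted(set(true_tags) | set(pred_tags))
pair_counts = Counter(zip(true_tags, pred_tags))

# Rows are the true tag, columns are the predicted tag.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

print("true/pred", *labels, sep="\t")
for label, row in zip(labels, matrix):
    print(label, *row, sep="\t")
```

The diagonal holds the correctly tagged tokens; everything off the diagonal is an error, and the row tells you which tag the tagger should have chosen.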

### Looking at the context

We want to improve this, and an obvious next step is to take context into account: give each word the tag that is most frequent for it given the tag of the preceding word (this is the context NLTK's BigramTagger uses). The problem is that, on its own, this doesn't do very well. Any idea why?

In [None]:
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.accuracy(test_sents)  # evaluate() is deprecated in recent NLTK releases

We can still make use of the bigram information by combining it with the unigram tagger via a process known as backing off. For each word, we check whether we have seen that word in that context (that is, after the same preceding tag) in our training data. If we have, we tag it with the most frequent tag for that word in that context. If we haven't, we tag the word with its most frequent tag regardless of context. And if we haven't seen the word at all, we tag it as a noun.

In [None]:
t0 = nltk.DefaultTagger('NOUN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.accuracy(test_sents)  # evaluate() is deprecated in recent NLTK releases
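The backoff chain can be sketched in plain Python as three lookups tried in order. This is only an illustration on hypothetical toy data, and it assumes the bigram context is the previous tag plus the current word; the function name `tag_with_backoff` is mine.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data.
train = [
    [("they", "PRON"), ("run", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN")],
]

unigram = defaultdict(Counter)  # word -> tag counts
bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts
for sentence in train:
    prev_tag = None             # None marks the start of a sentence
    for word, tag in sentence:
        unigram[word][tag] += 1
        bigram[(prev_tag, word)][tag] += 1
        prev_tag = tag

def tag_with_backoff(words, default="NOUN"):
    tagged, prev_tag = [], None
    for word in words:
        if (prev_tag, word) in bigram:   # 1. word seen in this context
            tag = bigram[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram:            # 2. word seen in some other context
            tag = unigram[word].most_common(1)[0][0]
        else:                            # 3. word never seen
            tag = default
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

print(tag_with_backoff(["the", "run"]))   # bigram context decides "run"
print(tag_with_backoff(["dogs", "run"]))  # falls back for both words
```

In the first sentence "run" is tagged NOUN because it follows a DET, as in the training data. In the second, "dogs" is unseen and gets the default, and "run" after a NOUN was never observed, so it falls back to its most frequent tag overall.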

### NLTK's Averaged Perceptron tagger

NLTK's default prebuilt tagger uses a Perceptron, just like the one we have been using for other tasks on the module. For more information on this approach see here: https://explosion.ai/blog/part-of-speech-pos-tagger-in-python


In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')

nltk.download('averaged_perceptron_tagger_eng')

It can be run straightforwardly like this:

In [None]:
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text, tagset="universal")

### POS tagging in other languages

POS taggers are available for a great many languages. A popular package called spaCy contains a number of them. Here, as an example, is a German tagger.

In [None]:
!pip install -U spacy

In [None]:
import spacy

In [None]:
!python -m spacy download de_core_news_sm

In [None]:
nlp = spacy.load('de_core_news_sm')

In [None]:
text = "Das ist nicht gut."

In [None]:
s1_t = nlp(text)

In [None]:
for tk in s1_t:
    print(tk.text, tk.tag_, tk.pos_)

### Chunking / Shallow Parsing

Chunking involves grouping words together into elementary phrases. In its most common form it doesn't involve any hierarchical structure.


In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')  # same resource name as in the earlier tagger cell
nltk.download('universal_tagset')

In [None]:
text = nltk.word_tokenize("I study Linguistics and Social Anthropology at the University of Manchester")

In [None]:
grammar = r"""
  NP: {<DET|ADP>?<ADJ>*<NOUN>}
      {<NOUN>+}
"""
sent = nltk.pos_tag(text, tagset="universal")
cp = nltk.RegexpParser(grammar)
cs = cp.parse(sent)
print(cs)
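To get a feel for how a tag pattern such as `<DET|ADP>?<ADJ>*<NOUN>` picks out chunks, here is a rough plain-Python sketch that matches an ordinary regular expression against the sequence of tags. It uses a hand-tagged toy sentence and a simplified pattern (both hypothetical), and it only approximates what `RegexpParser` does.

```python
import re

# Hand-tagged toy sentence (hypothetical, not tagger output).
tagged = [("the", "DET"), ("little", "ADJ"), ("dog", "NOUN"),
          ("chased", "VERB"), ("a", "DET"), ("cat", "NOUN")]

# Encode the tag sequence as one string: "<DET><ADJ><NOUN><VERB><DET><NOUN>".
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Optional determiner, any number of adjectives, then one noun.
np_pattern = re.compile(r"(<DET>)?(<ADJ>)*<NOUN>")

chunks = []
for m in np_pattern.finditer(tag_string):
    # Count the "<" characters to turn string offsets into token positions.
    start = tag_string[:m.start()].count("<")
    end = start + m.group().count("<")
    chunks.append([word for word, _ in tagged[start:end]])

print(chunks)
```

Words whose tags are not covered by any match (here "chased") stay outside the chunks, which is why the result is flat rather than hierarchical.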

Update the grammar so that it produces the following shallow parse: <br> <br>
(S <br>
  (NP I/PRON) <br>
  study/VERB <br>
  (NP Linguistics/NOUN and/CONJ Social/NOUN Anthropology/NOUN) <br>
  at/ADP <br>
  (NP the/DET University/NOUN of/ADP Manchester/NOUN)) <br>