<a href="https://colab.research.google.com/github/benjamininden/AI-teaching-python/blob/main/BrillTaggerNLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing a Brill tagger with NLTK

The [Brill tagger](https://en.wikipedia.org/wiki/Brill_tagger) was invented by Eric Brill in 1993. It is a rule-based tagger, and the [NLTK library](http://www.nltk.org/) provides code that helps to implement it. While the Brill tagger is no longer state of the art, it can still provide a baseline, help to understand the problem of tagging, or be an incentive to explore what NLTK has to offer. This notebook implements the Brill tagger with minor variations.

The first step is to import the necessary Python packages and download some NLTK data. The universal tag set contains a coarse-grained set of grammatical categories that will be used to tag the words. The [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is a collection of annotated texts in American English that has been used in linguistic research for decades. It will be used to train and test our tagger.

In [4]:
import nltk
import nltk.tag.brill
nltk.download('universal_tagset')
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Next, we take the first 90% of the tagged sentences in the news category of the Brown corpus as a training set. The remaining 10% will later serve as a test set.

In [5]:
brown_tagged_sents = brown.tagged_sents(categories='news', tagset='universal')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]

Once trained, the Brill tagger proceeds in two stages. In the first stage, each word that occurred in the training corpus is tagged with the tag that it received most frequently there. Each capitalized word that did not occur in the training corpus is tagged as a proper noun, while every other unknown word is tagged with the tag that is most common for words ending with the same three letters. Below, we specify these rules with a few differences: instead of always looking at the last three letters, we look for other patterns that commonly occur in English words. [Regular expressions](https://docs.python.org/3/library/re.html) are used to specify those patterns. The patterns are tried in order, and the first one that matches determines the tag.

In [6]:
backoff = nltk.RegexpTagger([
    (r'^[A-Z]+.*$', 'NOUN'),             # proper nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'NUM'),   # cardinal numbers (decimal point escaped)
    (r'(The|the|A|a|An|an)$', 'DET'),    # articles
    (r'.*able$', 'ADJ'),                 # adjectives
    (r'.*ness$', 'NOUN'),                # nouns formed from adjectives
    (r'.*ly$', 'ADV'),                   # adverbs
    (r'.*s$', 'NOUN'),                   # plural nouns
    (r'.*ing$', 'VERB'),                 # gerunds
    (r'.*ed$', 'VERB'),                  # past tense verbs
    (r'.*', 'NOUN')                      # nouns (default)
])
# Known words get their most frequent tag; unknown words fall back to the patterns.
baseline_tagger = nltk.UnigramTagger(train_sents, backoff=backoff)
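
To make the first stage concrete, here is a minimal sketch in plain Python that does not use NLTK. The training sentences, the shortened pattern list, and the names `toy_train`, `most_frequent_tag`, and `first_stage_tag` are all made up for illustration; the sketch only mirrors the idea of "most frequent tag, with regular expressions as a fallback" and is not how NLTK implements its taggers internally.

```python
import re
from collections import Counter, defaultdict

# Toy training data (hypothetical, not taken from the Brown corpus).
toy_train = [
    [("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("they", "PRON"), ("run", "VERB"), ("quickly", "ADV")],
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
]

# Count how often each word received each tag.
tag_counts = defaultdict(Counter)
for sent in toy_train:
    for word, tag in sent:
        tag_counts[word][tag] += 1

# Keep only the most frequent tag per known word.
most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

# A shortened pattern list in the spirit of the backoff tagger above.
patterns = [
    (re.compile(r"^[A-Z]+.*$"), "NOUN"),  # capitalized unknown words
    (re.compile(r".*ly$"), "ADV"),
    (re.compile(r".*ing$"), "VERB"),
    (re.compile(r".*"), "NOUN"),          # default
]

def first_stage_tag(words):
    """Tag known words by lookup and unknown words by the first matching pattern."""
    tagged = []
    for word in words:
        if word in most_frequent_tag:
            tagged.append((word, most_frequent_tag[word]))
        else:
            tag = next(t for p, t in patterns if p.match(word))
            tagged.append((word, tag))
    return tagged

toy_result = first_stage_tag(["the", "run", "slowly", "Boston", "jumping", "cat"])
print(toy_result)
```

In this toy data, "run" was seen twice as a noun and once as a verb, so it is tagged `NOUN` regardless of context. Mistakes of this kind are what the second stage is meant to correct.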

In the second stage of Brill tagging, rules from the rule set are applied repeatedly until a threshold is reached or no more rules apply. A typical rule has the form "change tag-a to tag-b if the preceding word is tagged tag-z and the following word is tagged tag-w".
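
A rule of this form can be sketched as a small function. This is an illustrative sketch only: the function name `apply_rule` and the example tags are assumptions, and the sketch assumes that the context is checked against the tags as they were before the rule started changing anything.

```python
def apply_rule(tags, from_tag, to_tag, prev_tag, next_tag):
    """Return a copy of `tags` in which `from_tag` is changed to `to_tag`
    wherever the preceding tag is `prev_tag` and the following tag is `next_tag`.
    The context is read from the unmodified input sequence."""
    new_tags = list(tags)
    for i in range(1, len(tags) - 1):
        if tags[i] == from_tag and tags[i - 1] == prev_tag and tags[i + 1] == next_tag:
            new_tags[i] = to_tag
    return new_tags

# Hypothetical first-stage output for "to visit the city",
# where "visit" was mistagged as a noun.
before = ["PRT", "NOUN", "DET", "NOUN"]
after = apply_rule(before, "NOUN", "VERB", prev_tag="PRT", next_tag="DET")
print(after)
```

Only the second tag changes. The final `NOUN` is left alone because it is the last token and has no following word, so the context cannot match.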

How are these rules learned? The training corpus is first tagged using the first stage of the trained Brill tagger. By comparing the results against the true tags, a list of tag error triplets of the form <tag-a, tag-b, number> is generated, where number counts how often the first stage assigned tag-a when the correct tag was tag-b.
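
Counting such error triplets is straightforward. The sketch below uses two short, made-up tag sequences (`predicted` and `gold` are hypothetical names and data) rather than real tagger output.

```python
from collections import Counter

# Hypothetical first-stage tags and the corresponding true tags.
predicted = ["DET", "NOUN", "PRT", "DET", "NOUN", "PRT", "NOUN"]
gold      = ["DET", "NOUN", "ADP", "DET", "NOUN", "ADP", "VERB"]

# Count every (assigned tag, correct tag) pair where the two differ.
error_counts = Counter((p, g) for p, g in zip(predicted, gold) if p != g)

# Turn the counts into <tag-a, tag-b, number> triplets, most frequent first.
error_triplets = [(a, b, n) for (a, b), n in error_counts.most_common()]
print(error_triplets)
```

Here the most frequent confusion is `PRT` assigned where `ADP` was correct, so a rule that fixes this confusion is a promising candidate.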

The tagger has a list of rule templates. Each of these templates is now instantiated with particular tags such that a given error would be corrected by the rule. The rule that leads to the strongest decrease in error rate if applied on the whole corpus is then added to the rule set. The code below sets up a BrillTaggerTrainer to train the rules of our Brill tagger.

In [7]:
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger,
                                               nltk.tag.brill.brill24())
brill_tagger = tt.train(train_sents, max_rules=15)
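
The rule selection step described above can be sketched in plain Python for a single template, "change tag-a to tag-b if the preceding word is tagged tag-z". The tag sequences and the helper `score_rule` are hypothetical, and the scoring (errors fixed minus errors introduced) is one simple way to measure the decrease in errors. NLTK's trainer works with many more templates and may differ in its details.

```python
# Hypothetical current tags (after the first stage) and true tags.
current = ["DET", "NOUN", "PRT", "NOUN", "DET", "NOUN",
           "PRT", "NOUN", "DET", "NOUN", "VERB"]
gold    = ["DET", "NOUN", "PRT", "VERB", "DET", "VERB",
           "PRT", "VERB", "DET", "NOUN", "VERB"]

def score_rule(rule, current, gold):
    """Net benefit of a rule (from_tag, to_tag, prev_tag):
    errors fixed minus correct tags that the rule would break."""
    from_tag, to_tag, prev_tag = rule
    fixed = broken = 0
    for i in range(1, len(current)):
        if current[i] == from_tag and current[i - 1] == prev_tag:
            if gold[i] == to_tag:
                fixed += 1
            elif gold[i] == from_tag:
                broken += 1
    return fixed - broken

# Instantiate the template once for every observed error.
candidates = {
    (current[i], gold[i], current[i - 1])
    for i in range(1, len(current))
    if current[i] != gold[i]
}

scores = {rule: score_rule(rule, current, gold) for rule in candidates}
best_rule = max(scores, key=scores.get)
print(scores)
print(best_rule)
```

The candidate that changes `NOUN` to `VERB` after `DET` fixes one error but breaks two correct tags, so it scores below the candidate that makes the same change after `PRT`, which fixes two errors and breaks none.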

The learned rules are actually human-readable, so we take a look at them. We also evaluate the Brill tagger by having it tag one sentence for purposes of demonstration, as well as calculating its accuracy on the test set.

In [8]:
print(brill_tagger.rules())
print(brill_tagger.tag(brown_sents[2007]))
print(brill_tagger.evaluate(brown_tagged_sents[size:]))

(Rule('025', 'PRT', 'ADP', [(Pos([1]),'DET')]), Rule('034', 'PRT', 'ADP', [(Pos([1]),'NOUN'), (Pos([2]),'NOUN')]), Rule('032', 'NOUN', 'VERB', [(Pos([-1]),'PRT'), (Pos([1]),'DET')]), Rule('034', 'PRT', 'ADP', [(Pos([1]),'NOUN'), (Pos([2]),'.')]), Rule('043', 'ADP', 'PRT', [(Word([0]),'all')]), Rule('043', 'ADP', 'PRT', [(Word([0]),'up')]), Rule('025', 'PRT', 'ADP', [(Pos([1]),'NUM')]), Rule('034', 'PRT', 'ADP', [(Pos([1]),'ADJ'), (Pos([2]),'NOUN')]), Rule('032', 'VERB', 'NOUN', [(Pos([-1]),'DET'), (Pos([1]),'ADP')]), Rule('038', 'ADP', 'ADV', [(Word([2]),'as')]), Rule('044', 'NOUN', 'VERB', [(Word([-1]),'would'), (Pos([-1]),'VERB')]), Rule('043', 'ADP', 'PRT', [(Word([0]),'out')]), Rule('035', 'NOUN', 'VERB', [(Word([-1]),'will')]), Rule('034', 'ADP', 'PRON', [(Pos([1]),'VERB'), (Pos([2]),'VERB')]), Rule('032', 'VERB', 'NOUN', [(Pos([-1]),'ADJ'), (Pos([1]),'.')]))
[('Various', 'ADJ'), ('of', 'ADP'), ('the', 'DET'), ('apartments', 'NOUN'), ('are', 'VERB'), ('of', 'ADP'), ('the', 'DET'),