# Part-of-Speech Tagging

Part-of-speech tagging is the task of identifying the grammatical classes of words in a sentence. In this class, we will use the concept of n-grams to help us automatically identify grammatical classes.

## Exercise 1

In the sentences below, identify the *nouns*, *verbs*, *adjectives*, and *adverbs*:

1. Today I woke up serenely and saw that it was a beautiful and calm day.
1. The mutation of fungi is capable of controlling people's minds!
1. Every day, the morning Sun comes and challenges us!
1. It is no use trying to make an automatic system that does something we do not understand the result of!

## Exercise 2

Many words almost always receive the same PoS tag: the word "Sun", for example, is nearly always a noun. Others change their tag according to the context, such as "house": "I live in a house" (noun), versus "I like house music" (adjective), versus "The museums house a collection of ancient artifacts" (verb).

Because of that, it is important to use context to determine the PoS tag of a word.

Recall that our n-gram language model was:

$$
P(w_n \mid w_{n-1}, w_{n-2}, \cdots, w_{n-L})
$$

Now, we can make a small change and use:

$$
P(\text{tag} \mid w_n, w_{n-1}, w_{n-2}, \cdots, w_{n-L})
$$
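
As a rough illustration, this conditional distribution can be estimated by counting how often each tag follows a given context. The sketch below assumes a tiny hand-made corpus with simplified tags (`NOUN`, `VERB`, ...), not Brown tags; the function name `tag_distribution` and its `context_size` parameter (playing the role of $L$) are hypothetical names introduced only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged toy corpus (not taken from the Brown corpus).
toy_corpus = [
    [("I", "PRON"), ("live", "VERB"), ("in", "ADP"), ("a", "DET"), ("house", "NOUN")],
    [("museums", "NOUN"), ("house", "VERB"), ("a", "DET"), ("collection", "NOUN")],
    [("a", "DET"), ("house", "NOUN"), ("is", "VERB"), ("a", "DET"), ("home", "NOUN")],
]

def tag_distribution(corpus, context_size):
    """Estimate P(tag | w_n, ..., w_{n-context_size}) by relative frequency."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = [word for word, _ in sentence]
        for i, (_, tag) in enumerate(sentence):
            # Context is the current word plus up to `context_size` previous words.
            start = max(0, i - context_size)
            context = tuple(words[start:i + 1])
            counts[context][tag] += 1
    return {
        context: {tag: c / sum(tag_counts.values()) for tag, c in tag_counts.items()}
        for context, tag_counts in counts.items()
    }

p0 = tag_distribution(toy_corpus, context_size=0)  # L = 0: only the word itself
p1 = tag_distribution(toy_corpus, context_size=1)  # L = 1: word plus previous word
print(p0[("house",)])            # ambiguous without context
print(p1[("a", "house")])        # unambiguous in this toy corpus
print(p1[("museums", "house")])
```

With $L=0$ the word "house" is split between two tags in this toy data, while one word of context is enough to separate the two uses. Longer contexts are seen less often, which is why a fallback to shorter contexts is needed.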

Similarly to the language models, we can use a fallback n-gram strategy to make a reasonable model. But, first, we will need to download a corpus:

In [3]:
import nltk
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to /home/tiago/nltk_data...
[nltk_data]   Package brown is already up-to-date!


A corpus is a collection of texts. The [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) contains many sentences, organized into categories and annotated with word-level part-of-speech tags. The annotation was done manually by a team of brave taggers. Here are some highlights on how to use the [Brown corpus in NLTK](https://www.nltk.org/book/ch02.html):

In [4]:
# This is a list of categories in the Brown corpus
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [5]:
# This is a list of sentences in a category:
brown.sents(categories='hobbies')
# Each sentence is a list of words

[['Too', 'often', 'a', 'beginning', 'bodybuilder', 'has', 'to', 'do', 'his', 'training', 'secretly', 'either', 'because', 'his', 'parents', "don't", 'want', 'sonny-boy', 'to', '``', 'lift', 'all', 'those', 'old', 'barbell', 'things', "''", 'because', '``', "you'll", 'stunt', 'your', 'growth', "''", 'or', 'because', 'childish', 'taunts', 'from', 'his', 'schoolmates', ',', 'like', '``', 'Hey', 'lookit', 'Mr.', 'America', ';', ';'], ['whaddya', 'gonna', 'do', 'with', 'all', 'those', 'muscles', '(', 'of', 'which', 'he', 'has', 'none', 'at', 'the', 'time', ')', "''", '?', '?'], ...]

In [6]:
# This is a list of *tagged* sentences in a category:
brown.tagged_sents(categories='hobbies')
# You can find the meaning of the tags by looking at the Wikipedia article: https://en.wikipedia.org/wiki/Brown_Corpus

[[('Too', 'QL'), ('often', 'RB'), ('a', 'AT'), ('beginning', 'VBG'), ('bodybuilder', 'NN'), ('has', 'HVZ'), ('to', 'TO'), ('do', 'DO'), ('his', 'PP$'), ('training', 'NN'), ('secretly', 'RB'), ('either', 'CC'), ('because', 'CS'), ('his', 'PP$'), ('parents', 'NNS'), ("don't", 'DO*'), ('want', 'VB'), ('sonny-boy', 'NN'), ('to', 'TO'), ('``', '``'), ('lift', 'VB'), ('all', 'ABN'), ('those', 'DTS'), ('old', 'JJ'), ('barbell', 'NN'), ('things', 'NNS'), ("''", "''"), ('because', 'CS'), ('``', '``'), ("you'll", 'PPSS+MD'), ('stunt', 'VB'), ('your', 'PP$'), ('growth', 'NN'), ("''", "''"), ('or', 'CC'), ('because', 'CS'), ('childish', 'JJ'), ('taunts', 'NNS'), ('from', 'IN'), ('his', 'PP$'), ('schoolmates', 'NNS'), (',', ','), ('like', 'CS'), ('``', '``'), ('Hey', 'UH'), ('lookit', 'VB+IN'), ('Mr.', 'NP'), ('America', 'NP'), (';', '.'), (';', '.')], [('whaddya', 'WDT+BER+PP'), ('gonna', 'VBG+TO'), ('do', 'DO'), ('with', 'IN'), ('all', 'ABN'), ('those', 'DTS'), ('muscles', 'NNS'), ('(', '('), ('o

**TASK**

Write code to compute the proportion of each tag throughout the corpus. If you finish this too quickly, break your counts down by category.

In [14]:
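# Solve the exercise here
from collections import defaultdict

def count_tags_in_sentence(dataset : list) -> dict:
    # `dataset` is a list of tagged sentences; each sentence is a list of (word, tag) pairs
    count = defaultdict(int)
    for sentence in dataset:
        for _word, tag in sentence:
            count[tag] += 1
    return count

# Count the tags in a single category ('lore'); omit `categories` to use the whole corpus
count = count_tags_in_sentence(brown.tagged_sents(categories='lore'))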

In [12]:
import pandas as pd 
df = pd.Series(count).sort_values(ascending=False)
df / df.sum()

NN             0.151375
IN             0.104329
AT             0.084352
NNS            0.062214
JJ             0.059299
                 ...   
PP$-TL         0.000012
BE-HL          0.000012
DTS-HL         0.000012
DO-HL          0.000012
FW-IN+NN-TL    0.000012
Length: 209, dtype: float64
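
For the per-category variant of the task, one possible sketch is shown below. It uses only the standard library and runs on a tiny stand-in dictionary so that it is self-contained; the helper names `tag_proportions` and `tag_proportions_by_category` and the `sample` data are hypothetical and introduced only for illustration.

```python
from collections import Counter

def tag_proportions(tagged_sents):
    """Return {tag: proportion} for an iterable of tagged sentences."""
    counts = Counter(tag for sentence in tagged_sents for _, tag in sentence)
    total = sum(counts.values())
    # most_common() orders the tags from most to least frequent.
    return {tag: n / total for tag, n in counts.most_common()} if total else {}

def tag_proportions_by_category(sents_by_category):
    """Apply tag_proportions to each category separately."""
    return {cat: tag_proportions(sents) for cat, sents in sents_by_category.items()}

# Hypothetical stand-in data; with NLTK loaded this could instead be
# {c: brown.tagged_sents(categories=c) for c in brown.categories()}.
sample = {
    "news": [[("The", "AT"), ("jury", "NN"), ("said", "VBD")]],
    "fiction": [[("He", "PPS"), ("ran", "VBD")], [("Dogs", "NNS"), ("bark", "VB")]],
}
by_category = tag_proportions_by_category(sample)
print(by_category["fiction"])
```

Keeping the proportions per category makes it easy to compare, for example, how the share of nouns differs between categories.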

## Exercise 3

Suppose we have no idea how to choose a PoS tag for any word. Our best guess is to pick a single tag and assign it to every word. For example, in the code below, we tag all words as nouns (NN). As we can see from the evaluation process, this is not a very accurate method:

In [15]:
from nltk.tag import DefaultTagger
default_tagger = DefaultTagger('NN')
sentence = brown.sents(categories='hobbies')[0]
sentence_tagged = default_tagger.tag(sentence)
sentence_ground_truth = brown.tagged_sents(categories='hobbies')[0]
print(sentence_tagged)
print(sentence_ground_truth)
accuracy = default_tagger.accuracy([sentence_ground_truth])
print(f'Accuracy: {accuracy}')

[('Too', 'NN'), ('often', 'NN'), ('a', 'NN'), ('beginning', 'NN'), ('bodybuilder', 'NN'), ('has', 'NN'), ('to', 'NN'), ('do', 'NN'), ('his', 'NN'), ('training', 'NN'), ('secretly', 'NN'), ('either', 'NN'), ('because', 'NN'), ('his', 'NN'), ('parents', 'NN'), ("don't", 'NN'), ('want', 'NN'), ('sonny-boy', 'NN'), ('to', 'NN'), ('``', 'NN'), ('lift', 'NN'), ('all', 'NN'), ('those', 'NN'), ('old', 'NN'), ('barbell', 'NN'), ('things', 'NN'), ("''", 'NN'), ('because', 'NN'), ('``', 'NN'), ("you'll", 'NN'), ('stunt', 'NN'), ('your', 'NN'), ('growth', 'NN'), ("''", 'NN'), ('or', 'NN'), ('because', 'NN'), ('childish', 'NN'), ('taunts', 'NN'), ('from', 'NN'), ('his', 'NN'), ('schoolmates', 'NN'), (',', 'NN'), ('like', 'NN'), ('``', 'NN'), ('Hey', 'NN'), ('lookit', 'NN'), ('Mr.', 'NN'), ('America', 'NN'), (';', 'NN'), (';', 'NN')]
[('Too', 'QL'), ('often', 'RB'), ('a', 'AT'), ('beginning', 'VBG'), ('bodybuilder', 'NN'), ('has', 'HVZ'), ('to', 'TO'), ('do', 'DO'), ('his', 'PP$'), ('training', 'NN'

1. Change the evaluation process above to calculate the accuracy over the entire 'editorial' category of the Brown corpus.
1. Change the default tag to the most common tag you found in Exercise 2. How does the accuracy change?

## Exercise 4

NLTK also comes with Unigram taggers. These taggers use $L=0$ in the context, that is, they always tag a word in the same way.

Unigram taggers require training data. In the example below, we use the tagged sentences from the "fiction" category for that.

Of course, there is a chance that the word we want to tag does not appear in the training data. In this case, what is the best alternative? Well, we can choose the most common tag in the dataset, which is what we had been doing with the default tagger. This strategy is called "backoff".
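
To make the idea concrete, here is a minimal sketch of a unigram tagger with backoff written in plain Python. It is not NLTK's implementation: the class `ToyUnigramTagger`, its `backoff_tag` parameter, and the two training sentences are hypothetical and only meant to show the lookup-then-fallback logic.

```python
from collections import Counter, defaultdict

class ToyUnigramTagger:
    """Hypothetical minimal unigram tagger with backoff (not NLTK's implementation)."""

    def __init__(self, tagged_sents, backoff_tag):
        word_tag_counts = defaultdict(Counter)
        for sentence in tagged_sents:
            for word, tag in sentence:
                word_tag_counts[word][tag] += 1
        # For every known word, remember only its most frequent tag.
        self.best_tag = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
        self.backoff_tag = backoff_tag

    def tag(self, words):
        # Known words get their most frequent tag; unknown words get the backoff tag.
        return [(w, self.best_tag.get(w, self.backoff_tag)) for w in words]

# Hypothetical training data using Brown-style tags.
train = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("run", "NN"), ("ends", "VBZ")],
]
toy_tagger = ToyUnigramTagger(train, backoff_tag="NN")
print(toy_tagger.tag(["the", "cat", "runs"]))
```

Here "cat" never appears in the training sentences, so it receives the backoff tag, while "the" and "runs" receive the tags seen in training.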

In [21]:
from nltk.tag import UnigramTagger
import numpy as np
unigram_tagger = UnigramTagger(brown.tagged_sents(categories='fiction'), backoff=default_tagger)
accuracies = []
for sentence_ground_truth in brown.tagged_sents(categories='adventure'):
    accuracy = unigram_tagger.accuracy([sentence_ground_truth])
    accuracies.append(accuracy)
print(f'Accuracy: {np.mean(accuracies)} +- {np.std(accuracies)}')

Accuracy: 0.8454339465388897 +- 0.11427520060748338
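
Note that the cell above averages one accuracy value per sentence, so a short sentence weighs as much as a long one. This is generally not the same number as the token-level accuracy obtained by scoring all sentences in a single call. The sketch below illustrates the difference on made-up data; the function names `token_accuracy` and `mean_sentence_accuracy` and the example sentences are assumptions introduced for this illustration.

```python
def token_accuracy(predicted_sents, gold_sents):
    """Fraction of all tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for predicted, gold in zip(predicted_sents, gold_sents):
        for (_, p_tag), (_, g_tag) in zip(predicted, gold):
            correct += p_tag == g_tag
            total += 1
    return correct / total

def mean_sentence_accuracy(predicted_sents, gold_sents):
    """Average of the per-sentence accuracies (each sentence weighs the same)."""
    scores = [token_accuracy([p], [g]) for p, g in zip(predicted_sents, gold_sents)]
    return sum(scores) / len(scores)

# Hypothetical data: one short sentence tagged perfectly, one longer sentence half wrong.
gold = [[("Go", "VB")], [("the", "AT"), ("old", "JJ"), ("dog", "NN"), ("sleeps", "VBZ")]]
pred = [[("Go", "VB")], [("the", "AT"), ("old", "NN"), ("dog", "NN"), ("sleeps", "NN")]]
print(token_accuracy(pred, gold))          # 3 correct tokens out of 5
print(mean_sentence_accuracy(pred, gold))  # mean of 1.0 and 0.5
```

Both numbers are legitimate; what matters is to report which one is used and to use the same one when comparing taggers.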


Train the Unigram tagger on the 'hobbies' category. Then, test it on the sentences of each category.

1. What do you observe in the results?
1. Does this phenomenon happen for any category chosen for training?
1. Design an experiment to find out whether this phenomenon is due to train and test being from the same category, or due to their containing exactly the same sentences.
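
For the last item, one possible starting point is to split the sentences of a single category into disjoint training and test parts, so that the category is shared but the sentences are not. The sketch below assumes a simple random split; the helper `split_sentences`, its parameters, and the placeholder sentences are hypothetical.

```python
import random

def split_sentences(sentences, test_fraction=0.2, seed=0):
    """Shuffle a copy of `sentences` and split it into disjoint train/test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for brown.tagged_sents(categories='hobbies').
sentences = [[("word%d" % i, "NN")] for i in range(10)]
train_sents, test_sents = split_sentences(sentences, test_fraction=0.2, seed=0)
print(len(train_sents), len(test_sents))
```

Training on `train_sents` and evaluating on `test_sents` can then be compared against training and evaluating on the full category.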

## Exercise 5

We can also use n-gram taggers:

In [25]:
from nltk import NgramTagger

train_category = 'fiction'
taggers = [default_tagger]

for n in range(1, 6):
    new_tagger = NgramTagger(
        n=n,
        train=brown.tagged_sents(categories=train_category),
        backoff=taggers[-1],
    )
    taggers.append(new_tagger)

accuracies = []
for sentence_ground_truth in brown.tagged_sents(categories='adventure'):
    accuracy = taggers[-1].accuracy([sentence_ground_truth])
    accuracies.append(accuracy)
print(f'Accuracy: {np.mean(accuracies)} +- {np.std(accuracies)}')

Accuracy: 0.85729407467091 +- 0.11353227826905676


1. Evaluate the bigram tagger.
1. Make a function that receives a value of $n$ and a training set as parameters, and returns a PoS tagger built from n-gram taggers of orders 1 to $n$, chained by successive backoff as in the loop above.
1. Make a figure showing how the accuracy in the Brown corpus changes as $n$ increases.

## Exercise 6

One measure of wordiness in a text is the lexical density. Lexical density comes from the idea that nouns and verbs convey meaning, while other words are only auxiliary. For a sentence, it is calculated as the number of nouns and verbs divided by the total number of words in the sentence.
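
Written as a formula, with symbols introduced here only for illustration ($s$ is a sentence and each $N$ is a count over the words of $s$):

$$
\text{LD}(s) = \frac{N_{\text{nouns}}(s) + N_{\text{verbs}}(s)}{N_{\text{words}}(s)}
$$

For example, a four-word sentence in which exactly one word is tagged as a noun and none as a verb has $\text{LD} = (1 + 0) / 4 = 0.25$.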

Make a function that receives a sentence (and possibly a PoS tagger) as inputs and returns the sentence's lexical density.



In [32]:
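def noun_or_verb(tag : str) -> bool:
    # In the Brown tagset, noun tags start with 'N' (NN, NNS, NP, ...) and main-verb tags with 'V' (VB, VBD, ...)
    return tag.startswith('N') or tag.startswith('V')

def lexical_density(tagged_sent : list) -> float:
    # tagged_sent is a list of (word, tag) pairs
    if not tagged_sent:
        return 0.0  # avoid dividing by zero for an empty sentence
    n_verbs_nouns = sum(1 for _word, tag in tagged_sent if noun_or_verb(tag))
    return n_verbs_nouns / len(tagged_sent)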

tagged_sent = taggers[-1].tag("You shall not pass".split())
print(tagged_sent)
print(lexical_density(tagged_sent=tagged_sent))

[('You', 'PPSS'), ('shall', 'MD'), ('not', '*'), ('pass', 'NN')]
0.25
