# **IN4080 - 2023**

# **Mandatory Assignment 2** - Alessia Sanfelici

**Part 1 – Exploring the NLTK tagger landscape**

**Exercise 1a: Data Split**

For our first experiments, we limit ourselves to the news section of the Brown corpus and split it into a 
training (90%) and a validation (10%) set. (We don’t need a test set for the moment, but will build one in 
part 3.) Moreover, we use the universal tagset instead of the default one.

You should store your data split in the variables news_train and news_val. Note that you may need 
to download the corpus first.

In [17]:
import nltk
nltk.download('universal_tagset')
from nltk.corpus import brown
from sklearn.model_selection import train_test_split
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
import pandas as pd

[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\sanfe\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


In [2]:
sents = brown.tagged_sents(categories = 'news', tagset = 'universal')

In [3]:
sents

[[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')], [('The', 'DET'), ('jury', 'NOUN'), ('further', 'ADV'), ('said', 'VERB'), ('in', 'ADP'), ('term-end', 'NOUN'), ('presentments', 'NOUN'), ('that', 'ADP'), ('the', 'DET'), ('City', 'NOUN'), ('Executive', 'ADJ'), ('Committee', 'NOUN'), (',', '.'), ('which', 'DET'), ('had', 'VERB'), ('over-all', 'ADJ'), ('charge', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('election', 'NOUN'), (',', '.'), ('``', '.'), ('deserves', 'VERB'), ('the', 'DET'), ('praise', 'NOUN'), ('and', 'CONJ'), ('thanks', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('City

In [4]:
news_train, news_val = train_test_split(sents, test_size = 0.1, random_state = 42)

**Exercise 1b: Most Frequent Class Baseline**

The distribution of part-of-speech tags is typically quite skewed, with the most frequent class in general 
being common nouns. As a simple baseline, we should thus know how a model that always predicts the 
same (most frequent) class performs. This can be done with nltk.DefaultTagger. Note that we are using the universal tagset, so the 
most frequent tag is not named NN. Evaluate it on the validation set and report the accuracy.

In [5]:
# Count tags over the whole training set, not just the first sentence
tags = [tag for sent in news_train for (word, tag) in sent]
max_tag = nltk.FreqDist(tags).max()
print("The most frequent tag is called", max_tag)

The most frequent tag is called NOUN


In [6]:
default_tagger = nltk.DefaultTagger('NOUN')
default_accuracy = default_tagger.accuracy(news_val)
print("Accuracy:")
print(default_accuracy)

Accuracy:
0.2996151189183855
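
The majority tag can also be derived from the data instead of being hard-coded. The following is a minimal sketch, assuming sentences are lists of `(word, tag)` pairs as returned by `brown.tagged_sents`. The helper names `most_frequent_tag` and `baseline_accuracy` and the toy sentences are illustrative and not part of NLTK.

```python
from collections import Counter

def most_frequent_tag(tagged_sents):
    """Return the tag that occurs most often across all sentences."""
    counts = Counter(tag for sent in tagged_sents for _, tag in sent)
    return counts.most_common(1)[0][0]

def baseline_accuracy(tagged_sents, tag):
    """Share of tokens whose gold tag equals the single predicted tag."""
    gold = [t for sent in tagged_sents for _, t in sent]
    return sum(t == tag for t in gold) / len(gold)

# Toy data standing in for news_train / news_val (hypothetical example).
toy_train = [
    [("The", "DET"), ("jury", "NOUN"), ("said", "VERB")],
    [("Atlanta", "NOUN"), ("voted", "VERB"), (".", ".")],
    [("County", "NOUN"), ("election", "NOUN")],
]
toy_val = [[("The", "DET"), ("election", "NOUN")]]

majority = most_frequent_tag(toy_train)
print(majority, baseline_accuracy(toy_val, majority))
```

On the real data, `most_frequent_tag(news_train)` could be passed directly to `nltk.DefaultTagger`.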


**Exercise 1c: Naïve Bayes Unigram Tagger**

One of the first models discussed in the course is a Naïve Bayes classifier that relies only on the current word 
and does not take any context into account. This model is available as nltk.UnigramTagger. Report the accuracy on the 
validation set. How does the accuracy on the universal tagset differ from the one reported on the default 
tagset in the NLTK book?

In [7]:
unigram_tagger = nltk.UnigramTagger(news_train)
unigram_accuracy = unigram_tagger.accuracy(news_val)
print("Accuracy:")
print(unigram_accuracy)

Accuracy:
0.8853251751702359


The accuracy in this case turns out to be lower than the one reported on the default tagset in the NLTK book, by around 5 percentage points.

**Exercise 1d: Bigram HMM Tagger**

In the lectures, we spent quite some time on the HMM tagger. Evaluate it on the validation set and report the result.

In [8]:
hmm = nltk.HiddenMarkovModelTagger.train(news_train)
hmm_accuracy = hmm.accuracy(news_val)
print("Accuracy:")
print(hmm_accuracy)

Accuracy:
0.9121681634264285


**Exercise 1e: Perceptron with greedy decoding**

In the lectures, we have briefly discussed Matthew Honnibal’s proposal of a structured perceptron
tagger with greedy decoding. He argued that an extended set of features is more helpful for tagging than 
exact (Viterbi) decoding. Evaluate it on the validation set and report the result.

In [9]:
perc = nltk.PerceptronTagger(load = False)
perc.train(news_train)
perc_accuracy = perc.accuracy(news_val)
print("Accuracy:")
print(perc_accuracy)

Accuracy:
0.9654593901115168


Summarize the results of the previous exercises and discuss them in a few sentences. Do the accuracies 
correspond to your expectations?

In [20]:
accuracies = [default_accuracy, unigram_accuracy, hmm_accuracy, perc_accuracy]
taggers = ["Default Tagger", "Unigram Tagger", "Hidden Markov Model Tagger", "Perceptron Tagger"]

df_accuracies = pd.DataFrame(accuracies, columns = ["Accuracy"], index = taggers)
df_accuracies

| Tagger | Accuracy |
|---|---|
| Default Tagger | 0.299615 |
| Unigram Tagger | 0.885325 |
| Hidden Markov Model Tagger | 0.912168 |
| Perceptron Tagger | 0.965459 |


...... discussion .....

**Part 2 – Greedy LR taggers and feature engineering**

**Exercise 2a: Getting started with a greedy logistic regression tagger**

In [14]:
class ScikitGreedyTagger(nltk.TaggerI):
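    def __init__(self, features, clf=None):
        self.features = features
        # Create the classifier per instance instead of sharing one default object
        self.classifier = clf if clf is not None else LogisticRegression(max_iter=400)
        self.vectorizer = DictVectorizer()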

    def train(self, train_sents):
        train_feature_sets = []
        train_labels = []
        for tagged_sent in train_sents:
            history = []
            untagged_sent = nltk.tag.untag(tagged_sent)
            for i, (word, tag) in enumerate(tagged_sent):
                feature_set = self.features(untagged_sent, i, history)
                train_feature_sets.append(feature_set)
                train_labels.append(tag)
                history.append(tag)
        x_train = self.vectorizer.fit_transform(train_feature_sets)
        y_train = np.array(train_labels)
        self.classifier.fit(x_train, y_train)

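    def tag(self, sentence):
        test_features = []
        # history is never filled here, so feature functions must not read it;
        # using predicted tags as features requires tagging one token at a time
        history = []
        for i, word in enumerate(sentence):
            featureset = self.features(sentence, i, history)
            test_features.append(featureset)
        X_test = self.vectorizer.transform(test_features)
        tags = self.classifier.predict(X_test)
        # Return a list rather than a one-shot zip iterator
        return list(zip(sentence, tags))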

In [15]:
def pos_features(sentence, i, history):
    features = {"curr_word": sentence[i]}
    if i == 0:
        features["prev_word"] = "<START>"
    else:
        features["prev_word"] = sentence[i-1]
    return features

In [16]:
lr_tagger = ScikitGreedyTagger(pos_features)
lr_tagger.train(news_train)
print("Accuracy:")
print(lr_tagger.accuracy(news_val))

Accuracy:
0.924010658245337


How does the accuracy of this tagger compare to the taggers tested in part 1?

In [None]:
..........

**Exercise 2b: Adding word context features**

The basic feature function contains the previous and the current word. Also add the next word and the 
word before the previous one. Describe which combination works best and keep it for the next 
experiment.

In [51]:
# Evaluation of previous word, current word and the word before the previous one

def pos_features_1(sentence, i, history):
    features = {"curr_word": sentence[i]}
    if i == 0:
        features["prev_word"] = "<START>"
        features["2_prev_word"] = "<START>"
    elif i == 1:
        features["prev_word"] = sentence[i-1]
        features["2_prev_word"] = "<START>"
    else:
        features["prev_word"] = sentence[i-1]
        features["2_prev_word"] = sentence[i-2]
    return features

lr_tagger = ScikitGreedyTagger(pos_features_1)
lr_tagger.train(news_train)
print("Accuracy:")
print(lr_tagger.accuracy(news_val))

Accuracy:
0.9243067206158098


In [52]:
# Evaluation of previous word, current word and next word

def pos_features_2(sentence, i, history):
    features = {"curr_word": sentence[i]}
    if i == 0:
        features["prev_word"] = "<START>"
    else:
        features["prev_word"] = sentence[i-1]
    try:
        features["next_word"] = sentence[i+1]
    except IndexError:
        # Last token of the sentence: there is no next word
        features["next_word"] = "<END>"
    return features

lr_tagger = ScikitGreedyTagger(pos_features_2)
lr_tagger.train(news_train)
print("Accuracy:")
print(lr_tagger.accuracy(news_val))

Accuracy:
0.934372841211882


In [53]:
# Evaluation of previous word, current word, next word and the word before the previous one

def pos_features_both(sentence, i, history):
    features = {"curr_word": sentence[i]}
    if i == 0:
        features["prev_word"] = "<START>"
        features["2_prev_word"] = "<START>"
    elif i == 1:
        features["prev_word"] = sentence[i-1]
        features["2_prev_word"] = "<START>"
    else:
        features["prev_word"] = sentence[i-1]
        features["2_prev_word"] = sentence[i-2]
    try:
        features["next_word"] = sentence[i+1]
    except IndexError:
        # Last token of the sentence: there is no next word
        features["next_word"] = "<END>"
    return features

lr_tagger = ScikitGreedyTagger(pos_features_both)
lr_tagger.train(news_train)
print("Accuracy:")
print(lr_tagger.accuracy(news_val))

Accuracy:
0.933978091384585


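The combination of the previous word, the current word and the next word works best (0.9344). It is marginally ahead of the variant that also adds the word before the previous one (0.9340), while using only the two preceding words gives 0.9243. Adding the next word is therefore the change that helps most.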


**Exercise 2c: Adding transition features**

Modify the feature function to include the tag predicted at the previous position. Does this help? What 
about a trigram model that includes the two previously predicted tags?
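
One possible shape for such transition features is sketched below. It assumes that `history` holds the tags already assigned to positions `0..i-1`, as it does in the `train` method of Exercise 2a. The `tag` method shown there never appends to `history`, so at prediction time these features need a token-by-token loop. `greedy_tag` illustrates that loop with a hypothetical `predict_one` callable that maps one feature dict to one tag. All names here are illustrative.

```python
def pos_features_tags(sentence, i, history):
    """Word context plus the one or two previously predicted tags."""
    features = {
        "curr_word": sentence[i],
        "prev_word": sentence[i - 1] if i > 0 else "<START>",
        "next_word": sentence[i + 1] if i + 1 < len(sentence) else "<END>",
    }
    # history holds the tags assigned to positions 0..i-1
    prev_tag = history[i - 1] if i > 0 else "<START>"
    prev2_tag = history[i - 2] if i > 1 else "<START>"
    features["prev_tag"] = prev_tag
    # Trigram variant: the two previous tags combined into one feature
    features["prev_two_tags"] = prev2_tag + "+" + prev_tag
    return features

def greedy_tag(sentence, features, predict_one):
    """Tag left to right, feeding each predicted tag back into history."""
    history = []
    for i in range(len(sentence)):
        history.append(predict_one(features(sentence, i, history)))
    return list(zip(sentence, history))

# Toy stand-in for a trained classifier (hypothetical rule, for illustration only).
def toy_predict(feats):
    if feats["curr_word"] == "the":
        return "DET"
    return "NOUN" if feats["prev_tag"] == "DET" else "VERB"

tagged = greedy_tag(["the", "dog", "barks"], pos_features_tags, toy_predict)
print(tagged)
```

With the scikit-learn tagger, `predict_one` would presumably vectorize a single feature dict and call the classifier's `predict` on it, which is slower than predicting the whole sentence at once.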

**Exercise 2d: Even more features**

Try to add more features to get an even better tagger. Only your imagination sets the limits to what you may 
consider. Some ideas: Extract suffixes and prefixes from the current, previous or next word. Is the 
current word a number? Is it capitalized? Does it contain capitals? Does it contain a hyphen? etc. What is 
the best feature set you can come up with? Train and test various feature sets and select the best one.
If you use sources for finding tips about good features (like articles, web pages, NLTK code, etc.) make 
references to the sources and explain what you got from them.
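
The ideas listed above could be turned into a feature function along the following lines. This is a sketch of one possible feature set, not a tuned or recommended one. The function name, the feature names and the chosen prefix and suffix lengths are assumptions.

```python
def word_shape_features(sentence, i, history):
    """Orthographic features of the current word (illustrative selection)."""
    word = sentence[i]
    return {
        "curr_word": word.lower(),
        "suffix_2": word[-2:],
        "suffix_3": word[-3:],
        "prefix_2": word[:2],
        "is_capitalized": word[:1].isupper(),
        "has_capital": any(c.isupper() for c in word),
        # Treat digits with thousands separators or one decimal point as numbers
        "is_number": word.replace(",", "").replace(".", "", 1).isdigit(),
        "has_hyphen": "-" in word,
        "is_first": i == 0,
    }

example = word_shape_features(["Term-end", "1,000.5"], 0, [])
print(example)
```

These features can be merged with the word context features from Exercise 2b by updating one dictionary with the other.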

**Exercise 2e: Regularization**

As in the previous assignment, we will study the effect of different regularization strengths now. In scikit-learn, regularization is expressed by the parameter C. A smaller C means stronger regularization. Try with 
C in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0] and see which value yields the best result. You can also try 
additional values.

Summarize your experiments to make clear which set of features and parameters provide the best 
results, and what the corresponding accuracy score is. Did you manage to outperform the perceptron 
tagger? If not, where do you think the bottleneck of your current tagger lies?
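
The sweep over `C` could be organized as sketched below. `sweep_c` is a hypothetical helper that takes any function mapping a value of `C` to a validation accuracy. The commented part indicates how it might connect to the tagger from Exercise 2a. The scoring function used to make the sketch runnable is synthetic and does not represent a real result.

```python
import math

def sweep_c(train_and_score, c_values=(0.01, 0.1, 1.0, 10.0, 100.0, 1000.0)):
    """Evaluate each C and return (best_c, scores)."""
    scores = {c: train_and_score(c) for c in c_values}
    best_c = max(scores, key=scores.get)
    return best_c, scores

# With the tagger from Exercise 2a, train_and_score could look like this
# (assumes ScikitGreedyTagger, pos_features_2, news_train and news_val exist):
#
# def train_and_score(c):
#     clf = LogisticRegression(C=c, max_iter=400)
#     tagger = ScikitGreedyTagger(pos_features_2, clf=clf)
#     tagger.train(news_train)
#     return tagger.accuracy(news_val)

# Synthetic stand-in that peaks at C = 10 so the sketch runs on its own.
def synthetic_score(c):
    return -abs(math.log10(c) - 1)

best_c, scores = sweep_c(synthetic_score)
print(best_c)
```

Keeping all scores in a dictionary makes it easy to report the full table rather than only the best value.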

**Part 3 – Training and testing on a larger corpus**

The Brown corpus covers 15 different genres, but we have only explored the news genre so far. In this 
part, we will retrain the most promising taggers on an extended set of genres and test them on held-out 
data.


**Exercise 3a: Compile the extended training and test data**

The NLTK book, chapter 2.1.3, lists the names of the 15 genres available in the Brown corpus. We will set 
two genres aside for testing: hobbies and adventure. For training, we will use the news training set 
prepared for the previous exercises, as well as the data from the remaining 12 genres. Prepare the 
corpus as described and store the datasets in the variables all_train, hobbies_test and 
adventure_test. We will not use news_val in this part. Make sure to use the universal tagset.
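
A sketch of how the datasets might be assembled follows. The genre list matches the 15 Brown categories named in the NLTK book. `build_all_train` and its `tagged_sents` argument are hypothetical: the argument stands for a loader such as `lambda g: brown.tagged_sents(categories=g, tagset='universal')`, and a fake loader is used here so that the block runs without the corpus.

```python
BROWN_GENRES = [
    "adventure", "belles_lettres", "editorial", "fiction", "government",
    "hobbies", "humor", "learned", "lore", "mystery", "news", "religion",
    "reviews", "romance", "science_fiction",
]
TEST_GENRES = {"hobbies", "adventure"}

def build_all_train(news_train, tagged_sents):
    """news_train plus every sentence from the 12 remaining genres."""
    extra = [g for g in BROWN_GENRES if g not in TEST_GENRES and g != "news"]
    all_train = list(news_train)
    for genre in extra:
        all_train.extend(tagged_sents(genre))
    return all_train

# Fake loader: one single-token sentence per genre (illustration only).
def fake_loader(genre):
    return [[(genre, "NOUN")]]

all_train = build_all_train([[("news_sentence", "NOUN")]], fake_loader)
print(len(all_train))
```

`hobbies_test` and `adventure_test` would then be loaded with the same loader, one genre each. Only `news_train` is taken from the news genre, so `news_val` stays out of the training data.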

**Exercise 3b: Evaluate the taggers**

Identify the most successful tagger from part 1 and the best setup from part 2. Retrain both of them on 
all_train and evaluate them separately on the two test genres. Report the results and discuss them 
briefly: Which of the two genres is “easier”? How well do the two taggers generalize to unseen genres?


**Exercise 3c: Confusion matrix**

The accuracy gives us a high-level overview of the performance of a tagger, but we may be interested in 
finding out more details about where the tagger makes the mistakes. The universal tagset is reasonably 
small, so we can produce a confusion matrix. Take a look at https://www.nltk.org/api/nltk.tag.api.html
and make a confusion matrix for the results. Pick the results of one test set and one tagger. Make sure 
you understand what the rows and columns are. Which pairs of tags are most easily confounded?
You can find the documentation of the tagset in the following link, but note that NLTK uses an earlier, 
slightly different version of the tagset: https://universaldependencies.org/u/pos/index.html
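
To make the meaning of rows and columns concrete, the counting behind a confusion matrix can be sketched in plain Python. In this sketch the first element of each pair is the gold tag (row) and the second is the predicted tag (column). The helper names and the toy data are illustrative, and this is not the NLTK implementation.

```python
from collections import Counter

def confusion_counts(gold_sents, predicted_sents):
    """Count (gold_tag, predicted_tag) pairs over aligned sentences."""
    counts = Counter()
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, g), (_, p) in zip(gold, pred):
            counts[(g, p)] += 1
    return counts

def most_confused(counts, n=3):
    """Off-diagonal cells with the highest counts."""
    errors = Counter({pair: c for pair, c in counts.items() if pair[0] != pair[1]})
    return errors.most_common(n)

# Toy gold and predicted taggings (hypothetical example).
gold = [
    [("that", "ADP"), ("run", "NOUN"), ("fast", "ADV")],
    [("that", "ADP"), ("run", "VERB")],
]
pred = [
    [("that", "DET"), ("run", "VERB"), ("fast", "ADV")],
    [("that", "DET"), ("run", "VERB")],
]

counts = confusion_counts(gold, pred)
print(most_confused(counts, n=1))
```

The diagonal cells, where gold and predicted tag agree, sum to the number of correctly tagged tokens.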

**Exercise 3d: Precision, recall and f-measure**

Finding hints on the NLTK web page linked above, calculate the precision, recall and f-measure for each 
tag and display the results in a table.

Also calculate the macro precision, macro recall and macro f-measure across all tags.
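
The per-tag and macro scores can be computed from flat lists of gold and predicted tags, as in the sketch below. It assumes that the macro f-measure is the unweighted mean of the per-tag f-measures, which is one common convention. The function names and the toy tag lists are illustrative.

```python
from collections import Counter

def per_tag_scores(gold_tags, predicted_tags):
    """Return {tag: (precision, recall, f1)} from aligned tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold_tags, predicted_tags):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was missed
    scores = {}
    for tag in sorted(set(gold_tags) | set(predicted_tags)):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[tag] = (prec, rec, f1)
    return scores

def macro_average(scores):
    """Unweighted mean of precision, recall and f1 over all tags."""
    n = len(scores)
    return tuple(sum(s[k] for s in scores.values()) / n for k in range(3))

# Toy example (hypothetical tags).
gold_tags = ["NOUN", "NOUN", "VERB", "DET"]
pred_tags = ["NOUN", "VERB", "VERB", "DET"]
scores = per_tag_scores(gold_tags, pred_tags)
macro = macro_average(scores)
print(scores, macro)
```

The resulting dictionary can be passed to `pd.DataFrame.from_dict(scores, orient="index")` to display it as a table.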

**Exercise 3e: Error analysis**

Sometimes, it makes sense to inspect the output of a machine learning model more thoroughly. Find five 
sentences in the test set where at least one token is misclassified and display these sentences in the 
following format, with both the predicted and gold tags.

Identify the words that are tagged differently. Comment on each of the differences. Would you say that 
the predicted tag is wrong? Or is there a genuine ambiguity such that both answers are defendable? Or is 
even the gold tag wrong?
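
The display format referred to above is not reproduced in this text, so the sketch below assumes a simple layout with one row per token and columns for the token, the predicted tag and the gold tag. The helper names, the layout and the toy data are assumptions.

```python
def sentences_with_errors(gold_sents, predicted_sents, limit=5):
    """Collect up to `limit` sentence pairs with at least one wrong tag."""
    found = []
    for gold, pred in zip(gold_sents, predicted_sents):
        if any(g != p for (_, g), (_, p) in zip(gold, pred)):
            found.append((gold, pred))
            if len(found) == limit:
                break
    return found

def format_comparison(gold, pred):
    """One row per token; mismatches are flagged at the end of the row."""
    lines = [f"{'Token':<15}{'Predicted':<10}{'Gold':<10}"]
    for (word, g), (_, p) in zip(gold, pred):
        marker = "" if g == p else "  <-- differs"
        lines.append(f"{word:<15}{p:<10}{g:<10}{marker}")
    return "\n".join(lines)

# Toy gold and predicted taggings (hypothetical example).
gold = [
    [("They", "PRON"), ("run", "VERB")],
    [("A", "DET"), ("long", "ADJ"), ("run", "NOUN")],
]
pred = [
    [("They", "PRON"), ("run", "VERB")],
    [("A", "DET"), ("long", "ADJ"), ("run", "VERB")],
]

errors = sentences_with_errors(gold, pred)
for g, p in errors:
    print(format_comparison(g, p))
```

On the real data, `pred` would presumably come from `tagger.tag_sents` applied to the untagged test sentences.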