<div class="alert alert-danger">
**Due date:** 2017-02-10
</div>

# Lab 3: Part-of-Speech Tagging

**Students:** Johan Lindström (johli160), Jonathan Sjölund (jonsj507)

## Introduction

Part-of-speech (POS) tagging is the task of labelling words (tokens) with parts of speech such as noun, adjective, and verb. In this lab you will implement a POS tagger based on the averaged perceptron and evaluate it on the [Stockholm Umeå Corpus (SUC)](http://spraakbanken.gu.se/eng/resources/suc), a Swedish corpus containing more than 74,000 sentences (1.1&nbsp;million tokens), which were manually annotated with, among other things, parts of speech. The corpus is divided into two files:

<table align="left">
<tr><td><code>suc-train.txt</code></td><td style="text-align: right">72,594 sentences</td><td style="text-align: right">1,142,802 tokens</td></tr>
<tr><td><code>suc-test.txt</code></td><td style="text-align: right">1,569 sentences</td><td style="text-align: right">23,319 tokens</td></tr>
</table>

Start by importing the Python module that is required for this lab:

In [1]:
import nlp3

The next cell loads the data:

In [2]:
training_data = nlp3.read_data("/home/TDDE09/labs/nlp3/suc-train.txt")
test_data = nlp3.read_data("/home/TDDE09/labs/nlp3/suc-test.txt")

Both data sets consist of tagged sentences. In Python, a tagged sentence is represented as a list of string pairs, where the first component of each pair represents a word and the second component represents a part-of-speech tag. Run the following code cell to see an example:

In [3]:
training_data[42]

[('Och', 'KN'),
 ('det', 'PN'),
 ('är', 'VB'),
 ('som', 'KN'),
 ('segerherre', 'NN'),
 ('han', 'PN'),
 ('vill', 'VB'),
 ('göra', 'VB'),
 ('politik', 'NN'),
 ('.', 'MAD')]
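
As a small illustration of this representation, the sketch below splits a tagged sentence into a list of words and a list of tags. The shortened sentence is hard-coded so that the example runs without the `nlp3` module.

```python
# A tagged sentence in the format described above: a list of (word, tag) pairs.
tagged_sentence = [("Och", "KN"), ("det", "PN"), ("är", "VB"), (".", "MAD")]

# zip(*pairs) transposes the list of pairs into one tuple of words
# and one tuple of tags.
words, tags = zip(*tagged_sentence)
words, tags = list(words), list(tags)

print(words)  # ['Och', 'det', 'är', '.']
print(tags)   # ['KN', 'PN', 'VB', 'MAD']
```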

The next cell extracts all unique tags from the training data. The tags are explained and exemplified in Table&nbsp;12 (page&nbsp;20) of the [SUC 2.0 Manual](https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf).

In [4]:
suc_tags = set()
for tagged_sentence in training_data:
    for word, tag in tagged_sentence:
        suc_tags.add(tag)
suc_tags = sorted(suc_tags)
print(" ".join(suc_tags))

AB DT HA HD HP HS IE IN JJ KN MAD MID NN PAD PC PL PM PN PP PS RG RO SN UO VB


Run the next code cell to train the default tagger, tag the sample sentence from above, and evaluate the tagger on the test data. Note that for reasons of speed, this only uses the first 1,000 sentences of the training data; for higher accuracies you should train on the complete training data.

In [5]:
tagger = nlp3.PerceptronTagger(suc_tags)
tagger.train(training_data[:1000])
print(tagger.tag([word for word, tag in training_data[42]]))
matrix = nlp3.confusion_matrix(tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(matrix)))

Progress: 99.90%
[('Och', 'KN'), ('det', 'PN'), ('är', 'VB'), ('som', 'HP'), ('segerherre', 'VB'), ('han', 'PN'), ('vill', 'VB'), ('göra', 'VB'), ('politik', 'NN'), ('.', 'MAD')]
Accuracy: 75.96%
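
How `nlp3.confusion_matrix()` and `nlp3.accuracy()` work internally is not shown in this notebook. The following is only a sketch of the general idea, assuming a nested dictionary that maps each gold tag to counts of predicted tags; the function names `build_confusion_matrix` and `matrix_accuracy` are hypothetical and not part of `nlp3`.

```python
from collections import defaultdict


def build_confusion_matrix(gold_tags, pred_tags):
    """Count how often each gold tag was predicted as each tag (hypothetical helper)."""
    matrix = defaultdict(lambda: defaultdict(int))
    for gold, pred in zip(gold_tags, pred_tags):
        matrix[gold][pred] += 1
    return matrix


def matrix_accuracy(matrix):
    """Accuracy is the diagonal mass divided by the total number of tokens."""
    correct = sum(row.get(gold, 0) for gold, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total if total else 0.0


# Gold tags and the default tagger's predictions for the first five
# words of the sample sentence shown above.
gold = ["KN", "PN", "VB", "KN", "NN"]
pred = ["KN", "PN", "VB", "HP", "VB"]
toy_matrix = build_confusion_matrix(gold, pred)
print("Accuracy: {:.2%}".format(matrix_accuracy(toy_matrix)))  # Accuracy: 60.00%
```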


## Implement the tagger

Your main task in this lab is to re-implement the two central methods of the default tagger:

* `train()`, which takes a list of tagged sentences and trains the tagger using the averaged perceptron learning algorithm

* `tag()`, which takes a list of words (strings) and returns a tagged sentence

You are of course free to add other methods to your class if you deem it appropriate to do so.

In implementing the tagger you will be able to reuse code from your implementation of the averaged perceptron classifier in lab&nbsp;1. However, for this lab it is crucial that you can handle multiple classes, as the tagger needs one class per POS tag.
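
To make the multi-class requirement concrete, here is a minimal sketch of the prediction step, assuming one weight dictionary per class and a feature list as input. The function name `predict` and the toy weights are illustrative assumptions, not the data structures used inside `nlp3.PerceptronTagger`.

```python
def predict(weights, classes, features):
    """Return the class whose weights give the feature list the highest score."""
    scores = {c: 0.0 for c in classes}
    for c in classes:
        class_weights = weights.get(c, {})
        for f in features:
            # Features without a weight for this class contribute nothing.
            scores[c] += class_weights.get(f, 0.0)
    # max() returns the first maximal element, so ties are broken
    # by the order of `classes`.
    return max(classes, key=lambda c: scores[c])


# Toy weights for two tags (made-up numbers, for illustration only).
toy_weights = {
    "NN": {"politik": 2.0},
    "VB": {"politik": -1.0, "göra": 3.0},
}
toy_classes = ["NN", "VB"]

print(predict(toy_weights, toy_classes, ["politik"]))  # NN
print(predict(toy_weights, toy_classes, ["göra"]))     # VB
```

With two classes this reduces to comparing two scores; with the 25 SUC tags the same loop compares 25 scores.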

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Implement a part-of-speech tagger based on the averaged perceptron, train it on the training data, and evaluate performance on the test data. Your tagger should get the same results as the default tagger.
</div>
</div>

Starter code for this problem is given in the following code cell. The provided class simply inherits from `nlp3.PerceptronTagger` and calls the methods in the superclass. Your task is to replace these calls with your own code. You will note that there is a third method `get_features()`; you do not need to touch this method unless you want to do the advanced part of this lab (see below).

In [6]:
class OurTagger(nlp3.PerceptronTagger):
    
    
    # Added helper: scores every tag for a feature list and returns the best one.
    def predict(self, x):
        scores = {tag: 0 for tag in self.classes}  # one score per tag
        for tag in self.classes:
            for f in x:
                # Register unseen features with weight 0 so later updates find them.
                self.weights[tag].setdefault(f, 0)
                self.acc[tag].setdefault(f, 0)
                scores[tag] += self.weights[tag][f]

        # Return the highest-scoring tag if its score is positive; otherwise
        # fall back to a fixed default tag.
        best = max(scores, key=lambda tag: scores[tag])
        if scores[best] > 0:
            return best
        # Fallback: the tag at index 9 of the sorted tag list (KN for the SUC
        # tag set), which gave the best accuracy so far (76.31%).
        return self.classes[9]

    def __init__(self, tags):
        """Creates a new tagger that uses the specified tag set."""
        super().__init__(tags)
        self.tags = tags

    def tag(self, words):
        """Tags the specified words, returning a tagged sentence."""
        pred_tags = []
        tagged_words = []
        for i in range(len(words)):
            f = self.get_features(words, i, pred_tags)  # features for the current configuration
            p = self.predict(f)  # predict the tag from the features
            pred_tags.append(p)
            tagged_words.append((words[i], p))

        return tagged_words
    
    def train(self, tagged_sentences, report_progress=True):
        """Trains this tagger on the specified gold-standard data."""
        # Use the tag set given to the constructor rather than re-reading
        # the global training_data.
        self.classes = sorted(self.tags)

        # One weight dict and one accumulator dict per tag.
        self.weights = {tag: {} for tag in self.classes}
        self.acc = {tag: {} for tag in self.classes}

        count = 1
        for x in tagged_sentences:  # one sentence of (word, tag) pairs
            # Split the sentence into parallel lists of tokens and gold tags.
            tokens = [word for word, tag in x]
            tags = [tag for word, tag in x]

            pred_tags = []
            for i in range(len(x)):  # visit each word of the sentence
                f = self.get_features(tokens, i, pred_tags)  # features for this configuration
                p = self.predict(f)  # predict the tag from the features
                pred_tags.append(p)
                y = tags[i]  # gold-standard tag for this token

                if p != y:
                    # predict() has already registered every feature for every tag.
                    for feature in f:
                        self.weights[p][feature] -= 1  # penalize the wrongly predicted tag
                        self.acc[p][feature] -= count
                        self.weights[y][feature] += 1  # reward the gold-standard tag
                        self.acc[y][feature] += count
            count += 1  # note: the counter advances once per sentence

        # Averaging: subtract the accumulated correction divided by the counter.
        for k in self.classes:
            for feature in self.weights[k]:
                self.weights[k][feature] -= self.acc[k][feature] / count

    def get_features(self, tokens, i, pred_tags):
        """Extracts the feature list for the specified configuration."""
        features = []

        # The current word is added four times so that it carries more weight.
        features.extend([tokens[i]] * 4)

        if len(pred_tags) > 0:
            # Previous tag, previous word, and combinations involving them.
            features.append(pred_tags[i-1])
            features.append(tokens[i-1])
            features.append((pred_tags[i-1], tokens[i-1]))
            features.append((tokens[i-1], tokens[i]))
        else:
            # Beginning of sentence: use placeholder symbols.
            features.append("BOSTAG")
            features.append("BOS")
            features.append(("BOSTAG", "BOS"))
            features.append(("BOS", tokens[i]))

        # Next word combined with the current word, or an end-of-sentence marker.
        if i == len(tokens) - 1:
            features.append(("EOS", tokens[i]))
        else:
            features.append((tokens[i+1], tokens[i]))

        return features

Run the following cell to test your tagger. At the end of the lab you should get the same results as in the evaluation of the default tagger (assuming that you do not change the feature extraction, see below).

In [7]:
our_tagger = OurTagger(suc_tags)
our_tagger.train(training_data[:])
print(our_tagger.tag([word for word, tag in training_data[42]]))
our_matrix = nlp3.confusion_matrix(our_tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(our_matrix)))

[('Och', 'KN'), ('det', 'PN'), ('är', 'VB'), ('som', 'KN'), ('segerherre', 'NN'), ('han', 'PN'), ('vill', 'VB'), ('göra', 'VB'), ('politik', 'NN'), ('.', 'MAD')]
Accuracy: 93.12%


In what follows, we try to give you an idea of what the two methods `train()` and `tag()` do. We start with the latter.

### Tagging

The default tagger implements the sequence model presented in the lecture. In this model, sentences are tagged from left to right. A **configuration** consists of the list of words, the index of the current word, and the list of already predicted tags. For each word in the sentence, the tagger calls the method `get_features()` to obtain a feature vector for the current configuration. To illustrate how this works, we define a variant of the default tagger that only extracts a single feature, the current word.

In [8]:
class DemoTagger(nlp3.PerceptronTagger):
    
    def get_features(self, words, i, pred_tags):
        if self.debug:
            print("words: {}".format(" ".join(words)))
            print("i: {} (current word: {})".format(i, words[i]))
            print("pred_tags: {}".format(" ".join(pred_tags)))
            print()
        return [words[i]]

We train this tagger and evaluate it:

In [9]:
demo_tagger = DemoTagger(suc_tags)
demo_tagger.debug = False
demo_tagger.train(training_data[:1000])
demo_matrix = nlp3.confusion_matrix(demo_tagger, test_data)
print("Accuracy: {:.2%}".format(nlp3.accuracy(demo_matrix)))
demo_tagger.debug = True

Progress: 99.90%
Accuracy: 69.18%


Here are the features that are extracted when the system tags the sentence *Anna älskar Kurt*:

In [10]:
demo_tagger.tag("Anna älskar Kurt".split())

words: Anna älskar Kurt
i: 0 (current word: Anna)
pred_tags: 

words: Anna älskar Kurt
i: 1 (current word: älskar)
pred_tags: PM

words: Anna älskar Kurt
i: 2 (current word: Kurt)
pred_tags: PM VB



[('Anna', 'PM'), ('älskar', 'VB'), ('Kurt', 'PM')]

Note that a feature vector is represented as a list of features. With this vector, the tagger then predicts the next tag using the classification rule for the perceptron, and updates the configuration before moving on to the next word. Finally, `tag()` returns the tagged sentence.
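
The left-to-right procedure described above can be sketched as a plain function. This is a simplified illustration, not the code of `nlp3.PerceptronTagger`: the names `greedy_tag`, `word_only_features`, and `toy_predict` are hypothetical, and the perceptron classification rule is replaced by a lookup table so that the example runs on its own.

```python
def greedy_tag(words, get_features, predict):
    """Tag words from left to right, feeding earlier predictions to later steps."""
    pred_tags = []
    for i in range(len(words)):
        # The configuration is (words, i, pred_tags).
        features = get_features(words, i, pred_tags)
        # Updating the configuration means appending the new tag.
        pred_tags.append(predict(features))
    return list(zip(words, pred_tags))


def word_only_features(words, i, pred_tags):
    """Same single feature as DemoTagger: the current word."""
    return [words[i]]


# Stand-in for the perceptron classification rule (assumption for this sketch).
toy_lexicon = {"Anna": "PM", "älskar": "VB", "Kurt": "PM"}


def toy_predict(features):
    return toy_lexicon.get(features[0], "NN")


result = greedy_tag("Anna älskar Kurt".split(), word_only_features, toy_predict)
print(result)  # [('Anna', 'PM'), ('älskar', 'VB'), ('Kurt', 'PM')]
```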

### Training

Training is based on the learning algorithm for the averaged perceptron. Note that the weight vectors need to be updated for each word, not for each sentence. The tagger maintains a list of already predicted tags as part of its configuration. The tagger trains for a single epoch.
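
As a rough sketch of such a training loop, the function below performs one pass over the data and updates the weights after every word. It uses the common accumulator trick for averaging (keep a step counter and a second set of weights scaled by that counter, then subtract at the end); whether the default tagger averages in exactly this way is an assumption. The name `train_averaged_perceptron` and the toy data are hypothetical.

```python
from collections import defaultdict


def train_averaged_perceptron(tagged_sentences, classes, get_features):
    """One epoch of averaged perceptron training with per-word updates."""
    weights = {c: defaultdict(float) for c in classes}
    acc = {c: defaultdict(float) for c in classes}  # counter-scaled updates
    count = 1
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        gold_tags = [tag for _, tag in sentence]
        pred_tags = []
        for i, gold in enumerate(gold_tags):
            features = get_features(words, i, pred_tags)
            pred = max(
                classes,
                key=lambda c: sum(weights[c].get(f, 0.0) for f in features),
            )
            if pred != gold:
                for f in features:
                    weights[gold][f] += 1
                    acc[gold][f] += count
                    weights[pred][f] -= 1
                    acc[pred][f] -= count
            # The configuration keeps the predicted tag, not the gold tag.
            pred_tags.append(pred)
            count += 1  # one step per word
    # Averaged weights: final weights minus the scaled accumulator.
    return {
        c: {f: w - acc[c][f] / count for f, w in weights[c].items()}
        for c in classes
    }


toy_data = [[("Anna", "PM"), ("älskar", "VB"), ("Kurt", "PM")]] * 2
model = train_averaged_perceptron(
    toy_data, ["PM", "VB"], lambda words, i, pred_tags: [words[i]]
)
print(model["VB"]["älskar"] > model["PM"]["älskar"])  # True
```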

## Advanced: Feature engineering

In the advanced part of this lab, you will practice your skills in **feature engineering**, the task of identifying useful features for a machine learning system.

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Think about which features could be useful for tagging and re-implement the method `get_features()` in the class `OurTagger` accordingly. Experiment not only with atomic features but also with different feature combinations (pairs or tuples of features). The goal is to create a system whose accuracy on the test data is as high as possible. Provide a short description of how you came up with your features.
</div>
</div>

Our first idea was to create a feature window covering the words immediately before and after the word of interest. For the preceding word we also included its predicted tag as a feature. We believe the word of interest itself is the most important feature for predicting its tag, so we tried giving it more weight by adding it to the feature list several times; we tested a few different repetition counts and kept the best one. To improve accuracy further, we added tuples as features. Because we consider the word of interest the most important, we chose to combine it with the other features. We tried several combinations, and our final and best implementation achieves around 93% accuracy.
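
One possible refinement of this feature window, which we have not evaluated, is to label each feature with its position so that, for example, a word seen as the previous word is a different feature from the same word seen as the current word. The sketch below illustrates that idea only; the function name `windowed_features`, the feature labels, and the `<BOS>`/`<EOS>` placeholders are hypothetical, and it is not the implementation behind the accuracy reported above.

```python
def windowed_features(tokens, i, pred_tags):
    """Position-labelled features over a window of one word on each side."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<BOS>"
    prev_tag = pred_tags[i - 1] if i > 0 else "<BOS>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    return [
        ("word", word),
        ("prev_word", prev_word),
        ("prev_tag", prev_tag),
        ("prev_tag+prev_word", prev_tag, prev_word),
        ("prev_word+word", prev_word, word),
        ("word+next_word", word, next_word),
    ]


# Features for the second word, given that the first was tagged PM.
for feature in windowed_features(["Anna", "älskar", "Kurt"], 1, ["PM"]):
    print(feature)
```

Because every feature is a tuple that starts with a label, the features stay hashable and can be used directly as dictionary keys in the weight tables.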