# DAgger on Part of Speech tagging

This notebook shows how to run the imitation learning algorithm DAgger (Dataset Aggregation, [Ross et al. (2011)](https://arxiv.org/pdf/1011.0686.pdf)) on a toy part of speech tagging dataset, showcasing its benefits. It follows the terminology of the EACL 2017 tutorial on imitation learning for structured prediction ([Vlachos et al. 2017](http://sheffieldnlp.github.io/ImitationLearningTutorialEACL2017/)) and the code from this [github repository](http://github.com/andreasvlachos/structured_imitation_demo). The latter uses [scikit-learn](http://scikit-learn.org/stable/) classifiers in Python 3 to facilitate adoption by academic researchers and software developers. The notebook closely follows the code in this [file](http://github.com/andreasvlachos/structured_imitation_demo/blob/master/src/POSdemo.py), if you would rather go straight there.

In what follows we show how to do this step-by-step. First get the code from the [github repository](http://github.com/andreasvlachos/structured_imitation_demo) and import the library:

In [8]:
import imitation

Define the (typically structured) input and the structured output, combined in an instance:

In [22]:
class POSInput(imitation.StructuredInput):
    def __init__(self, tokens):
        self.tokens = tokens  
    
    # this is just to help us print things
    def __str__(self):
        return " ".join(self.tokens)

class POSOutput(imitation.StructuredOutput):
    def __init__(self, tags=None):
        # default to an empty tag sequence when no tags are given
        self.tags = tags if tags is not None else []
            
    # this is just to help us print things
    def __str__(self):
        return " ".join(self.tags)
            

class POSInstance(imitation.StructuredInstance):
    def __init__(self, tokens, tags=None):
        super().__init__()
        self.input = POSInput(tokens)
        self.output = POSOutput(tags)

Most of the work is defining the transition system. The package has a class ```TransitionSystem``` that helps define it. See the comments in the code for some hints about its construction: 

In [23]:
class POSTransitionSystem(imitation.TransitionSystem):

    class WordAction(imitation.TransitionSystem.Action):
        def __init__(self):
            # The superclass constructor initializes the label and the features that each action has
            super().__init__()

    # the agenda for word prediction is one action per token, left-to-right
    def __init__(self, structured_instance=None):
        super().__init__(structured_instance)
        if structured_instance is None:
            return
        for tokenNo, token in enumerate(structured_instance.input.tokens):
            newAction = self.WordAction()
            newAction.tokenNo = tokenNo
            self.agenda.append(newAction)

    # the expert policy is trivial in the case of PoS tagging: just return the correct label from gold
    def expert_policy(self, structured_instance, action):
        # look up the gold tag of the token that this action refers to
        return structured_instance.output.tags[action.tokenNo]

    # Here we could be doing more book-keeping to help extract more complex features
    def updateWithAction(self, action, structuredInstance):
        # for now we only record the action that was taken
        self.actionsTaken.append(action)

    # The feature engineering goes here
    def extractFeatures(self, structured_instance, action):
        # e.g the word itself that we are tagging
        features = {"currentWord=" + structured_instance.input.tokens[action.tokenNo]: 1}

        # features based on the previous predictions of this stage are accessed via self.actionsTaken
        # e.g. the label of the previous action
        if len(self.actionsTaken) > 0:
            features["prevPrediction=" + self.actionsTaken[-1].label] = 1
        else:
            features["prevPrediction=NULL"] = 1

        return features

    # Convert the action sequence in the state to the actual prediction, i.e. a sequence of tags
    def to_output(self):
        tags = []
        for action in self.actionsTaken:
            tags.append(action.label)
        return POSOutput(tags)

This is it! We can now write the following which specifies that our tagger will be learned by the ```ImitationLearner``` with the ```POSTransitionSystem```:

In [24]:
class POSTagger(imitation.ImitationLearner):
    # specify the transition system
    transitionSystem = POSTransitionSystem

    def __init__(self):
        super().__init__()

Let's create a toy dataset and an instance of the tagger:

In [25]:
trainingInstances = []

# two instances
trainingInstances.extend([POSInstance(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])])
trainingInstances.extend([POSInstance(["I", "can", "meat"], ["Pronoun", "Verb", "Noun"])])

# repeated multiple times
trainingInstances.extend(30*[POSInstance(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])])
trainingInstances.extend(10*[POSInstance(["I", "can", "meat"], ["Pronoun", "Verb", "Noun"])])

tagger = POSTagger()

With only these two features (current word, previous tag), the classifier receives an ambiguous training signal for "can": the same feature vector is labeled "Modal" in 31 training instances and "Verb" in 11. It will therefore learn to predict only one of the two cases of "can" correctly, the one with more appearances in the training data. More complex features could address this, but generally speaking we rarely (want to) have features that are too complex, as they tend to be sparse and do not generalize well. Let's see what happens when training with 1 iteration of DAgger, which is equivalent to standard supervised training, also referred to as exact imitation:

In [30]:
params = POSTagger.params()
params.iterations = 1

tagger.train(trainingInstances, params)

print(trainingInstances[0].input)
print(tagger.predict(trainingInstances[0]).to_output())
print(trainingInstances[1].input)
print(tagger.predict(trainingInstances[1]).to_output())

Iteration:0, expert policy prob:1.0
I can fly
Pronoun Modal Verb
I can meat
Pronoun Modal Verb
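
The tallies behind this output can be reproduced without the library. The following is a minimal sketch that rebuilds the (features, label) pairs seen under exact imitation, using the same two features as `extractFeatures`. The helper name `exact_imitation_examples` and the tuple representation of features are assumptions made for illustration; they are not part of the `imitation` package.

```python
from collections import Counter

# Toy data from the notebook: 31 copies of the first sentence, 11 of the second.
data = (
    31 * [(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])]
    + 11 * [(["I", "can", "meat"], ["Pronoun", "Verb", "Noun"])]
)


def exact_imitation_examples(tokens, tags):
    """Yield (features, label) pairs when rolling in with the gold tags."""
    prev = "NULL"
    for token, tag in zip(tokens, tags):
        yield ("currentWord=" + token, "prevPrediction=" + prev), tag
        prev = tag  # under exact imitation the previous prediction is always gold


# Count how often each feature vector is paired with each label.
counts = Counter()
for tokens, tags in data:
    for features, label in exact_imitation_examples(tokens, tags):
        counts[(features, label)] += 1

for (features, label), n in sorted(counts.items()):
    print(features, "->", label, n)
```

In these tallies the feature vector for "can" carries two different labels (31 times "Modal", 11 times "Verb"), and "prevPrediction=Modal" only ever occurs with the label "Verb". The combination of "meat" with "prevPrediction=Modal" never occurs at all.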


As expected, the model cannot learn that "can" takes different tags depending on the context. We also get a second error: "meat" is tagged as a verb! This happens because, even though "meat" only ever appears as a noun in our data, we saw more instances of the tag "Modal" followed by a "Verb" tag. While this is a useful pattern to learn, we never encountered cases where the "Modal" tag is actually a mistake, since the training data labels are (typically) correct. In other words, we need to expose our model to its own incorrect predictions, so that it learns to recover from them. We do this by adding a second iteration of DAgger, in which the roll-in uses the classifier exclusively:

In [31]:
paramsImit = POSTagger.params()
paramsImit.iterations = 2
# The formula for probability of using the expert policy per iteration is (1-params.learningParam)^iteration_no
# iteration_no starts from 0, so 1 in the first iteration and 1-learningParam in the second one
paramsImit.learningParam = 1 

tagger.train(trainingInstances, paramsImit)

print(trainingInstances[0].input)
print(tagger.predict(trainingInstances[0]).to_output())
print(trainingInstances[1].input)
print(tagger.predict(trainingInstances[1]).to_output())

Iteration:0, expert policy prob:1
Iteration:1, expert policy prob:0
I can fly
Pronoun Modal Verb
I can meat
Pronoun Modal Noun
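
The schedule described in the code comments, (1 - learningParam)^iteration_no, can be tabulated directly. This is a small sketch of that formula only; the function name `expert_policy_prob` is hypothetical, and the value 0.3 is an arbitrary example rather than a recommended setting.

```python
def expert_policy_prob(learning_param, iteration_no):
    """Probability of rolling in with the expert at a 0-based iteration."""
    return (1 - learning_param) ** iteration_no


# learning_param = 1 reproduces the log above: expert only in iteration 0,
# classifier only afterwards. Smaller values decay more gradually.
for learning_param in (1, 0.3):
    schedule = [expert_policy_prob(learning_param, i) for i in range(4)]
    print("learningParam =", learning_param, "->", schedule)
```

With `learningParam = 1` the expert is used only in the first iteration, which is why the second iteration above rolls in with the classifier exclusively.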


The model has now learned to avoid the second mistake of tagging "meat" as a "Verb", thanks to the additional training data generated in the second iteration of DAgger. The tag for "can" in the second sentence is still "Modal", since the two features cannot disambiguate it. Of course more features could have fixed the "meat" error too, but we are unlikely to have the perfect classifier, so it is better to prepare it for its own mistakes.
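
To make the additional training data concrete, here is a rough sketch of a roll-in in which states are reached through the learner's own predictions while labels still come from the expert. It does not use the `imitation` package: `stand_in_policy` is a hand-written stand-in that mimics the predictions printed after the first iteration, not the trained scikit-learn classifier, and `rollin_examples` is a hypothetical helper.

```python
def stand_in_policy(token, prev):
    """Hand-written stand-in for the classifier after iteration 0.

    It mirrors the predictions printed earlier ("can" -> Modal, then Verb);
    it is an assumption for illustration, not the model the library trains.
    """
    if token == "I":
        return "Pronoun"
    if token == "can":
        return "Modal"
    return "Verb"


def rollin_examples(tokens, gold_tags, policy):
    """Roll in with `policy`, but label every visited state with the gold tag."""
    examples, prev = [], "NULL"
    for token, gold in zip(tokens, gold_tags):
        features = ("currentWord=" + token, "prevPrediction=" + prev)
        examples.append((features, gold))  # expert label for a learner-visited state
        prev = policy(token, prev)  # the next state depends on the learner's prediction
    return examples


new_examples = rollin_examples(
    ["I", "can", "meat"], ["Pronoun", "Verb", "Noun"], stand_in_policy
)
for features, label in new_examples:
    print(features, "->", label)
```

The last example pairs "currentWord=meat" and "prevPrediction=Modal" with the label "Noun", a combination that exact imitation never produces. Aggregating such examples with the original ones is what lets the classifier recover after mis-tagging "can".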

I intend to keep improving this tutorial and codebase, adding the notion of roll-outs and training against non-decomposable losses. If you see any mistakes, or have any questions or requests, I would be more than happy to hear them: [a.vlachos@sheffield.ac.uk](mailto:a.vlachos@sheffield.ac.uk).

### Acknowledgments

I have been working on imitation learning for structured prediction for many years with many great collaborators. However, special thanks are due to Gerasimos Lampouras and Sebastian Riedel who worked with me on preparing the aforementioned EACL 2017 tutorial.