# DAgger on Part of Speech tagging

This notebook shows how to run the imitation learning algorithm DAgger (Dataset Aggregation, [Ross et al. (2011)](https://arxiv.org/pdf/1011.0686.pdf)) on a toy part-of-speech tagging dataset to showcase its benefits. It follows the terminology of the EACL 2017 tutorial on imitation learning for structured prediction ([Vlachos et al. 2017](http://sheffieldnlp.github.io/ImitationLearningTutorialEACL2017/)) and the code from this [GitHub repository](http://github.com/andreasvlachos/structured_imitation_demo). The latter uses [scikit-learn](http://scikit-learn.org/stable/) classifiers in Python 3 to facilitate adoption by academic researchers and software developers. The notebook closely follows the code in this [file](http://github.com/andreasvlachos/structured_imitation_demo/blob/master/src/POSdemo.py), if you would rather go straight there.

In what follows we show how to do this step-by-step. First import the library:

In [8]:
import imitation

Define the (typically structured) input and the structured output, combined in an instance:

In [9]:
class POSInput(imitation.StructuredInput):
    def __init__(self, tokens):
        self.tokens = tokens  

class POSOutput(imitation.StructuredOutput):
    def __init__(self, tags=None):
        self.tags = []
        if tags is not None:
            self.tags = tags

class POSInstance(imitation.StructuredInstance):
    def __init__(self, tokens, tags=None):
        super().__init__()
        self.input = POSInput(tokens)
        self.output = POSOutput(tags)
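
As a quick illustration of how these classes fit together, the sketch below builds one labelled and one unlabelled instance. The three base classes are hypothetical stand-ins for the ones in `imitation`, included only so the snippet runs on its own; the real base classes may do more.

```python
# Hypothetical stand-ins for imitation.StructuredInput, StructuredOutput and
# StructuredInstance (assumptions, not the package's actual code).
class StructuredInput:
    pass


class StructuredOutput:
    pass


class StructuredInstance:
    def __init__(self):
        self.input = None
        self.output = None


class POSInput(StructuredInput):
    def __init__(self, tokens):
        self.tokens = tokens


class POSOutput(StructuredOutput):
    def __init__(self, tags=None):
        # an unlabelled instance simply carries an empty tag list
        self.tags = [] if tags is None else tags


class POSInstance(StructuredInstance):
    def __init__(self, tokens, tags=None):
        super().__init__()
        self.input = POSInput(tokens)
        self.output = POSOutput(tags)


labelled = POSInstance(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])
unlabelled = POSInstance(["I", "can", "fly"])

# pair each token with its gold tag
print(list(zip(labelled.input.tokens, labelled.output.tags)))
# no gold tags available here
print(unlabelled.output.tags)
```

Training instances carry gold tags for the expert policy to read, while instances to be tagged at prediction time can be created from tokens alone.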

Most of the work is defining the transition system. The package has a class ```TransitionSystem``` that helps define it. See the comments in the code for some hints about its construction: 

In [11]:
class POSTransitionSystem(imitation.TransitionSystem):

    class WordAction(imitation.TransitionSystem.Action):
        def __init__(self):
            # The superclass constructor initializes the label and the features that each action has
            super().__init__()

    # the agenda for word prediction is one action per token, left-to-right
    def __init__(self, structured_instance=None):
        super().__init__(structured_instance)
        if structured_instance is None:
            return
        for tokenNo, token in enumerate(structured_instance.input.tokens):
            newAction = self.WordAction()
            newAction.tokenNo = tokenNo
            self.agenda.append(newAction)

    # the expert policy is trivial in the case of PoS tagging: just return the correct label from gold
    def expert_policy(self, structured_instance, action):
        # look up the gold tag of the token this action refers to
        return structured_instance.output.tags[action.tokenNo]

    # Here we could be doing more book-keeping to help extract more complex features
    def updateWithAction(self, action, structuredInstance):
        # no extra book-keeping needed here; just record the action as taken
        self.actionsTaken.append(action)

    # The feature engineering goes here
    def extractFeatures(self, structured_instance, action):
        # e.g the word itself that we are tagging
        features = {"currentWord=" + structured_instance.input.tokens[action.tokenNo]: 1}

        # features based on the previous predictions are accessed via self.actionsTaken
        # e.g. the previous action
        if len(self.actionsTaken) > 0:
            features["prevPrediction=" + self.actionsTaken[-1].label] = 1
        else:
            features["prevPrediction=NULL"] = 1

        return features

    # Convert the action sequence in the state to the actual prediction, i.e. a sequence of tags
    def to_output(self):
        tags = []
        for action in self.actionsTaken:
            tags.append(action.label)
        return POSOutput(tags)
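
To make the role of the agenda and of the `prevPrediction` feature more concrete, here is a minimal, self-contained sketch of left-to-right greedy decoding. It does not use the `imitation` package: `extract_features`, `toy_policy` and `greedy_decode` are hypothetical names, and the hand-written policy merely stands in for a trained classifier. How `ImitationLearner` actually decodes is not shown in this notebook, so treat this as an assumption about the general idea.

```python
def extract_features(tokens, token_no, labels_so_far):
    """Same two features as above: the current word and the previous prediction."""
    features = {"currentWord=" + tokens[token_no]: 1}
    previous = labels_so_far[-1] if labels_so_far else "NULL"
    features["prevPrediction=" + previous] = 1
    return features


def toy_policy(features):
    """Hypothetical hand-written policy standing in for a trained classifier."""
    if "currentWord=I" in features:
        return "Pronoun"
    if "currentWord=can" in features:
        return "Modal"
    if "prevPrediction=Modal" in features:
        return "Verb"
    return "Noun"


def greedy_decode(tokens, policy):
    labels = []
    # the agenda: one tagging action per token, left to right
    for token_no in range(len(tokens)):
        features = extract_features(tokens, token_no, labels)
        # record the chosen label so that later steps can condition on it
        labels.append(policy(features))
    return labels


print(greedy_decode(["I", "can", "fly"], toy_policy))
print(greedy_decode(["I", "can", "meat"], toy_policy))
```

With this toy policy the second sentence comes out as Pronoun, Modal, Verb: tagging "can" as Modal changes the features seen at "meat", so one mistake leads to another. This kind of error propagation is what imitation learning tries to address.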

This is it! We can now write the following which specifies that our tagger will be learned by the ```ImitationLearner``` with the ```POSTransitionSystem```:

In [12]:
class POSTagger(imitation.ImitationLearner):
    # specify the transition system
    transitionSystem = POSTransitionSystem

    def __init__(self):
        super().__init__()

Let's create a toy dataset and an instance of the tagger:

In [14]:
trainingInstances = []

# two instances
trainingInstances.extend([POSInstance(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])])
trainingInstances.extend([POSInstance(["I", "can", "meat"], ["Pronoun", "Verb", "Noun"])])

# repeated multiple times
trainingInstances.extend(30*[POSInstance(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])])
trainingInstances.extend(10*[POSInstance(["I", "can", "meat"], ["Pronoun", "Verb", "Noun"])])

tagger = POSTagger()
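
To see what this dataset looks like from the classifier's point of view, the following sketch counts the gold labels observed for each (current word, previous gold tag) pair. It uses plain tuples rather than `POSInstance` objects so that it runs without the `imitation` package; the variable names are illustrative only.

```python
from collections import Counter, defaultdict

# The toy corpus as (tokens, tags) pairs: 31 and 11 copies respectively.
corpus = (
    31 * [(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])]
    + 11 * [(["I", "can", "meat"], ["Pronoun", "Verb", "Noun"])]
)

# Count the gold labels seen for each (current word, previous gold tag) pair.
label_counts = defaultdict(Counter)
for tokens, tags in corpus:
    previous_tag = "NULL"
    for token, tag in zip(tokens, tags):
        label_counts[(token, previous_tag)][tag] += 1
        previous_tag = tag

for state, counts in label_counts.items():
    print(state, dict(counts))
```

The pair ("can", "Pronoun") is the only one with competing labels: 31 times Modal and 11 times Verb.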

Given the features (current word, previous tag), the classifier receives an ambiguous training signal for "can": in both sentences it follows a Pronoun, yet it is labelled Modal in one and Verb in the other. The classifier will therefore learn to predict only one of the two cases correctly, namely the one that appears more often in the training data. More complex features could address this, but we rarely want features that are too complex, as they tend to be sparse and generalize poorly. Let's see what happens when training with 1 iteration of DAgger, which is equivalent to standard supervised training, also referred to as exact imitation:

In [None]:
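# NOTE: `params` is not defined in the cells shown here; it must already hold
# the learner's training parameters before this cell runs
# a single iteration of DAgger amounts to exact imitation
params.iterations = 1

tagger.train(trainingInstances, params)

# inspect the learned labels and the feature weights of the model
print(tagger.labelEncoder.classes_)
print(tagger.vectorizer.inverse_transform(tagger.model.coef_))

# predictions for the two toy sentences
print(tagger.predict(trainingInstances[0]).to_output())
print(tagger.predict(trainingInstances[1]).to_output())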
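
Exact imitation only ever sees states that the expert visits. As a rough sketch of what further DAgger iterations add, the self-contained code below aggregates (state, expert action) pairs over several passes. It makes several assumptions that are not taken from the `imitation` package: the classifier is replaced by a majority-vote lookup table, the first pass rolls in with the expert and later passes with the learned policy only (one simple schedule, with no probabilistic mixing), and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy corpus as (tokens, gold tags) pairs, mirroring the dataset above.
CORPUS = (
    31 * [(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])]
    + 11 * [(["I", "can", "meat"], ["Pronoun", "Verb", "Noun"])]
)


def train(dataset):
    """Fit a majority-vote lookup table (hypothetical stand-in for a classifier)."""
    by_state, by_word = defaultdict(Counter), defaultdict(Counter)
    for (word, previous), label in dataset:
        by_state[(word, previous)][label] += 1
        by_word[word][label] += 1

    def policy(state):
        # Back off to the word alone for unseen states;
        # this assumes the word itself occurred in training.
        counts = by_state[state] if state in by_state else by_word[state[0]]
        return counts.most_common(1)[0][0]

    return policy


def dagger(corpus, iterations):
    """Aggregate (state, expert action) pairs over several roll-in passes."""
    dataset, policy = [], None
    for _ in range(iterations):
        for tokens, gold in corpus:
            taken = []  # labels chosen so far by the roll-in policy
            for i, token in enumerate(tokens):
                state = (token, taken[-1] if taken else "NULL")
                # the training label is always the expert's action
                dataset.append((state, gold[i]))
                # roll in with the expert first, with the learned policy afterwards
                taken.append(gold[i] if policy is None else policy(state))
        policy = train(dataset)
    return policy, dataset


_, exact_imitation_data = dagger(CORPUS, iterations=1)
_, dagger_data = dagger(CORPUS, iterations=2)

# states that only appear once the learned policy is used for roll-in
new_states = {s for s, _ in dagger_data} - {s for s, _ in exact_imitation_data}
print(new_states)  # {('meat', 'Modal')}
```

The second pass visits the state ("meat", "Modal"), which arises only after the learned policy mistags "can" as Modal. The expert still labels that state Noun, so the aggregated dataset contains examples of how to continue after the model's own mistake.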


### Acknowledgments

I have been working on imitation learning for structured prediction for many years with many great collaborators. Special thanks to Gerasimos Lampouras and Sebastian Riedel, who worked with me on preparing the aforementioned EACL 2017 tutorial.