### POS tagging with maximum entropy models (10 pts)

In this task you will build a maximum entropy model for part-of-speech tagging. As the name suggests, our problem is all about converting a sequence of words into a sequence of part-of-speech tags. 
<img src=https://i.stack.imgur.com/6pdIT.png width=320>


__Your main goal:__ implement the model from [the article you're given](W96-0213.pdf).
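
For orientation, a maximum entropy tagger gives each candidate tag a probability that is log-linear in the feature weights. The sketch below shows that conditional form for a single word, assuming the per-tag scores (weight-feature dot products) are already computed. The function name `maxent_probs` and the toy scores are illustrative and not taken from the article, whose notation may differ.

```python
import math

def maxent_probs(scores):
    """Turn per-tag linear scores into a probability distribution (softmax)."""
    # Subtracting the maximum keeps exp() numerically stable and does not
    # change the resulting probabilities.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores for three candidate tags of one word (made-up numbers).
tags = ['NOUN', 'VERB', 'ADJ']
probs = maxent_probs([2.0, 0.5, -1.0])
best_tag = tags[probs.index(max(probs))]
print(dict(zip(tags, probs)), best_tag)
```

The rest of this notebook works with the unnormalised linear scores directly, which is enough to rank tags.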

Unlike previous tasks, this one gives you a greater degree of freedom and fewer automated tests. We provide you with a programming interface but nothing more.

__A piece of advice:__ there are a lot of objects in play here. If you don't understand why some object is needed, find the `train` function and see how everything is linked together.


### Part I: reading input data

In [1]:
import collections
import itertools

In [2]:
# Data types:
# Word: str
# Sentence: list of str
TaggedWord = collections.namedtuple('TaggedWord', ['text', 'tag'])
# TaggedSentence: list of TaggedWord
# Tags: list of TaggedWord
# TagLattice: list of Tags

def read_tagged_sentences(path):
    """
    Read tagged sentences from CoNLL-U file and return array of TaggedSentence (array of lists of TaggedWord).
    """
    with open(path, encoding='utf8') as f:
        sentences = list(map(lambda x: x.strip(), filter(lambda x: not x.startswith('# '), f.readlines())))
        sentences = [list(group) for k, group in itertools.groupby(sentences, lambda x: x == '') if not k]
        
        tagged_sentences = []
        for sentence in sentences:
            tagged_words = []
            for word_definition in sentence:
                word_parts = word_definition.split('\t')
                tagged_words.append(TaggedWord(text=word_parts[1], tag=word_parts[3]))
            tagged_sentences.append(tagged_words)
        
        return tagged_sentences

def write_tagged_sentence(tagged_sentence, f):
    """
    Write tagged sentence in CoNLL-U format to file-like object f.
    """
    get_word_def = lambda i: f'{i + 1}\t{tagged_sentence[i].text}\t_\t{tagged_sentence[i].tag}' + '\t_'*6 + '\n'
    f.writelines(map(get_word_def, range(len(tagged_sentence))))

def read_tags(path):
    """
    Read a list of possible tags from file and return the list.
    """
    with open(path, encoding='utf8') as f:
        return list(filter(lambda x: x != '', f.read().split('\n')))

In [3]:
assert read_tags('data/tags') == ['NOUN','PUNCT','VERB','PRON','ADP','DET','PROPN','ADJ','AUX','ADV','CCONJ','PART','NUM','SCONJ','X','INTJ','SYM']

In [4]:
tagged_sentences = read_tagged_sentences('data/en-ud-train.conllu')
with open('temp', 'w+') as f:
    write_tagged_sentence(tagged_sentences[0], f)
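
The reader above uses two of the ten tab-separated CoNLL-U columns: FORM (index 1) and UPOS (index 3). Below is a minimal sketch of the same parsing on an in-memory sample, assuming well-formed input. The sample sentence and the helper name `parse_conllu` are made up for illustration.

```python
import collections
import io

TaggedWord = collections.namedtuple('TaggedWord', ['text', 'tag'])

SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

def parse_conllu(f):
    """Yield one list of TaggedWord per sentence; blank lines separate sentences."""
    sentence = []
    for line in f:
        line = line.rstrip('\n')
        if line.startswith('#'):      # comment / metadata line
            continue
        if not line:                  # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split('\t')
        sentence.append(TaggedWord(text=columns[1], tag=columns[3]))
    if sentence:                      # the file may lack a trailing blank line
        yield sentence

sentences = list(parse_conllu(io.StringIO(SAMPLE)))
print(sentences)
```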

### Part II: evaluation

We want you to estimate tagging quality by simple accuracy: the fraction of tag predictions that turned out to be correct, computed over all words of the evaluated corpus.

In [5]:
# Data types:
TaggingQuality = collections.namedtuple('TaggingQuality', ['acc'])

def tagging_quality(ref, out):
    """
    Compute tagging quality and return the accuracy as a float.
    """
    nwords = 0
    ncorrect = 0
    for ref_sentence, out_sentence in itertools.zip_longest(ref, out):
        for ref_word, out_word in itertools.zip_longest(ref_sentence, out_sentence):
            if ref_word and out_word and ref_word.tag == out_word.tag:
                ncorrect += 1
            nwords += 1
    return ncorrect / nwords
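
As a quick illustration of the metric, the sketch below computes token-level accuracy on two made-up sentences. The tag sequences are invented, and the helper `accuracy` is a simplified stand-in that works on plain tag lists of equal length rather than on `TaggedWord` objects.

```python
def accuracy(ref, out):
    """Fraction of positions where the predicted tag equals the reference tag."""
    pairs = [(r, o)
             for ref_sent, out_sent in zip(ref, out)
             for r, o in zip(ref_sent, out_sent)]
    return sum(r == o for r, o in pairs) / len(pairs)

ref = [['DET', 'NOUN', 'VERB'], ['PRON', 'VERB']]
out = [['DET', 'NOUN', 'NOUN'], ['PRON', 'VERB']]
acc = accuracy(ref, out)
print(acc)  # 4 of the 5 tags match
```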

### Part III: Value and Update

Here we implement two interlinked data structures:
* __Value__ - a class that holds the POS tagger's parameters; basically an array of numbers
* __Update__ - a class that stores updates for Value

In [6]:
import numpy as np

class Value:
    def __init__(self, n):
        """
        Dense object that holds parameters.
        :param n: array length
        """
        self.positions = np.arange(n)
        self.values = np.zeros(n)

    def dot(self, update):
        if not isinstance(update, Update):
            raise ValueError(f'Expected update to be Update, got {type(update)}')
        return self.values[update.positions].dot(update.values)

    def assign(self, other):
        """
        self = other
        other is Value.
        """
        if not isinstance(other, Value):
            raise ValueError(f'Expected other to be Value, got {type(other)}')
        self.values = np.array(other.values)

    def assign_mul(self, coeff):
        """
        self = self * coeff
        coeff is float.
        """
        self.values *= coeff

    def assign_madd(self, x, coeff):
        """
        self = self + x * coeff
        x can be either Value or Update.
        coeff is float.
        """
        if not isinstance(x, Value) and not isinstance(x, Update):
            raise ValueError(f'Expected x to be Value or Update, got {type(x)}')
        if len(x.positions) == 0:
            return
        self.values[x.positions] += x.values*coeff

    def __repr__(self):
        return f'Value({self.positions}, {self.values})'

class Update:
    """
    Sparse object that holds an update of parameters.
    """

    def __init__(self, positions=None, values=None):
        """
        positions: array of int
        values: array of float
        """
        self.positions = np.array(positions if positions is not None else [], dtype=int)
        self.values = np.array(values if values is not None else [])
        if self.positions.shape != self.values.shape:
            raise ValueError(f'Expected positions and values shape to be equal, got: {self.positions.shape}, {self.values.shape}')

    def assign_mul(self, coeff):
        """
        self = self * coeff
        coeff: float
        """
        self.values *= coeff

    def assign_madd(self, update, coeff):
        """
        self = self + update * coeff
        coeff: float
        """
        if not isinstance(update, Update):
            raise ValueError(f'Expected update to be Update, got {type(update)}')
        
        new_positions = np.hstack([self.positions, update.positions])
        new_values = np.hstack([self.values, update.values*coeff])
        
        pos_to_value = collections.defaultdict(float)
        for position, value in zip(new_positions, new_values):
            pos_to_value[position] += value
        
        pos_and_value = [x for x in pos_to_value.items() if x[1] != 0]
        positions, values = ([], []) if len(pos_and_value) == 0 else zip(*pos_and_value)
        
        self.positions = np.array(positions, dtype=int)
        self.values = np.array(values)
    
    def __repr__(self):
        return f'Update({self.positions}, {self.values})'

In [7]:
v = Value(5)
assert (v.positions == np.array([0, 1, 2, 3, 4])).all()
assert (v.values == np.array([0, 0, 0, 0, 0])).all()

other = Value(5)
other.values = np.array([1, 2, 1, 2, 2])
v.assign(other)
assert (v.values == np.array([1, 2, 1, 2, 2])).all()

v.assign_mul(2)
assert (v.values == np.array([2, 4, 2, 4, 4])).all()
assert (other.values == np.array([1, 2, 1, 2, 2])).all()

v.assign_madd(other, 3)
assert (v.values == np.array([5, 10, 5, 10, 10])).all()

u = Update([1, 2], [3,2])
assert v.dot(u) == 40

v.values = np.array([5, 30, 10, 10, 10])
v.assign_madd(u, 2)
assert (v.values == np.array([5, 36, 14, 10, 10])).all()

u.assign_mul(2)
assert (u.values == np.array([6, 4])).all()

u.assign_madd(Update([2, 3], [4, 5]), 2)
assert (u.positions == np.array([1, 2, 3])).all()
assert (u.values == np.array([6, 12, 10])).all()
u.assign_madd(Update([0, 1, 0], [1, -1, 1]), 2)
assert (u.positions == np.array([1, 2, 3, 0])).all()
assert (u.values == np.array([4, 12, 10, 4])).all()

In [8]:
v.assign_madd(Update(), 1)
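
One subtlety of the NumPy idiom used in `Value.assign_madd`: `a[idx] += b` applies only one update per distinct index when `idx` contains repeats, whereas `np.add.at` accumulates every occurrence. The sketch below contrasts the two on toy arrays. In this notebook gradients pass through `Update.assign_madd`, which merges repeated positions first, so the difference only matters if an `Update` with duplicate positions is applied to a `Value` directly.

```python
import numpy as np

positions = np.array([1, 1, 3])        # index 1 appears twice
increments = np.array([1.0, 1.0, 1.0])

buffered = np.zeros(5)
buffered[positions] += increments      # the repeated index is updated only once

accumulated = np.zeros(5)
np.add.at(accumulated, positions, increments)  # every occurrence is added

print(buffered)     # index 1 holds 1.0
print(accumulated)  # index 1 holds 2.0
```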

### Part IV: Maximum Entropy POS Tagger
_step 1 - draw an oval; step 2 - draw the rest of the owl (c)_

In this section you will implement a simple linear model to predict POS tags.
Make sure you [read the article](W96-0213.pdf) before you proceed.

In [9]:
# Data Types:
Feature = collections.namedtuple('Feature', ['name', 'index', 'value', 'tag'])
Features = Update
Hypo = collections.namedtuple('Hypo', ['prev', 'pos', 'tagged_word', 'score'])
# prev: previous Hypo
# pos: position of word (0-based)
# tagged_word: tagging of source_sentence[pos]
# score: sum of scores over edges

TaggerParams = collections.namedtuple('TaggerParams', [
    'src_window',
    'dst_order',
    'max_suffix',
    'beam_size',
    'nparams'
    ])

#import cityhash
def h(x):
    """
    Compute a hash of any object from its repr.
    Uses the built-in hash(); the commented-out CityHash call is an alternative.
    Can be used to construct features.
    """
    return hash(repr(x))
    #return cityhash.CityHash64(repr(x))
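
Note that Python's built-in `hash` of a string is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so feature indices produced by `h` may differ between the session that trains a model and the session that loads it. Below is a minimal sketch of a process-independent alternative, assuming a standard-library digest is fast enough for the purpose. The name `stable_h` is hypothetical.

```python
import hashlib

def stable_h(x):
    """Hash any object via its repr; the result is the same in every process."""
    digest = hashlib.sha256(repr(x).encode('utf8')).digest()
    # Interpret the first 8 bytes as an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], 'big')

nparams = 16
feature = ('word', 0, 'about', 'IN')
index = stable_h(feature) % nparams   # position in the parameter vector
print(index)
```

Swapping it in for `h` would change every feature index, so models saved with one hash function cannot be reused with the other.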

In [10]:
class LinearModel:
    """
    A thing that computes score and gradient for given features.
    """

    def __init__(self, n):
        self._params = Value(n)

    def params(self):
        return self._params

    def score(self, features):
        """
        features: Update
        """
        return self._params.dot(features)

    def gradient(self, features, score):
        return features

In [11]:
class FeatureComputer:
    def __init__(self, tagger_params, source_sentence):
        self.tagger_params = tagger_params
        self.source_sentence = source_sentence
        self.source_len = len(source_sentence)
    
    def get_by_index(self, index):
        return TaggedWord(None, None) if index < 0 or index >= self.source_len else self.source_sentence[index]
    
    def compute_features(self, hypo, debug=False):
        """
        Compute features for a given Hypo and return Update.
        """
        word = hypo.tagged_word.text
        tag = hypo.tagged_word.tag
        pos = hypo.pos
        
        features = []
        features.append(Feature(name='word', index=0, value=word, tag=tag))
        
        for i in range(1, self.tagger_params.src_window + 1):
            features.append(Feature(name='word', index=-i, value=self.get_by_index(pos - i).text, tag=tag))
        
        for i in range(1, self.tagger_params.src_window + 1):
            features.append(Feature(name='word', index=i, value=self.get_by_index(pos + i).text, tag=tag))
    
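        # Tag-context features. Note: previous tags are read from source_sentence
        # (the tags stored with the input words), not from the hypo.prev chain.
        for i in range(1, self.tagger_params.dst_order):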
            indices = tuple(range(-i, 0))
            tags = tuple(map(lambda x: self.get_by_index(pos + x).tag, indices))
            features.append(Feature(name='tag', index=indices, value=tags, tag=tag))
        
        for i in range(1, self.tagger_params.max_suffix + 1):
            features.append(Feature(name='prefix', index=i, value=word[:i], tag=tag))
        
        for i in range(1, self.tagger_params.max_suffix + 1):
            features.append(Feature(name='suffix', index=i, value=word[-i:], tag=tag))
        
        features.append(Feature(name='contains number', index=None, value=any(c.isdigit() for c in word), tag=tag))
        features.append(Feature(name='contains uppercase character', index=None, value=any(c.isupper() for c in word), tag=tag))
        features.append(Feature(name='contains hyphen', index=None, value='-' in word, tag=tag))
        
        if debug:
            print('\n'.join(map(repr, features)))
        
        hashed_features = list(map(lambda x: h(x) % self.tagger_params.nparams, features))
        return Update(positions=hashed_features, values=[1]*len(hashed_features))

In [12]:
sentence = ['the', 'stories', 'about', 'well-heeled', 'communities', 'and', 'developers']
tags = ['DT', 'NNS', 'IN', 'JJ', 'NNS', 'CC', 'NNS']

tagged_sentence = [TaggedWord(text=sentence[i], tag=tags[i]) for i in range(len(sentence))]

hypos = []
for i, tagged_word in enumerate(tagged_sentence):
    hypos.append(Hypo(prev=None if i == 0 else hypos[i - 1], pos=i, tagged_word=tagged_word, score=0.5))

tagger_params = TaggerParams(src_window=2, dst_order=3, max_suffix=4, beam_size=4, nparams=16)
fc = FeatureComputer(tagger_params, tagged_sentence)
fc.compute_features(hypos[2], debug=True)
print()
fc.compute_features(hypos[3], debug=True)

Feature(name='word', index=0, value='about', tag='IN')
Feature(name='word', index=-1, value='stories', tag='IN')
Feature(name='word', index=-2, value='the', tag='IN')
Feature(name='word', index=1, value='well-heeled', tag='IN')
Feature(name='word', index=2, value='communities', tag='IN')
Feature(name='tag', index=(-1,), value=('NNS',), tag='IN')
Feature(name='tag', index=(-2, -1), value=('DT', 'NNS'), tag='IN')
Feature(name='prefix', index=1, value='a', tag='IN')
Feature(name='prefix', index=2, value='ab', tag='IN')
Feature(name='prefix', index=3, value='abo', tag='IN')
Feature(name='prefix', index=4, value='abou', tag='IN')
Feature(name='suffix', index=1, value='t', tag='IN')
Feature(name='suffix', index=2, value='ut', tag='IN')
Feature(name='suffix', index=3, value='out', tag='IN')
Feature(name='suffix', index=4, value='bout', tag='IN')
Feature(name='contains number', index=None, value=False, tag='IN')
Feature(name='contains uppercase character', index=None, value=False, tag='IN')
Fe

Update([13  5  8  3  7  8  5 10  8  0  0  5  4 13  6  2  6  4], [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1])

### Part V: Beam search

We can find the most likely tagging approximately using beam search. Like everything else, it comes with a separate interface.

In [13]:
import itertools

def with_score(hypo, score):
    return Hypo(prev=hypo.prev, pos=hypo.pos, tagged_word=hypo.tagged_word, score=score)

class BeamSearchTask:
    """
    An abstract beam search task. Can be used with beam_search() generic 
    function.
    """

    def __init__(self, tagger_params, source_sentence, model, tags):
        self.tagger_params = tagger_params
        self.source_sentence = source_sentence
        self.model = model
        self.tags = tags
        self.feature_computer = FeatureComputer(tagger_params, source_sentence)

    def total_num_steps(self):
        """
        Number of hypotheses between beginning and end (number of words in
        the sentence).
        """
        return len(self.source_sentence)

    def beam_size(self):
        return self.tagger_params.beam_size

    def expand(self, hypo):
        """
        Given Hypo, return a list of its possible expansions.
        'hypo' might be None -- return a list of initial hypos then.

        Compute hypotheses' scores inside this function!
        """
        prev_score = hypo.score if hypo else 0
        new_pos = hypo.pos + 1 if hypo else 0
        
        hypos = []
        if new_pos == len(self.source_sentence):
            return hypos
        
        for tag in self.tags:
            word = TaggedWord(self.source_sentence[new_pos].text, tag)
            new_hypo = Hypo(prev=hypo, pos=new_pos, tagged_word=word, score=0)
            score = self.model.score(self.feature_computer.compute_features(new_hypo))
            hypos.append(with_score(new_hypo, prev_score + score))
            
        return hypos

    def recombo_hash(self, hypo, debug=False):
        """
        If two hypos have the same recombination hashes, they can be collapsed
        together, leaving only the hypothesis with a better score.
        """
        tags = []
        for i in range(self.tagger_params.dst_order):
            tags.append(hypo.tagged_word.tag if hypo else None)
            hypo = hypo.prev if hypo else None
        
        if debug:
            print(tags)
            
        return h(tuple(tags))


def beam_search(beam_search_task):
    """
    Return list of stacks.
    Each stack contains several hypos, sorted by score in descending 
    order (i.e. better hypos first).
    """
    beam_size = beam_search_task.beam_size()
    n_steps = beam_search_task.total_num_steps()
    stacks = [[None]]
    for _ in range(n_steps):
        new_hypos = []
        for prev_hypo in stacks[-1]:
            new_hypos.extend(beam_search_task.expand(prev_hypo))
            
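        # Recombination: keep the best hypo per recombination hash. A dict is used
        # because itertools.groupby merges only adjacent items with equal keys.
        best_by_hash = {}
        for new_hypo in new_hypos:
            key = beam_search_task.recombo_hash(new_hypo)
            if key not in best_by_hash or new_hypo.score > best_by_hash[key].score:
                best_by_hash[key] = new_hypo

        stacks.append(sorted(best_by_hash.values(), key=lambda x: -x.score)[:beam_size])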
        
    return stacks[1:]

In [14]:
model = LinearModel(tagger_params.nparams)
bst = BeamSearchTask(tagger_params, tagged_sentence, model, tags[:6])
bst.expand(None)

[Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='DT'), score=0.0),
 Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='NNS'), score=0.0),
 Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='IN'), score=0.0),
 Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='JJ'), score=0.0),
 Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='NNS'), score=0.0),
 Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='CC'), score=0.0)]

In [15]:
get_hypo = lambda prev, i: Hypo(prev=prev, pos=i, tagged_word=TaggedWord(None, 'DT'), score=0)
bst.recombo_hash(get_hypo(None, 0), debug=True), bst.recombo_hash(get_hypo(get_hypo(None, 0), 1), debug=True)

['DT', None, None]
['DT', 'DT', None]


(-735786781911998934, -2662653630755002111)

In [16]:
stacks = beam_search(bst)
stacks[2]

[Hypo(prev=Hypo(prev=Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='DT'), score=0.0), pos=1, tagged_word=TaggedWord(text='stories', tag='DT'), score=0.0), pos=2, tagged_word=TaggedWord(text='about', tag='DT'), score=0.0),
 Hypo(prev=Hypo(prev=Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='DT'), score=0.0), pos=1, tagged_word=TaggedWord(text='stories', tag='DT'), score=0.0), pos=2, tagged_word=TaggedWord(text='about', tag='NNS'), score=0.0),
 Hypo(prev=Hypo(prev=Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='DT'), score=0.0), pos=1, tagged_word=TaggedWord(text='stories', tag='DT'), score=0.0), pos=2, tagged_word=TaggedWord(text='about', tag='IN'), score=0.0),
 Hypo(prev=Hypo(prev=Hypo(prev=None, pos=0, tagged_word=TaggedWord(text='the', tag='DT'), score=0.0), pos=1, tagged_word=TaggedWord(text='stories', tag='DT'), score=0.0), pos=2, tagged_word=TaggedWord(text='about', tag='JJ'), score=0.0)]
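
To see expansion, recombination and pruning in isolation, here is a self-contained toy version of the search. The two-tag alphabet, the scoring function `local_score` and all other names are invented for the illustration. In the notebook itself the recombination key comes from `recombo_hash` and the scores from the linear model.

```python
import itertools

TAGS = ['A', 'B']

def local_score(prev_tag, tag, pos):
    """Toy edge score: reward a tag change, plus a bonus for starting with 'A'."""
    score = 1.0 if prev_tag is not None and prev_tag != tag else 0.0
    if pos == 0 and tag == 'A':
        score += 0.5
    return score

def sequence_score(seq):
    """Total score of a full tag sequence (sum of edge scores)."""
    prev_tags = (None,) + tuple(seq[:-1])
    return sum(local_score(p, t, i) for i, (p, t) in enumerate(zip(prev_tags, seq)))

def toy_beam_search(n_steps, beam_size):
    beam = [((), 0.0)]  # each entry: (tag sequence so far, cumulative score)
    for pos in range(n_steps):
        # Recombination: the edge score depends only on the last tag, so among
        # candidates ending in the same tag only the best one can win later.
        best = {}
        for seq, score in beam:
            prev_tag = seq[-1] if seq else None
            for tag in TAGS:
                new_score = score + local_score(prev_tag, tag, pos)
                if tag not in best or new_score > best[tag][1]:
                    best[tag] = (seq + (tag,), new_score)
        # Pruning: keep the beam_size best survivors, best first.
        beam = sorted(best.values(), key=lambda c: -c[1])[:beam_size]
    return beam[0]

best_seq, best_score = toy_beam_search(n_steps=4, beam_size=2)
exhaustive_best = max(itertools.product(TAGS, repeat=4), key=sequence_score)
print(best_seq, best_score, exhaustive_best)
```

On this toy problem the beam result agrees with exhaustive search. In general beam search is approximate and may miss the best sequence when the beam is narrow.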

In [17]:
def tag_sentences(dataset, tagger_params, model, tags):
    """
    Main predict function.
    Tags all sentences in dataset. Dataset is a list of TaggedSentence; while 
    tagging, ignore existing tags.
    """
    tagged_dataset = []
    for sentence in dataset:
        bst = BeamSearchTask(tagger_params, sentence, model, tags)
        stacks = beam_search(bst)
        best_hypo = stacks[-1][0]
        
        tagged_sentence = []
        for _ in range(len(sentence)):
            tagged_sentence.append(best_hypo.tagged_word)
            best_hypo = best_hypo.prev
        
        tagged_dataset.append(list(reversed(tagged_sentence)))
        
    return tagged_dataset

In [18]:
tag_sentences([tagged_sentence], tagger_params, model, tags)

[[TaggedWord(text='the', tag='DT'),
  TaggedWord(text='stories', tag='DT'),
  TaggedWord(text='about', tag='DT'),
  TaggedWord(text='well-heeled', tag='DT'),
  TaggedWord(text='communities', tag='DT'),
  TaggedWord(text='and', tag='DT'),
  TaggedWord(text='developers', tag='DT')]]
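
`tag_sentences` recovers the final tagging by following `prev` links back from the best last hypothesis and reversing the result. The sketch below shows that backpointer walk in isolation. `Node` and `backtrack` are hypothetical stand-ins for `Hypo` and the loop above.

```python
import collections

Node = collections.namedtuple('Node', ['prev', 'tag'])

# Build a chain DET -> NOUN -> VERB; each node points to its predecessor.
chain = None
for tag in ['DET', 'NOUN', 'VERB']:
    chain = Node(prev=chain, tag=tag)

def backtrack(node):
    """Follow prev links from the last node and return tags in sentence order."""
    tags = []
    while node is not None:
        tags.append(node.tag)
        node = node.prev
    return tags[::-1]

print(backtrack(chain))
```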

### Part VI: Optimization objective and algorithm

Once we have defined our model and inference algorithm, we can define an optimization task: an object that computes the loss function and its gradients w.r.t. model parameters.

In [19]:
class OptimizationTask:
    """
    Optimization task that can be used with sgd().
    """

    def params(self):
        """
        Parameters which are optimized in this optimization task.
        Return Value.
        """
        raise NotImplementedError()

    def loss_and_gradient(self, golden_sentence):
        """
        Return (loss, gradient) on a specific example.

        loss: float
        gradient: Update
        """
        raise NotImplementedError()


class UnstructuredPerceptronOptimizationTask(OptimizationTask):
    def __init__(self, tagger_params, tags):
        self.tagger_params = TaggerParams(src_window=tagger_params.src_window, dst_order=tagger_params.dst_order,
                                          max_suffix=tagger_params.max_suffix, beam_size=len(tags),
                                          nparams=tagger_params.nparams)
        self.model = LinearModel(tagger_params.nparams)
        self.tags = tags

    def params(self):
        return self.model.params()

    def loss_and_gradient(self, golden_sentence):
        beam_search_task = BeamSearchTask(self.tagger_params, golden_sentence, self.model, self.tags)
        golden_hypo = None
        feature_computer = FeatureComputer(self.tagger_params, golden_sentence)
        loss = 0
        grad = Update()
        for i in range(len(golden_sentence)):
            new_hypos = beam_search_task.expand(golden_hypo)
            
            rival_hypo = max(new_hypos, key=lambda x: x.score)
            golden_hypo = [x for x in new_hypos if x.tagged_word.tag == golden_sentence[i].tag][0]

            rival_features = feature_computer.compute_features(rival_hypo)
            grad.assign_madd(self.model.gradient(rival_features, score=None), 1)
            
            golden_features = feature_computer.compute_features(golden_hypo)
            grad.assign_madd(self.model.gradient(golden_features, score=None), -1)
            
            loss += rival_hypo.score - golden_hypo.score
            
        return loss, grad
        
class StructuredPerceptronOptimizationTask(OptimizationTask):
    def __init__(self, tagger_params, tags):
        self.tagger_params = tagger_params
        self.model = LinearModel(tagger_params.nparams)
        self.tags = tags

    def params(self):
        return self.model.params()

    def loss_and_gradient(self, golden_sentence):
        # Do beam search.
        beam_search_task = BeamSearchTask(self.tagger_params, golden_sentence, self.model, self.tags)
        stacks = beam_search(beam_search_task)

        # Compute chain of golden hypos (and their scores!).
        golden_hypo = None
        hypos = []
        feature_computer = FeatureComputer(self.tagger_params, golden_sentence)
        for i in range(len(golden_sentence)):
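            new_golden_hypo = Hypo(prev=golden_hypo, pos=i, tagged_word=golden_sentence[i], score=0)
            # Accumulate the score along the chain so it is comparable with beam scores.
            prev_score = golden_hypo.score if golden_hypo else 0
            new_hypo_score = prev_score + self.model.score(feature_computer.compute_features(new_golden_hypo))
            golden_hypo = with_score(new_golden_hypo, new_hypo_score)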
            hypos.append(golden_hypo)

        # Find where to update.
        golden_head = None
        rival_head = None
        for i, hypo in enumerate(hypos):
            if hypo.score < stacks[i][-1].score:
                golden_head = hypo
                rival_head = stacks[i][0]
        
        if golden_head is None and rival_head is None:
            golden_head = hypos[-1] 
            rival_head = stacks[-1][0]

        # Compute gradient.
        loss = 0
        grad = Update()
        while golden_head and rival_head:
            rival_features = feature_computer.compute_features(rival_head)
            grad.assign_madd(self.model.gradient(rival_features, score=None), 1)

            golden_features = feature_computer.compute_features(golden_head)
            grad.assign_madd(self.model.gradient(golden_features, score=None), -1)

            loss += rival_head.score - golden_head.score
            
            golden_head = golden_head.prev
            rival_head = rival_head.prev

        return loss, grad

In [20]:
UnstructuredPerceptronOptimizationTask(tagger_params, tags).loss_and_gradient(tagged_sentence)

(0.0,
 Update([ 5 12  4  1 14  3 10  7  2  9  8  6  0 13], [ 3.  4.  4. -2. -6.  3.  6. -8. -6.  3.  5.  1. -5. -2.]))

In [21]:
StructuredPerceptronOptimizationTask(tagger_params, tags).loss_and_gradient(tagged_sentence)

(0.0,
 Update([ 7  4  8  5 12  0  9 10  2  3 14  1  6 13], [-8.  4.  5.  3.  4. -5.  3.  6. -6.  3. -6. -2.  1. -2.]))
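
Both tasks above follow the perceptron recipe: the loss is the score of the rival (model-preferred) hypothesis minus the score of the gold one, and for a linear model its gradient is the rival's feature vector minus the gold feature vector. Below is a minimal dense sketch with made-up feature vectors. The names `w`, `f_gold` and `f_rival` are illustrative.

```python
def dot(w, f):
    """Linear score: weights times features."""
    return sum(wi * fi for wi, fi in zip(w, f))

w = [0.0, 0.0, 0.0, 0.0]          # model weights
f_gold = [1, 0, 1, 0]             # features of the correct tagging
f_rival = [0, 1, 1, 0]            # features of the model's preferred tagging

learning_rate = 1.0
loss = dot(w, f_rival) - dot(w, f_gold)
grad = [r - g for r, g in zip(f_rival, f_gold)]            # d(loss)/d(w)
w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]   # gradient step

# After one step the gold tagging outscores the rival.
print(grad, w, dot(w, f_gold), dot(w, f_rival))
```

Features shared by both taggings (index 2 here) cancel in the gradient, which is why the updates stay sparse.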

### Part VII: optimizer

By this point we can define a model with parameters $\theta$ and a problem that computes gradients $ \partial L \over \partial \theta $ w.r.t. model parameters.

Optimization is performed by gradient descent: $ \theta := \theta - \alpha {\partial L \over \partial \theta} $

In order to speed up training, we use stochastic gradient descent that operates on minibatches of data.

In [22]:
import copy
import time

SGDParams = collections.namedtuple('SGDParams', [
    'epochs',
    'learning_rate',
    'minibatch_size',
    'average' # bool or int
    ])


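def make_batches(dataset, n_batches):
    """
    Make list of batches from a list of examples.
    """
    # Shuffle indices rather than the sentences themselves: sentences have
    # different lengths, and recent NumPy versions refuse to build ragged arrays.
    order = np.random.permutation(len(dataset))
    return [[dataset[i] for i in chunk] for chunk in np.array_split(order, n_batches)]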


def sgd(sgd_params, optimization_task, dataset, after_each_epoch_fn, after_each_batch_fn=None):
    """
    Run (averaged) SGD on a generic optimization task. Modify optimization
    task's parameters.

    After each epoch (and also before and after the whole training),
    run after_each_epoch_fn().
    """
    after_each_epoch_fn(stage='Initial quality')
    if sgd_params.average:
        params_sum = copy.deepcopy(optimization_task.params())
    
    n_batch = len(dataset)//sgd_params.minibatch_size + 1
    callback_params = {'n_epoch': sgd_params.epochs, 'n_batch': n_batch}
    for i in range(sgd_params.epochs):
        callback_params['epoch_i'] = i
        callback_params['epoch_start'] = time.time()
        for j, batch in enumerate(make_batches(dataset, n_batch)):
            callback_params['batch_i'] = j
            callback_params['batch_start'] = time.time()
            loss_sum = 0
            grad_sum = Update()
            for sentence in batch:
                loss, grad = optimization_task.loss_and_gradient(sentence)
                grad_sum.assign_madd(grad, 1)
                loss_sum += loss
            grad_avg = grad_sum
            grad_avg.assign_mul(1/sgd_params.minibatch_size)
            loss_avg = loss_sum/sgd_params.minibatch_size
            optimization_task.params().assign_madd(grad_avg, -sgd_params.learning_rate)
            if sgd_params.average:
                params_sum.assign_madd(optimization_task.params(), 1)
            callback_params['batch_end'] = time.time()
            callback_params['batch_loss'] = loss_avg
            if after_each_batch_fn:
                after_each_batch_fn(**callback_params)
        callback_params['epoch_end'] = time.time()
        after_each_epoch_fn(**callback_params)
    if sgd_params.average:
        params_avg = params_sum
        params_avg.assign_mul(1/(sgd_params.epochs*(len(dataset)//sgd_params.minibatch_size + 1)))
        optimization_task.params().assign(params_avg)
    after_each_epoch_fn(stage='Final quality')
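
The same loop structure (shuffle, split into minibatches, step along the averaged gradient, keep a running sum of parameters for averaging) can be tried on a toy problem. The sketch below minimises the mean of 0.5 * (theta - x)^2 over a few made-up numbers, whose optimum is the data mean. It is an illustration under those assumptions, not the tagger's optimizer, and all names in it are hypothetical.

```python
import random

def minibatches(data, batch_size, rng):
    """Shuffle a copy of the data and cut it into consecutive batches."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def averaged_sgd(data, epochs, learning_rate, batch_size, seed=0):
    rng = random.Random(seed)
    theta, theta_sum, n_updates = 0.0, 0.0, 0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            # Gradient of 0.5 * (theta - x)^2, averaged over the batch.
            grad = sum(theta - x for x in batch) / len(batch)
            theta -= learning_rate * grad
            theta_sum += theta          # running sum for parameter averaging
            n_updates += 1
    return theta, theta_sum / n_updates

data = [1.0, 2.0, 3.0, 4.0]
theta, theta_avg = averaged_sgd(data, epochs=200, learning_rate=0.1, batch_size=2)
print(theta, theta_avg)  # both should end up close to the data mean, 2.5
```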


### Part VIII: Training loop

The train function combines everything defined above to train a new model.

In [23]:
import os
import pprint
import pickle

def train(
    tags='./data/tags',
    train_dataset='./data/en-ud-train.conllu',
    dev_dataset='./data/en-ud-dev.conllu',
    model_name='./model.npz',
    
    sgd_epochs=15,
    sgd_learning_rate=0.01,
    sgd_minibatch_size=32,
    sgd_average=True,
    
    # Number of context words in the input sentence to use for features
    tagger_src_window=2,
    
    # Number of context tags in output tagging to use for features
    tagger_dst_order=3,
    
    # Maximal number of prefix/suffix letters to use for features
    tagger_max_suffix=4,
    
    # Width for beam search (0 means unstructured)
    beam_size=1,
    
    # Parameter vector size (for hashing)
    nparams= 2 ** 22,
):
    """ Train a pos-tagger model and save it's parameters to :model: """

    # Beam size.
    optimization_task_cls = StructuredPerceptronOptimizationTask
    if beam_size == 0:
        beam_size = 1
        optimization_task_cls = UnstructuredPerceptronOptimizationTask

    # Parse cmdargs.
    tags = read_tags(tags)
    train_dataset = read_tagged_sentences(train_dataset)
    dev_dataset = read_tagged_sentences(dev_dataset)
    params = None
    if os.path.exists(model_name):
        with open(model_name, 'rb') as f:
            params = pickle.load(f)
    sgd_params = SGDParams(
        epochs=sgd_epochs,
        learning_rate=sgd_learning_rate,
        minibatch_size=sgd_minibatch_size,
        average=sgd_average
        )
    tagger_params = TaggerParams(
        src_window=tagger_src_window,
        dst_order=tagger_dst_order,
        max_suffix=tagger_max_suffix,
        beam_size=beam_size,
        nparams=nparams
        )

    # Load optimization task
    optimization_task = optimization_task_cls(tagger_params, tags)
    if params is not None:
        print(f'Loading parameters from {model_name}')
        optimization_task.params().assign(params)

    # Validation.
    def after_each_epoch_fn(*args, **kwargs):
        model = LinearModel(nparams)
        model.params().assign(optimization_task.params())
        tagged_sentences = tag_sentences(dev_dataset, tagger_params, model, tags)
        quality = tagging_quality(out=tagged_sentences, ref=dev_dataset)
        
        if 'stage' in kwargs:
            stage = kwargs["stage"]
            print(f'\r{stage}: {quality}')
        else:
            epoch_i = kwargs['epoch_i']
            n_epoch = kwargs['n_epoch']
            start = kwargs['epoch_start']
            end = kwargs['epoch_end']
            print(f'\rEpoch #{epoch_i + 1}/{n_epoch} ({get_batch_string(*args, **kwargs)}), epoch_quality: {quality:.3f}, epoch_time: {(end - start)/60:.3f} min')
        
        with open(model_name, 'wb') as f:
            pickle.dump(optimization_task.params(), f)
    
    def after_each_batch_fn(*args, **kwargs):
        epoch_i = kwargs['epoch_i']
        n_epoch = kwargs['n_epoch']
        print(f'\rEpoch #{epoch_i + 1}/{n_epoch} ({get_batch_string(*args, **kwargs)})', end='')
    
    def get_batch_string(*args, **kwargs):
        start = kwargs['batch_start']
        end = kwargs['batch_end']
        batch_i = kwargs['batch_i']
        n_batch = kwargs['n_batch']
        batch_loss = kwargs['batch_loss']
        return f'batch #{batch_i + 1}/{n_batch}, batch_loss: {batch_loss:.5f}, batch_time: {end - start:.3f} s'
    
    # Run SGD.
    sgd(sgd_params, optimization_task, train_dataset, after_each_epoch_fn, after_each_batch_fn)


In [26]:
# train a model with default params
train(model_name='./model_default.npz')

Initial quality: 0.16280971461280563
Epoch #1/15 (batch #392/392, batch_loss: 12.74798, batch_time: 2.121 s), epoch_quality: 0.805, epoch_time: 7.134 min
Epoch #2/15 (batch #392/392, batch_loss: 10.13711, batch_time: 1.406 s), epoch_quality: 0.918, epoch_time: 8.568 min
Epoch #3/15 (batch #392/392, batch_loss: 12.79489, batch_time: 1.263 s), epoch_quality: 0.923, epoch_time: 10.028 min
Epoch #4/15 (batch #392/392, batch_loss: 13.49482, batch_time: 2.703 s), epoch_quality: 0.922, epoch_time: 8.920 min
Epoch #5/15 (batch #392/392, batch_loss: 6.41115, batch_time: 0.795 s), epoch_quality: 0.919, epoch_time: 6.542 min
Epoch #6/15 (batch #392/392, batch_loss: 10.99705, batch_time: 0.934 s), epoch_quality: 0.925, epoch_time: 6.416 min
Epoch #7/15 (batch #392/392, batch_loss: 4.24250, batch_time: 0.619 s), epoch_quality: 0.928, epoch_time: 6.319 min
Epoch #8/15 (batch #392/392, batch_loss: 7.29345, batch_time: 0.766 s), epoch_quality: 0.931, epoch_time: 6.386 min
Epoch #9/15 (batch #392/392, 

### Part IX: Evaluate the trained model

In [36]:
import sys

def test(
    tags='./data/tags',
    dataset='./data/en-ud-dev.conllu',
    model='./model.npz',
    
    # model and inference params; see train for their description
    tagger_src_window=2,
    tagger_dst_order=3,
    tagger_max_suffix=4,
    beam_size=1,
    nparams= 2 ** 22,
    out_file='./result.conllu',
    print_quality=True
):


    tags = read_tags(tags)
    dataset = read_tagged_sentences(dataset)
    with open(model, 'rb') as f:
        params = pickle.load(f)
    tagger_params = TaggerParams(
        src_window=tagger_src_window,
        dst_order=tagger_dst_order,
        max_suffix=tagger_max_suffix,
        beam_size=beam_size,
        nparams=nparams
        )

    # Load model.
    model = LinearModel(params.values.shape[0])
    model.params().assign(params)

    # Tag all sentences.
    tagged_sentences = tag_sentences(dataset, tagger_params, model, tags)

    # Write tagged sentences.
    with open(out_file, 'w+', encoding='utf8') as f:
        for tagged_sentence in tagged_sentences:
            write_tagged_sentence(tagged_sentence, f)

    # Measure and print quality.
    if print_quality:
        q = pprint.pformat(tagging_quality(out=tagged_sentences, ref=dataset))
        print(q, file=sys.stderr)


In [28]:
# test 
test(model='./model_default.npz', out_file='./result_default.conllu')

# sanity check: accuracy > 90%.

0.9403058304031401


In [37]:
# tagging
test(model='./model_default.npz', dataset='./data/en-ud-test-notags.conllu', out_file='./en-ud-test-tagged.conllu', print_quality=False)

### Part X: play with it

_This part is optional_

Once you've built something, it's only natural to test the limits of your contraption.

At minimum, we want you to find out how the default model's accuracy depends on __beam size__.

To get maximum points, your model should get final quality >= 93% 

Any further analysis is welcome, as always.

In [None]:
train(model_name='./model_unstructured.npz', beam_size=0)

In [31]:
test(model='./model_unstructured.npz', out_file='./result_unstructured.conllu', beam_size=1)

0.9318014555564641


In [32]:
# playing with it takes too long
train(model_name='./model_beam2.npz', beam_size=2)

Initial quality: 0.16280971461280563
Epoch #1/15 (batch #392/392, batch_loss: 8.92628, batch_time: 1.859 s), epoch_quality: 0.851, epoch_time: 12.831 min
Epoch #2/15 (batch #392/392, batch_loss: 8.21362, batch_time: 1.717 s), epoch_quality: 0.918, epoch_time: 12.337 min
Epoch #3/15 (batch #392/392, batch_loss: 6.90967, batch_time: 1.355 s), epoch_quality: 0.926, epoch_time: 11.991 min
Epoch #4/15 (batch #392/392, batch_loss: 14.82606, batch_time: 2.062 s), epoch_quality: 0.753, epoch_time: 11.306 min
Epoch #5/15 (batch #388/392, batch_loss: 11.47744, batch_time: 1.706 s)

KeyboardInterrupt: 

In [33]:
test(model='./model_beam2.npz', out_file='./result_beam2.conllu', beam_size=2)

0.7530460381061411
