# Homework 5. Sequence Tagging with LSTM

Welcome to Homework 5! 

The homework contains several tasks. The number of points awarded for a correct solution is given in each task header. The maximum number of points for each homework is _six_.

The **grading** for each task is the following:
- correct answer - **full points**
- insufficient solution or solution resulting in the incorrect output - **half points**
- no answer or completely wrong solution - **no points**

Even if you don't know how to solve the task, we encourage you to write down your thoughts and progress and try to address the issues that stop you from completing the task.

When working on the written tasks, try to make your answers short and accurate. Most of the time, a question can be answered in 1-3 sentences.

When writing code, make it readable. Choose appropriate names for your variables (`a = 'cat'` - not good, `word = 'cat'` - good). Avoid lines of code longer than 100 characters (79 characters is ideal). If needed, add comments to your code; however, good code should be easily readable without them :)

Finally, all your answers should be written only by yourself. Copying them from other sources will be considered academic fraud. You can discuss the tasks with your classmates, but each solution must be individual.

<font color='red'>**Important!:**</font> **before sending your solution, do the `Kernel -> Restart & Run All` to ensure that all your code works.**

## Task 1. Download resources for your language (0.5 points)

In this homework, you are going to improve the POS tagger for your native language that we built during the Lab. In particular, you are going to add a character-level model to capture the inner structure of each word, which should help to predict the correct tag. If there are no resources available for your language, you can choose any other language.

__What is your native language?__

<font color='red'>Your answer here</font>

To start with, import all the packages below.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.sampler import SubsetRandomSampler
from torch.nn.utils.rnn import pad_sequence, pack_sequence, pack_padded_sequence, pad_packed_sequence, PackedSequence
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import random
import numpy as np
from time import time
from datetime import datetime

from pathlib import Path
from collections import Counter

from typing import List, Dict

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Get the pretrained word vectors for your native language from [FastText](https://fasttext.cc/docs/en/crawl-vectors.html). For this model, you need the __text__ vectors (not bin!). If they are not available for your language, you can choose any other language. In the cell below, replace `...` with the link.
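
To make the text format more concrete, here is a minimal sketch of how such a file can be read. It assumes the usual layout of a text vector file: a header line with the number of vectors and their dimensionality, followed by one `word v1 ... vD` line per word. The function name `read_vec_header_and_rows` and the toy data are hypothetical and only serve as an illustration.

```python
import io


def read_vec_header_and_rows(handle, limit=None):
    """Parse a text vector stream into (rows, dim, {word: vector})."""
    # Header line: "<number of vectors> <dimensionality>"
    header = handle.readline().split()
    rows, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        # The last `dim` fields are the vector; everything before is the token.
        word = ' '.join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
        if limit is not None and len(vectors) >= limit:
            break
    return rows, dim, vectors


# Toy stand-in for a real vector file (hypothetical values).
toy_file = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "sat 0.9 1.0 1.1 1.2\n"
)
rows, dim, vectors = read_vec_header_and_rows(toy_file, limit=2)
print(rows, dim, sorted(vectors))  # 3 4 ['cat', 'the']
```

The `limit` argument plays a role similar to a maximum vocabulary size: only the first, most frequent words are kept.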

__Are there FastText vectors available for your language? If yes, provide the link to it below.__

<font color='red'>Your answer here</font>

In [None]:
!wget ...

In the first line, replace `...` with the filename of the downloaded vector archive. In the last line, replace `...` with the same name without the `.gz` extension.

In [None]:
!gunzip ...
!mkdir vector_cache/
!mv ... vector_cache/

Next, we need data to train on. For this task, we are going to use [Universal Dependencies](https://universaldependencies.org/) data. It has labelled corpora for morphological tagging and syntactic parsing for over 70 languages. Choose your language on the official UD page, pick a treebank that you like and follow the GitHub link to it. Then, on GitHub, copy the link from the green "Clone or download" button and put it in the cell below.

Also, replace the name of your treebank in the `!mv` command.

For example, if I choose the EDT treebank for Estonian from [here](https://universaldependencies.org/#estonian-treebanks), the GitHub link is going to be `https://github.com/UniversalDependencies/UD_Estonian-EDT.git` and the name of the treebank is `UD_Estonian-EDT`, which is the name of the repository.
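
If you prefer not to type the treebank name by hand, it can be derived from the clone link. The helper below is a small sketch; `treebank_name` is a hypothetical name, and it assumes the link ends with the repository name, optionally followed by `.git`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def treebank_name(git_url: str) -> str:
    """Return the repository name: the last path component without `.git`."""
    path = PurePosixPath(urlparse(git_url).path)
    return path.stem if path.suffix == '.git' else path.name


url = 'https://github.com/UniversalDependencies/UD_Estonian-EDT.git'
print(treebank_name(url))  # UD_Estonian-EDT
```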

__Is there a UD treebank available for your language? If yes, provide the link to it below.__

<font color='red'>Your answer here</font>

Replace the `...` with the GitHub link to the repository that you've chosen.

In [None]:
!git clone ...

Replace the `...` with the name of the treebank that you've chosen.

In [None]:
!mkdir data/
!mv ... data/

This part is mostly the same as in the [Lab 6]().

**Don't forget to change the `VEC_PATH` and `DATA_PATH` variables to match your data!**

In [None]:
PAD = '<PAD>'
PAD_ID = 0
UNK = '<UNK>'
UNK_ID = 1

VOCAB_PREFIX = [PAD, UNK]

VEC_PATH = Path('vector_cache') / '...'
DATA_PATH = Path('data') / '...'
MAX_VOCAB = 25000

batch_size = 64
validation_split = .3
shuffle_dataset = True
random_seed = 42

You can consult the Lab materials if you have any questions about the vocab classes.

In [None]:
class BaseVocab:
    def __init__(self, data, idx=0, lower=False):
        self.data = data
        self.lower = lower
        self.idx = idx
        self.build_vocab()
        
    def normalize_unit(self, unit):
        if self.lower:
            return unit.lower()
        else:
            return unit
        
    def unit2id(self, unit):
        unit = self.normalize_unit(unit)
        if unit in self._unit2id:
            return self._unit2id[unit]
        else:
            return self._unit2id[UNK]
    
    def id2unit(self, id):
        return self._id2unit[id]
    
    def map(self, units):
        return [self.unit2id(unit) for unit in units]

    def unmap(self, ids):
        return [self.id2unit(idx) for idx in ids]
        
    def build_vocab(self):
        raise NotImplementedError()
        
    def __len__(self):
        return len(self._unit2id)

In [None]:
class PretrainedWordVocab(BaseVocab):
    def build_vocab(self):
        self._id2unit = VOCAB_PREFIX + self.data
        self._unit2id = {w:i for i, w in enumerate(self._id2unit)}

In [None]:
class WordVocab(BaseVocab):
    def build_vocab(self):
        if self.lower:
            counter = Counter([w[self.idx].lower() for sent in self.data for w in sent])
        else:
            counter = Counter([w[self.idx] for sent in self.data for w in sent])

        self._id2unit = VOCAB_PREFIX + list(sorted(list(counter.keys()), key=lambda k: counter[k], reverse=True))
        self._unit2id = {w:i for i, w in enumerate(self._id2unit)}

Here, we introduce a character vocab that is going to store the mappings for individual characters rather than words.

In [None]:
class CharVocab(BaseVocab):
    def build_vocab(self):
        counter = Counter([c for sent in self.data for w in sent for c in w[self.idx]])
        self._id2unit = VOCAB_PREFIX + list(sorted(list(counter.keys()), key=lambda k: (counter[k], k), reverse=True))
        self._unit2id = {w:i for i, w in enumerate(self._id2unit)}
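
The idea behind the character vocabulary can be tried out in isolation. The sketch below is a standalone approximation of what `CharVocab` does, not the class itself: it counts characters, sorts them by frequency (breaking ties by the character, so the order is deterministic) and reserves the first two ids for padding and unknown characters. The name `build_char_index` and the toy data are hypothetical.

```python
from collections import Counter

PAD, UNK = '<PAD>', '<UNK>'


def build_char_index(sentences):
    """Map each character to an id; ids 0 and 1 are reserved for PAD and UNK."""
    counter = Counter(c for sent in sentences for word in sent for c in word)
    # Most frequent first; ties are broken by the character itself.
    ordered = sorted(counter, key=lambda c: (counter[c], c), reverse=True)
    id2unit = [PAD, UNK] + ordered
    return {unit: i for i, unit in enumerate(id2unit)}


toy_sentences = [['aab', 'ba'], ['cd']]
char2id = build_char_index(toy_sentences)
print(char2id)
# {'<PAD>': 0, '<UNK>': 1, 'a': 2, 'b': 3, 'd': 4, 'c': 5}

# Characters that were never seen fall back to the UNK id.
print(char2id.get('z', char2id[UNK]))  # 1
```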

You can consult the Lab materials if you have any questions about building the dataset.

In [None]:
class Pretrain:
    def __init__(self, vec_filename, max_vocab=-1):
        self._vec_filename = vec_filename
        self._max_vocab = max_vocab
        
    @property
    def vocab(self):
        if not hasattr(self, '_vocab'):
            self._vocab, self._emb = self.read()
        return self._vocab
    
    @property
    def emb(self):
        if not hasattr(self, '_emb'):
            self._vocab, self._emb = self.read()
        return self._emb
        
    def read(self):
        if self._vec_filename is None:
            raise Exception("Vector file is not provided.")
        print(f"Reading pretrained vectors from {self._vec_filename}...")
        
        words, emb, failed = self.read_from_file(self._vec_filename, open_func=open)
        
        if failed > 0: # recover failure
            emb = emb[:-failed]
        if len(emb) - len(VOCAB_PREFIX) != len(words):
            raise Exception("Loaded number of vectors does not match number of words.")
            
        # Use a fixed vocab size
        if self._max_vocab > len(VOCAB_PREFIX) and self._max_vocab < len(words):
            words = words[:self._max_vocab - len(VOCAB_PREFIX)]
            emb = emb[:self._max_vocab]
                
        vocab = PretrainedWordVocab(words, lower=True)
        
        return vocab, emb
        
    def read_from_file(self, filename, open_func=open):
        """
        Open a vector file using the provided function and read from it.
        """
        first = True
        words = []
        failed = 0
        with open_func(filename, 'rb') as f:
            for i, line in enumerate(f):
                try:
                    line = line.decode()
                except UnicodeDecodeError:
                    failed += 1
                    continue
                if first:
                    # the first line contains the number of word vectors and the dimensionality
                    first = False
                    line = line.strip().split(' ')
                    rows, cols = [int(x) for x in line]
                    emb = np.zeros((rows + len(VOCAB_PREFIX), cols), dtype=np.float32)
                    continue

                line = line.rstrip().split(' ')
                emb[i+len(VOCAB_PREFIX)-1-failed, :] = [float(x) for x in line[-cols:]]
                words.append(' '.join(line[:-cols]))
        return words, emb, failed

In [None]:
FIELD_NUM = 10

class Word:
    def __init__(self, word: List[str]):
        self._id = word[0]
        self._text = word[1]
        self._lemma = word[2]
        self._upos = word[3]
        self._xpos = word[4]
        self._feats = word[5]
        self._head = word[6]
        self._deprel = word[7]
        self._deps = word[8]
        self._misc = word[9]

    @property
    def id(self):
        return self._id

    @property
    def text(self):
        return self._text

    @property
    def lemma(self):
        return self._lemma

    @property
    def upos(self):
        return self._upos

    @property
    def xpos(self):
        return self._xpos

    @property
    def feats(self):
        return self._feats

    @property
    def head(self):
        return self._head

    @property
    def deprel(self):
        return self._deprel

    @property
    def deps(self):
        return self._deps

    @property
    def misc(self):
        return self._misc


class Sentence:
    def __init__(self, words: List[List[str]]):
        self._words = [Word(w) for w in words]

    @property
    def words(self):
        return self._words

class Document:
    def __init__(self, file_path):
        self._sentences = []
        self.load_conll(open(file_path, encoding='utf-8'))


    def load_conll(self, f, ignore_gapping=True):
        """ Load the file or string into the CoNLL-U format data.
        Input: file or string reader, where the data is in CoNLL-U format.
        Output: a list of list of list for each token in each sentence in the data, where the innermost list represents 
        all fields of a token.

        Taken and modified from Stanza: https://github.com/stanfordnlp/stanza/blob/master/stanza/utils/conll.py
        Stanza is released under the Apache License, Version 2.0.
        """
        # f is open() or io.StringIO()
        doc, sent = [], []
        for line in f:
            line = line.strip()
            if len(line) == 0:
                if len(sent) > 0:
                    doc.append(Sentence(sent))
                    sent = []
            else:
                if line.startswith('#'): # skip comment line
                    continue
                array = line.split('\t')
                if ignore_gapping and '.' in array[0]:
                    continue
                assert len(array) == FIELD_NUM, \
                        f"Cannot parse CoNLL line: expecting {FIELD_NUM} fields, {len(array)} found."
                sent += [array]
        if len(sent) > 0:
            doc.append(Sentence(sent))
        self._sentences = doc

    
    @property
    def sentences(self):
        return self._sentences


    def get(self, fields, as_sentences=False):
        """Taken and modified from Stanza: https://github.com/stanfordnlp/stanza/blob/master/stanza/models/common/doc.py
        Stanza is released under the Apache License, Version 2.0.
        """
        assert isinstance(fields, list), "Must provide field names as a list."
        assert len(fields) >= 1, "Must have at least one field."

        results = []
        for sentence in self.sentences:
            cursent = []
            units = sentence.words
            for unit in units:
                if len(fields) == 1:
                    cursent += [getattr(unit, fields[0])]
                else:
                    cursent += [[getattr(unit, field) for field in fields]]

            # decide whether append the results as a sentence or a whole list
            if as_sentences:
                results.append(cursent)
            else:
                results += cursent
        return results

For the dataset, we are going to add the new `CharVocab` and preprocess each word character by character with it. For example, if you have a character vocabulary like this:

`{'a': 0, 'b': 1, ..., 'y': 24, 'z': 25, 'A': 26, 'B': 27, ..., 'Y': 50, 'Z': 51}`

Then a sentence `['I', 'like', 'cats']` is going to be transformed into `[[34], [11, 8, 10, 4], [2, 0, 19, 18]]`.
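
The mapping can be checked with a few lines of plain Python. This is only a sketch with the toy vocabulary from the example above; the real `CharVocab` additionally reserves ids for `<PAD>` and `<UNK>`, so the actual numbers in your dataset will differ. The name `encode_sentence` is hypothetical.

```python
import string

# Toy vocabulary from the text: a-z -> 0-25, A-Z -> 26-51.
alphabet = string.ascii_lowercase + string.ascii_uppercase
char2id = {c: i for i, c in enumerate(alphabet)}


def encode_sentence(words):
    """Turn every word into a list of character ids."""
    return [[char2id[c] for c in word] for word in words]


encoded = encode_sentence(['I', 'like', 'cats'])
print(encoded)  # [[34], [11, 8, 10, 4], [2, 0, 19, 18]]
```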

In [None]:
class CONLLUDataset(Dataset):
    def __init__(self, doc: Document, pretrain: Pretrain, 
                 vocab: Dict[str, BaseVocab] = None, test: bool = False):
        self.pretrain_vocab = pretrain.vocab
        self.test = test
        data = self.load_doc(doc)

        if vocab is None:
            self.vocab = self.init_vocab(data)
        else:
            self.vocab = vocab

        self.data = self.preprocess(data, self.vocab, self.pretrain_vocab)

    def init_vocab(self, data: List) -> Dict:
        wordvocab = WordVocab(data, idx=0)
        charvocab = CharVocab(data, idx=0)
        uposvocab = WordVocab(data, idx=1)
        vocab = {
            'word': wordvocab,
            'char': charvocab,
            'upos': uposvocab}
        return vocab

    def preprocess(self, data: List, vocab: Dict[str, BaseVocab], 
                   pretrain_vocab: PretrainedWordVocab) -> List[List[int]]:
        processed = []
        for sent in data:
            processed_sent = [vocab['word'].map([w[0] for w in sent])]
            processed_sent += [[vocab['char'].map([char for char in w[0]]) for w in sent]]
            processed_sent += [vocab['upos'].map([w[1] for w in sent])]
            processed_sent += [pretrain_vocab.map([w[0].lower() for w in sent])]
            processed.append(processed_sent)
        return processed
        
    def load_doc(self, doc: Document) -> List:
        data = doc.get(['text', 'upos'], as_sentences=True)
        return data
            
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]

In [None]:
pretrain = Pretrain(VEC_PATH, MAX_VOCAB)

Put the correct path to your train file here.

In [None]:
train_doc = Document(DATA_PATH / '...')
train_dataset = CONLLUDataset(train_doc, pretrain)

Put the correct path to your dev file here.

In [None]:
vocab = train_dataset.vocab
dev_doc = Document(DATA_PATH / '...')
dev_dataset = CONLLUDataset(dev_doc, pretrain, vocab=vocab, test=True)

You can look inside the first sentence to see how the preprocessed data looks.

In [None]:
train_dataset[0]

We are going to pad the characters and save the original lengths for each word in a sentence to reconstruct the correct order later.
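
The bookkeeping is easier to follow on a tiny example without tensors. The sketch below uses plain lists and hypothetical helper names (`flatten_and_pad`, `regroup`): all words of a batch are flattened into one list and padded to the longest word, and the saved sentence lengths are later used to cut the flat list back into sentences.

```python
PAD_ID = 0


def flatten_and_pad(batch_chars):
    """Flatten the words of all sentences and pad them to the same length."""
    sent_lens = [len(sent) for sent in batch_chars]
    words = [word for sent in batch_chars for word in sent]
    word_lens = [len(word) for word in words]
    max_len = max(word_lens)
    padded = [word + [PAD_ID] * (max_len - len(word)) for word in words]
    return padded, sent_lens, word_lens


def regroup(rows, sent_lens):
    """Cut a flat list of per-word rows back into sentences."""
    grouped, start = [], 0
    for length in sent_lens:
        grouped.append(rows[start:start + length])
        start += length
    return grouped


# Two sentences: the first has two words, the second has one.
batch = [[[5], [4, 3, 2]], [[7, 6]]]
padded, sent_lens, word_lens = flatten_and_pad(batch)
print(padded)     # [[5, 0, 0], [4, 3, 2], [7, 6, 0]]
print(sent_lens)  # [2, 1]
print(word_lens)  # [1, 3, 2]
print(regroup(padded, sent_lens))
# [[[5, 0, 0], [4, 3, 2]], [[7, 6, 0]]]
```

In the notebook, `pad_sequence` does the padding and the regrouping happens on tensors, but the roles of `sent_lens` and `word_lens` are the same.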

In [None]:
def pad_collate(batch):
    (sents, chars, upos, pretrained) = zip(*batch)

    sent_lens = [len(s) for s in sents]
    word_lens = [len(c) for w in chars for c in w]

    sents = [torch.LongTensor(w).to(device) for w in sents]
    chars = [torch.LongTensor(c).to(device) for w in chars for c in w]
    upos = [torch.LongTensor(u).to(device) for u in upos]
    pretrained = [torch.LongTensor(p).to(device) for p in pretrained]

    sent_pad = pad_sequence(sents, batch_first=True, padding_value=PAD_ID)
    chars_pad = pad_sequence(chars, batch_first=True, padding_value=PAD_ID)
    upos_pad = pad_sequence(upos, batch_first=True, padding_value=PAD_ID)
    pretrained_pad = pad_sequence(pretrained, batch_first=True, padding_value=PAD_ID)

    sent_pad = sent_pad.to(device)
    chars_pad = chars_pad.to(device)
    upos_pad = upos_pad.to(device)
    pretrained_pad = pretrained_pad.to(device)

    return sent_pad, chars_pad, upos_pad, pretrained_pad, sent_lens, word_lens

In [None]:
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle_dataset, collate_fn=pad_collate)

In [None]:
dev_loader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=shuffle_dataset, collate_fn=pad_collate)

## Task 2. Build a character-level LSTM model (3 points)

You already know that we can have a vector representation of a word that is learned from its context. We can use these vectors to capture semantic relations between words.

In addition to that we can learn an inner representation of a word, or its character-level embedding. This can help to capture morphological information about a word.

We can do this by building another LSTM model and taking its last hidden state as the character-level representation of a word. The input to the model is going to be a stream of characters for each word in a batch.

### Task 2.1. Define an Embedding layer (0.5 points)

Create an [`nn.Embedding`](https://pytorch.org/docs/stable/nn.html?highlight=embedding#torch.nn.Embedding) layer. The input size, or the number of embeddings, should be the size of the character vocabulary (the `CONLLUDataset` class shows how to access the vocabularies). The output size, or the size of each vector, is `char_emb_dim`. The padding index is `0`.

### Task 2.2. Define an LSTM layer (0.5 points)

Create an [`nn.LSTM`](https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM) layer. The input size should be the output size of the embedding layer. The hidden size is `char_hidden_dim`. The number of layers is `char_num_layers`. Set the `batch_first` parameter to `True`. Set the dropout to `0` if the number of layers is `1`, and to `dropout` otherwise.

### Task 2.3. Embed the input (0.5 points)

Now, you will implement a forward pass.

Run the character input (`chars_pad`) through the embedding layer. Apply the dropout to the embeddings and save the result into the `char_emb` variable.

### Task 2.4. Apply the LSTM layer (0.5 points)

Pass the packed embeddings to the LSTM layer. You are going to keep only the last hidden state, which serves as the representation of the word.

In [None]:
class CharLSTM(nn.Module):
    def __init__(self, vocab: Dict[str, BaseVocab], char_emb_dim: int,
                 char_hidden_dim: int, char_num_layers: int, dropout: float):
        super().__init__()

        ### Task 2.1 starts here ###
        self.char_emb = ...
        ### Task 2.1 ends here ###

        ### Task 2.2 starts here ###
        self.char_lstm = ...
        self.char_lstm_h_init = nn.Parameter(torch.zeros(char_num_layers, 1, char_hidden_dim))
        self.char_lstm_c_init = nn.Parameter(torch.zeros(char_num_layers, 1, char_hidden_dim))
        ### Task 2.2 ends here ###

        self.dropout = nn.Dropout(dropout)

    def forward(self, chars_pad, sent_lens, word_lens):
        ### Task 2.3 starts here ###
        char_emb = ...
        ### Task 2.3 ends here ###

        batch_size = char_emb.size(0)
        char_emb = pack_padded_sequence(char_emb, word_lens, batch_first=True, enforce_sorted=False)

        ### Task 2.4 starts here ###
        _, (h, _) = self.char_lstm(
            ...,
            (self.char_lstm_h_init.expand(-1, batch_size, -1).contiguous(),
             self.char_lstm_c_init.expand(-1, batch_size, -1).contiguous())
        )
        ### Task 2.4 ends here ###

        # Use the final hidden state of the last LSTM layer as the word representation
        result = h[-1]
        # Chunk the output back into words
        result = pack_sequence(result.split(sent_lens), enforce_sorted=False)

        return result

### Task 2.5. Incorporate the CharLSTM into the Tagger model (0.5 points)

Now your new model should have the following architecture:

![img](https://github.com/501Good/tartu-nlp-2020/blob/master/homeworks/hw4/img1.png?raw=1)

Initialize the CharLSTM model that you've just created. Pass `vocab`, `char_emb_dim`, `char_hidden_dim`, `char_num_layers`, `dropout` to the class constructor. Save it to `self.char_model`.

Create another [`nn.Linear`](https://pytorch.org/docs/stable/nn.html?highlight=linear#torch.nn.Linear) layer to transform the character representations. It should work similarly to the linear layer that transforms the pretrained vectors. The input size should be `char_hidden_dim` and the output size is `transformed_dim`. Also, set the `bias` parameter to `False`.

Add the `transformed_dim` to the `input_size`.
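
As a quick sanity check of the resulting size, the three inputs to the main LSTM can be added up by hand. The sketch below uses the values of `word_emb_dim` and `transformed_dim` from the hyperparameter cell further down in the notebook; if you change them, the total changes accordingly.

```python
word_emb_dim = 75
transformed_dim = 125

input_size = 0
input_size += word_emb_dim     # trainable word embeddings
input_size += transformed_dim  # transformed character representation
input_size += transformed_dim  # transformed pretrained vectors
print(input_size)  # 325
```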

### Task 2.6. Get the character embeddings in the forward pass (0.5 points)

Perform the forward pass on your CharLSTM and save the results into the `char_emb` variable. Pass all the appropriate parameters to the CharLSTM.

Apply the dropout to the `char_emb`. After that transform it with the `self.trans_char` linear layer. 

**NB!** Since the output of the CharLSTM is a `PackedSequence` object, you need to apply the dropout to `char_emb.data`.

In [None]:
class Tagger(nn.Module):
    def __init__(self, vocab: Dict[str, BaseVocab], word_emb_dim: int,
                 char_emb_dim: int, transformed_dim: int, emb_matrix: np.ndarray,
                 hidden_dim: int, char_hidden_dim: int,
                 upos_clf_hidden_dim: int, num_layers: int, char_num_layers: int,
                 dropout: float):
        super().__init__()

        self.vocab = vocab
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim

        input_size = 0

        self.word_emb = nn.Embedding(len(vocab['word']), word_emb_dim, padding_idx=0)
        input_size += word_emb_dim

        ### Task 2.5 starts here ###
        self.char_model = ...
        self.trans_char = ...
        input_size += transformed_dim
        ### Task 2.5 ends here ###

        self.pretrained_emb = nn.Embedding.from_pretrained(torch.from_numpy(emb_matrix), freeze=True)
        self.trans_pretrained = nn.Linear(emb_matrix.shape[1], transformed_dim, bias=False)
        input_size += transformed_dim

        self.lstm = nn.LSTM(input_size, hidden_dim, num_layers, batch_first=True, dropout=dropout, bidirectional=True)
        self.lstm_h_init = nn.Parameter(torch.zeros(2 * num_layers, 1, hidden_dim))
        self.lstm_c_init = nn.Parameter(torch.zeros(2 * num_layers, 1, hidden_dim))

        self.upos_hid = nn.Linear(2* hidden_dim, upos_clf_hidden_dim)
        self.upos_clf = nn.Linear(upos_clf_hidden_dim, len(vocab['upos']))

        self.crit = nn.CrossEntropyLoss(ignore_index=0)

        self.drop = nn.Dropout(dropout)

    
    def forward(self, sent_pad, chars_pad, upos_pad, pretrained_pad, sent_lens, word_lens):
        inputs = []

        word_emb = self.word_emb(sent_pad)
        inputs += [word_emb]

        pretrained_emb = self.pretrained_emb(pretrained_pad)
        pretrained_emb = self.trans_pretrained(pretrained_emb)
        inputs += [pretrained_emb]

        ### Task 2.6 starts here ###
        char_emb = ...
        char_emb_trans = ...
        ### Task 2.6 ends here ###
        # Creating a PackedSequence from the embeddings to restore the original order
        char_emb = PackedSequence(char_emb_trans, char_emb.batch_sizes, 
                                  char_emb.sorted_indices, char_emb.unsorted_indices)
        char_emb = pad_packed_sequence(char_emb, batch_first=True)[0]
        inputs += [char_emb]

        lstm_inputs = torch.cat([x for x in inputs], 2)
        lstm_inputs = self.drop(lstm_inputs)
        lstm_inputs = pack_padded_sequence(lstm_inputs, sent_lens, batch_first=True, enforce_sorted=False)

        lstm_outputs, _ = self.lstm(
            lstm_inputs, 
            (self.lstm_h_init.expand(2 * self.num_layers, sent_pad.size(0), self.hidden_dim).contiguous(), 
             self.lstm_c_init.expand(2 * self.num_layers, sent_pad.size(0), self.hidden_dim).contiguous())
        )
        lstm_outputs = lstm_outputs.data

        upos_hid = F.relu(self.upos_hid(self.drop(lstm_outputs)))
        upos_pred = self.upos_clf(self.drop(upos_hid))

        pred = PackedSequence(upos_pred, lstm_inputs.batch_sizes,
                              lstm_inputs.sorted_indices, lstm_inputs.unsorted_indices)
        pred = pad_packed_sequence(pred, batch_first=True)[0]
        pred = pred.max(2)[1]
        upos = pack_padded_sequence(upos_pad, sent_lens, batch_first=True, enforce_sorted=False).data
        loss = self.crit(upos_pred, upos)

        return loss, pred

In [None]:
class Trainer:
    def __init__(self, vocab, word_emb_dim, char_emb_dim, transformed_dim,
                 emb_matrix, hidden_dim, char_hidden_dim, upos_clf_hidden_dim, 
                 num_layers, char_num_layers, dropout):
        self.vocab = vocab
        self.model = Tagger(vocab, word_emb_dim, char_emb_dim, transformed_dim, 
                            emb_matrix, hidden_dim, char_hidden_dim, 
                            upos_clf_hidden_dim, num_layers, char_num_layers, 
                            dropout)
        self.parameters = [p for p in self.model.parameters() if p.requires_grad]

        self.model.to(device)

        self.optimizer = torch.optim.Adam(self.parameters)
 
    def update(self, batch, eval=False):
        sent_pad, chars_pad, upos_pad, pretrained_pad, sent_lens, word_lens = batch

        if eval:
            self.model.eval()
        else:
            self.model.train()
            self.optimizer.zero_grad()

        loss, _ = self.model(sent_pad, chars_pad, upos_pad, pretrained_pad, sent_lens, word_lens)
        loss_val = loss.data.item()
        if eval:
            return loss_val

        loss.backward()
        self.optimizer.step()

        return loss_val


    def predict(self, batch):
        sent_pad, chars_pad, upos_pad, pretrained_pad, sent_lens, word_lens = batch

        self.model.eval()
        batch_size = sent_pad.size(0)
        _, pred = self.model(sent_pad, chars_pad, upos_pad, pretrained_pad, sent_lens, word_lens)
        # Transform the indices to the pos tags
        pred = [self.vocab['upos'].unmap(sent) for sent in pred.tolist()]
        # Trim the predictions to their original lengths
        pred_tokens = [[pred[i][j] for j in range(sent_lens[i])] for i in range(batch_size)]

        gold_upos = [self.vocab['upos'].unmap(upos) for upos in upos_pad.tolist()]
        gold_tokens = [[gold_upos[i][j] for j in range(sent_lens[i])] for i in range(batch_size)]

        return pred_tokens, gold_tokens

In [None]:
word_emb_dim = 75
char_emb_dim = 100
transformed_dim = 125
emb_matrix = pretrain.emb
hidden_dim = 200
char_hidden_dim = 400
upos_clf_hidden_dim = 400
num_layers = 2
char_num_layers = 1
dropout = 0.5

In [None]:
trainer = Trainer(vocab, word_emb_dim, char_emb_dim, transformed_dim,
                  emb_matrix, hidden_dim, char_hidden_dim, upos_clf_hidden_dim,
                  num_layers, char_num_layers, dropout)

In [None]:
global_step = 0
max_steps = 50000
dev_score_history = []
format_str = '{}: step {}/{}, loss = {:.6f} ({:.3f} sec/batch)'
last_best_step = 0

log_step = 20
eval_interval = 100

## Task 3. Train your model (0.5 points)

Run the cell below to start training the model. Every 100 steps, it prints the average training loss and the dev score, which is simple accuracy in our case. You should see the training loss decreasing and the dev score increasing.
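
Accuracy here means the share of tokens whose predicted tag equals the gold tag, counted over all sentences. A minimal sketch with made-up tags (the function name `token_accuracy` is hypothetical):

```python
def token_accuracy(pred_sents, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(
        pred == gold
        for preds, golds in zip(pred_sents, gold_sents)
        for pred, gold in zip(preds, golds)
    )
    total = sum(len(golds) for golds in gold_sents)
    return correct / total


toy_preds = [['DET', 'NOUN', 'VERB'], ['PRON', 'VERB']]
toy_gold = [['DET', 'NOUN', 'VERB'], ['PRON', 'AUX']]
print(token_accuracy(toy_preds, toy_gold))  # 0.8
```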

Train the model until the dev score stops increasing. Report the best score that you got.
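
One possible way to decide when to stop is a patience rule: stop once the best dev score has not been beaten for a fixed number of evaluations. The sketch below is only an illustration of that rule; `should_stop` and the `patience` parameter are hypothetical and not part of the training cell, although a list such as `dev_score_history` could be passed to it.

```python
def should_stop(score_history, patience=5):
    """True when the best score is at least `patience` evaluations old."""
    if len(score_history) <= patience:
        return False
    # Index of the first occurrence of the best score;
    # a later tie does not count as an improvement.
    best_index = max(range(len(score_history)), key=score_history.__getitem__)
    return len(score_history) - 1 - best_index >= patience


toy_history = [0.80, 0.85, 0.90, 0.89, 0.90, 0.88]
print(should_stop(toy_history, patience=3))  # True
print(should_stop(toy_history, patience=5))  # False
```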

__My maximum dev_score is:__

<font color='red'>Your answer here</font>

In [None]:
train_loss = 0
while True:
    do_break = False
    for batch in train_loader:
        start_time = time()
        global_step += 1
        loss = trainer.update(batch, eval=False)
        train_loss += loss

        if global_step % log_step == 0:
            duration = time() - start_time
            print(format_str.format(datetime.now().strftime("%Y-%m-%d %H:%M:%S"), global_step,\
                    max_steps, loss, duration))
            
        if global_step % eval_interval == 0:
            print("Evaluating on dev set...")
            dev_preds = []
            dev_words = []
            dev_correct = 0
            dev_total = 0
            for batch in dev_loader:
                batch_size = batch[0].size(0)
                preds, gold = trainer.predict(batch)
                dev_correct += sum([1 for sent in zip(preds, gold) for pair in zip(*sent) if pair[0] == pair[1]])
                dev_total += sum([len(sent) for sent in gold])
                dev_preds += preds
                # Keep the original sentence
                pred_sents = [[batch[0][i][j] for j in range(batch[4][i])] for i in range(batch_size)]
                dev_words += [vocab['word'].unmap(sent) for sent in [sent for sent in pred_sents]]
            
            dev_score = dev_correct / dev_total
            train_loss = train_loss / eval_interval
            print("step {}: train_loss = {:.6f}, dev_score = {:.6f}".format(global_step, train_loss, dev_score))
            # Shows one prediction for a sanity check
            print(f"Preds: {list(zip(dev_preds[0], dev_words[0]))}")
            train_loss = 0

        if global_step >= max_steps:
            do_break = True
            break

    if do_break:
        break

## Task 4. Error Analysis (1 point)

Let's evaluate the model on the test set. First, you need to load the test data: create `test_doc`, `test_dataset` and `test_loader`. Then evaluate the trainer on the test set.

Create a confusion matrix and display it in a readable format. Once you have the confusion matrix, calculate the accuracy for each POS tag separately.

Lastly, look at the confusion matrix and the accuracy for each POS tag and describe what issues you can see.
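
To clarify what accuracy for a single tag means here, the toy sketch below assumes it is the share of gold tokens with that tag that were predicted correctly (the diagonal cell of the confusion matrix divided by its row sum). It uses plain counters and made-up tags rather than your model output, and the helper names are hypothetical; for the task itself you still need to build and display the matrix for your test predictions.

```python
from collections import Counter


def confusion_counts(gold_tags, pred_tags):
    """Count how often each (gold, predicted) tag pair occurs."""
    return Counter(zip(gold_tags, pred_tags))


def per_tag_accuracy(confusion):
    """For every gold tag: correctly predicted tokens / all tokens with that tag."""
    totals, correct = Counter(), Counter()
    for (gold, pred), count in confusion.items():
        totals[gold] += count
        if gold == pred:
            correct[gold] += count
    return {tag: correct[tag] / totals[tag] for tag in totals}


toy_gold = ['NOUN', 'NOUN', 'VERB', 'ADJ', 'NOUN', 'VERB']
toy_pred = ['NOUN', 'ADJ', 'VERB', 'NOUN', 'NOUN', 'VERB']
toy_confusion = confusion_counts(toy_gold, toy_pred)
toy_accuracy = per_tag_accuracy(toy_confusion)
print(toy_confusion[('NOUN', 'ADJ')])  # 1
print(toy_accuracy['VERB'], toy_accuracy['ADJ'])  # 1.0 0.0
```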



In [None]:
test_doc = ...
test_dataset = ...
test_loader = ...

In [None]:
test_preds = []
test_words = []
test_golds = []
test_correct = 0
test_total = 0
for batch in test_loader:
    batch_size = batch[0].size(0)
    preds, gold = trainer.predict(batch)
    test_golds += gold
    test_correct += sum([1 for sent in zip(preds, gold) for pair in zip(*sent) if pair[0] == pair[1]])
    test_total += sum([len(sent) for sent in gold])
    test_preds += preds
    pred_sents = [[batch[0][i][j] for j in range(batch[4][i])] for i in range(batch_size)]
    test_words += [vocab['word'].unmap(sent) for sent in [sent for sent in pred_sents]]
test_score = test_correct / test_total
print(test_score)

In [None]:
# Create the confusion matrix, you can use the sklearn confusion matrix 

from sklearn.metrics import confusion_matrix


In [None]:
# Calculate accuracy for each POS tag separately using confusion matrix


__What issues can you see in the confusion matrix and the calculated accuracies?__

<font color='red'>Your answer here</font>

## Task 5. Propose a new approach (2 points)

By now, you should have an idea of how sequence tagging is done. Now you will propose an approach for a new problem.

Imagine that you need to add morphological tagging to your model. The tags are stored in the [_feats_ field](https://universaldependencies.org/format.html#morphological-annotation) of the CoNLL-U format. According to the official description, they have the following format:

> The FEATS field contains a list of morphological features, with vertical bar (|) as list separator and with underscore to represent the empty list. All features should be represented as attribute-value pairs, with an equals sign (=) separating the attribute from the value.

Your task is to describe how you will modify the current POS tagger model to include FEATS tagger as well. Your description must answer the following questions:

- Should you add a new vocab to read the feats? How will you build this vocab?
- POS tagging can be treated as a multi-class classification task, i.e. we assign each word exactly one POS tag (which is a class in our situation). What kind of task is morphological tagging (multi-label, binary classification, multiple multi-class tasks)?
- For example, nouns have case and gender attributes but verbs don't have them. How are you going to tackle this?
- Which new layers are you going to introduce into the model?
- What metrics are you going to use to measure the performance of morphological tagging?

You can also try to visualize the model to make it easier to see your concept.
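
To make the FEATS format quoted above concrete, here is a small sketch that splits a FEATS string into attribute-value pairs. The helper name `parse_feats` and the example strings are assumptions made for illustration; they are not taken from the Lab code.

```python
def parse_feats(feats_field):
    """Turn a CoNLL-U FEATS string into an attribute -> value dict."""
    if feats_field == "_":  # underscore represents the empty feature list
        return {}
    return dict(pair.split("=", 1) for pair in feats_field.split("|"))


noun_feats = parse_feats("Case=Nom|Gender=Fem|Number=Sing")
verb_feats = parse_feats("Mood=Ind|Tense=Past|VerbForm=Fin")
empty_feats = parse_feats("_")

print(noun_feats)
print(verb_feats.get("Case"))  # the attribute is simply absent for this token
print(empty_feats)
```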

<font color='red'>Your answer here</font>

### Bonus. Hidden Markov Model for POS tagging (1 point)

A Hidden Markov Model (HMM) is a probabilistic model. For POS tagging, it takes a sequence of words as input, computes the possible sequences of POS tags for it and then chooses the best tag sequence.

The POS tags cannot be observed directly (they are hidden), we only see the words and infer the tags from the sequence. 

HMM is defined with the following components: 

$Q = q_1q_2...q_N$ - a set of N states <br>
$A = a_{11}...a_{ij}...a_{NN}$ - transition probability matrix; each $a_{ij}$ represents the probability of moving from state i to state j <br>
$O=o_1o_2...o_T$ - a sequence of T observations <br>
$B=b_i(o_t)$ - a sequence of observation likelihoods (emission probabilities); each expresses the probability of an observation $o_t$ being generated from a state $q_i$ <br>
$\pi=\pi_1,\pi_2,...,\pi_N$ - initial probability distribution over states; $\pi_i$ is the probability that the Markov chain will start in state i.

We also need to know the two simplifying assumptions:

1) The probability of a particular state depends only on the previous state: <br>
 Markov Assumption: $$P(q_i|q_1,...q_{i-1})=P(q_i|q_{i-1})$$

2) The probability of an output observation $o_i$ depends only on the state that produced the observation $q_i$ and not on any other states or any other observations: <br>
Output Independence: $$P(o_i|q_1,...q_i,...q_T,o_1,...o_i,...o_T)=P(o_i|q_i)$$
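
Taken together, the two assumptions let the joint probability of a word sequence and a tag sequence factor into a product of one initial probability, the transition probabilities and the emission probabilities. The sketch below shows this on a toy HMM with two states. All numbers and the names `initial`, `transition`, `emission` and `joint_probability` are invented for illustration (the emission rows do not sum to 1 because only two words of a larger vocabulary are listed).

```python
# Toy HMM with two states; every probability here is made up.
initial = {"NOUN": 0.6, "VERB": 0.4}
transition = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emission = {
    "NOUN": {"dogs": 0.5, "bark": 0.1},
    "VERB": {"dogs": 0.1, "bark": 0.6},
}


def joint_probability(words, tags):
    """P(words, tags) under the Markov and output independence assumptions."""
    prob = initial[tags[0]] * emission[tags[0]].get(words[0], 0.0)
    for prev_tag, tag, word in zip(tags, tags[1:], words[1:]):
        # Each step depends only on the previous tag and the current tag.
        prob *= transition[prev_tag][tag] * emission[tag].get(word, 0.0)
    return prob


p_noun_verb = joint_probability(["dogs", "bark"], ["NOUN", "VERB"])
p_verb_noun = joint_probability(["dogs", "bark"], ["VERB", "NOUN"])
print(p_noun_verb, p_verb_noun)
```

For this toy model, the tagging NOUN VERB is far more probable than VERB NOUN, which is the kind of comparison the decoder has to make over all possible tag sequences.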


HMM itself has two components: transition probabilities and emission probabilities. 

We can calculate the maximum likelihood estimate of a transition probability by counting: out of all the times we see the first tag in a labeled corpus, how often is it followed by the second tag? <br>

$$P(t_i|t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$


We can calculate the emission probabilities by counting how many times we see the tag in the corpus and how many times this tag is assigned to a specific word:

$$P(w_i|t_i) = \frac{C(t_i, w_i)}{C(t_i)}  $$
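
As a worked example of the two count-based formulas, the sketch below computes one transition and one emission probability on a tiny hand-made corpus. The corpus and all variable names are hypothetical. The sketch counts tag bigrams within each sentence and uses the plain tag count as the denominator, exactly as the formulas are written.

```python
from collections import Counter

# Tiny hand-made tagged corpus (hypothetical data).
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("barks", "VERB")],
]

tag_counts = Counter()       # C(t)
bigram_counts = Counter()    # C(t_{i-1}, t_i)
tag_word_counts = Counter()  # C(t_i, w_i)

for sentence in toy_corpus:
    tags = [tag for _, tag in sentence]
    tag_counts.update(tags)
    bigram_counts.update(zip(tags, tags[1:]))
    tag_word_counts.update((tag, word) for word, tag in sentence)

# P(NOUN | DET) = C(DET, NOUN) / C(DET)
p_noun_given_det = bigram_counts[("DET", "NOUN")] / tag_counts["DET"]
# P(dog | NOUN) = C(NOUN, dog) / C(NOUN)
p_dog_given_noun = tag_word_counts[("NOUN", "dog")] / tag_counts["NOUN"]

print(p_noun_given_det, p_dog_given_noun)
```

Here `DET` occurs three times and is followed by `NOUN` twice, and `NOUN` occurs three times, twice with the word "dog", so both estimates are 2/3.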

You can read more about HMMs in the [book](https://web.stanford.edu/~jurafsky/slp3/8.pdf) Speech and Language Processing by Daniel Jurafsky & James H. Martin.

In [None]:
!pip install conllu

import random
from pathlib import Path

import nltk
import numpy as np
import pandas as pd
from conllu import parse

Read in the UD data that you have already downloaded earlier. Replace the path to train and test dataset with your own downloaded splits. 
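
For orientation, the cell below builds a list of sentences, where each sentence is a list of `(form, upos)` pairs. The following sketch shows the same structure on a two-token sample without using the `conllu` library. The reader function `read_form_upos` is a hypothetical, simplified stand-in: it assumes tab-separated columns in the standard CoNLL-U order and does not treat multiword tokens or empty nodes specially.

```python
# Two-token CoNLL-U sample; the ten columns are tab-separated:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\tNumber=Plur\t0\troot\t_\t_\n"
    "\n"
)


def read_form_upos(conllu_text):
    """Collect (FORM, UPOS) pairs for every sentence in a CoNLL-U string."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():  # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):  # skip comment lines
            columns = line.split("\t")
            current.append((columns[1], columns[3]))
    if current:  # last sentence if the text has no trailing blank line
        sentences.append(current)
    return sentences


toy_set = read_form_upos(sample)
print(toy_set)
```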

In [None]:
train_path = Path('data') / ... / ...
test_path = Path('data') / ... / ... 

train_data = train_path.read_text(encoding='utf-8')
train_sentences = parse(train_data)
train_set = [[(token.get('form'), token.get('upos')) for token in sentence]
             for sentence in train_sentences]

test_data = test_path.read_text(encoding='utf-8')
test_sentences = parse(test_data)
test_set = [[(token.get('form'), token.get('upos')) for token in sentence]
            for sentence in test_sentences]


In [None]:
# Flatten the sentences into one list of (word, tag) pairs
train_tagged = [pair for sent in train_set for pair in sent]
test_tagged = [pair for sent in test_set for pair in sent]

In [None]:
pos_tags = list({tag for _,tag in train_tagged})
print(pos_tags)

Now, we have the tagged sentences, the POS tags and vocabulary. 
Next, we need to calculate emission and transition probabilities. Use the formulas given above. 

In [None]:
def emission_probability(word, tag, train=train_tagged):
    c_ti = ... # count how many times this tag occurs in the dataset
    c_tiwi = ... # count how many times this tag has been associated with the given word
    p_witi = ... # calculate the emission probability 
    return p_witi

In [None]:
def transition_probability(tag1, tag2, train = train_tagged):
    tags = ... # collect all the pos tags from the train
    c_ti = ... # count all the times we see tag1 in the corpus 
    c_ti_ti1 = 0
    for index in range(len(tags)-1):
        if ...:  # count the times we see tag1 before tag2
            c_ti_ti1  += 1
    p_titi1 = ... # calculate the transition_probability
    return p_titi1

Now, let's create a transition matrix: 

In [None]:
transition_matrix = np.zeros((len(pos_tags), len(pos_tags)), dtype='float32')
for i, tag1 in enumerate(list(pos_tags)):
    for j, tag2 in enumerate(list(pos_tags)): 
        transition_matrix[i, j] = ... # assign the correct transition probability (hint: call the transition_probability function)
 
df_tags = pd.DataFrame(transition_matrix, columns=list(pos_tags), index=list(pos_tags))
display(df_tags)

To decode the sequence from the HMM, we can use the Viterbi algorithm, which is a dynamic programming algorithm. For POS tagging, we need to find the most probable tag sequence given the observation sequence of n words. 

There are two assumptions made while decoding tag sequence: 

1) The probability of a word appearing depends only on its own tag and is independent of neighboring words and tags. 

2) The probability of a tag depends only on the previous tag. 

<img src="https://miro.medium.com/max/540/1*8-5KZVj-_jZOWN83gGhD5A.png" >
Image from Medium post *POS Tagging using Hidden Markov Models (HMM) & Viterbi algorithm in NLP mathematics explained* by Mehul Gupta

First, we need to define a probability matrix called the lattice (let's call it V). Each row represents a POS tag (hidden state) and each column represents a word (observation). Look at the image above to get an idea of how this matrix looks. We fill in the matrix cell by cell: the cell in row j and column t, $v_t(j)$, holds the probability that the HMM is in state j after seeing the first t observations and passing through the most probable state sequence. It can be calculated as: 
$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i) \, a_{ij} \, b_j(o_t)$$
Here, $v_{t-1}(i)$ is the Viterbi path probability from the previous time step, $a_{ij}$ is the transition probability from the previous state $q_i$ to the current state $q_j$, and $b_j(o_t)$ is the state observation likelihood of the observation symbol $o_t$ given the current state j. 
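
The recurrence can be sketched on a toy HMM as follows. This is a minimal illustration under stated assumptions, not the function you are asked to complete: the dictionaries `initial`, `transition` and `emission` and all their numbers are invented, and the sketch stores a backpointer for every cell so that the best path can be recovered at the end.

```python
def viterbi_decode(words, states, initial, transition, emission):
    """Return the most probable state sequence for `words`."""
    # lattice[t][s] = probability of the best path ending in state s at step t
    lattice = [{s: initial[s] * emission[s].get(words[0], 0.0) for s in states}]
    backpointers = [{}]
    for t in range(1, len(words)):
        lattice.append({})
        backpointers.append({})
        for s in states:
            # Best previous state i maximises v_{t-1}(i) * a_{ij}
            best_prev = max(
                states, key=lambda p: lattice[t - 1][p] * transition[p][s]
            )
            lattice[t][s] = (
                lattice[t - 1][best_prev]
                * transition[best_prev][s]
                * emission[s].get(words[t], 0.0)
            )
            backpointers[t][s] = best_prev
    # Start from the best final state and follow the backpointers.
    path = [max(states, key=lambda s: lattice[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(backpointers[t][path[-1]])
    return path[::-1]


# Toy model; every probability here is made up.
states = ["NOUN", "VERB"]
initial = {"NOUN": 0.6, "VERB": 0.4}
transition = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emission = {
    "NOUN": {"dogs": 0.5, "bark": 0.1},
    "VERB": {"dogs": 0.1, "bark": 0.6},
}

best_path = viterbi_decode(["dogs", "bark"], states, initial, transition, emission)
print(best_path)
```

On long sentences the products become very small, so implementations often sum log probabilities instead of multiplying raw probabilities.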


Your task is to fill in the gaps in the function below. First, take the transition probability for the previous tag and the current tag from the transition matrix (`df_tags`). You can access the previous tag from the `states` list. 

Next, calculate the emission probability given the word and the tag. 

After the probabilities of all POS tags for the word have been calculated, we select the maximum from the `probabilities` list and append the corresponding state (POS tag) to the `states` list. 

In [None]:
def Viterbi(words, tags, transition_matrix=df_tags):
    states = [] # we are saving the best states only 
    for key, word in enumerate(words): # iterating over the observations
        probabilities = [] # saving the probabilities 
        for tag in tags: # iterating over the states (we are filling one column)
            if key == 0: # the first word has no previous tag, so 'PUNCT' is used as the previous tag
                transition_prob = transition_matrix.loc['PUNCT', tag]
            else:
                transition_prob = ... 
            
            emission_prob = ...
            state_prob = emission_prob * transition_prob 
            probabilities.append(state_prob)
             
        pmax = ... # take the maximum value from the probabilities list
        state_max = ... # get the state with the maximum probability (hint: get the index where this maximum probability was in the probabilities list and use it to get the tag from the tags list)
        states.append(state_max)
    return list(zip(words, states))

Let's test the algorithm. We choose 15 random sentences from the test set and calculate the accuracy on this sample. If you want, you can choose a bigger sample. 

In [None]:
random_set = [random.randrange(len(test_set)) for _ in range(15)]  # indices start at 0
 
test = [test_set[i] for i in random_set]
test_tags = [tup[1] for sent in test for tup in sent]
test_words = [tup[0] for sent in test for tup in sent]
tagged_seq  = Viterbi(test_words, list(pos_tags))

check = [pred  for pred, gold in zip(tagged_seq, test_tags) if pred[1] == gold] 
 
accuracy = len(check)/len(tagged_seq)
print('Accuracy for a small subset: ',accuracy*100)