## Loading and preprocessing the data

First, we load the raw data using the starter code.

In [4]:
def load_data(path, lowercase=True):
    sents = []
    tags = []
    with open(path, 'r', encoding="utf8") as f:
        for line in f.read().splitlines():
            sent = []
            tag = []
            for pair in line.split('####')[1].split(' '):
                tn, tg = pair.rsplit('=', 1)
                if lowercase:
                    sent.append(tn.lower())
                else:
                    sent.append(tn)
                tag.append(tg)
            sents.append(sent)
            tags.append(tag)
    return sents, tags

In [5]:
train_sents, train_tags = load_data("data/twitter1_train.txt")
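
To make the file format concrete, here is a minimal sketch of how a single line is parsed. The sample line is made up for illustration (the real file contents may differ); it only assumes the `raw text####token=TAG token=TAG ...` layout implied by `load_data`.

```python
# Hypothetical line in the layout load_data expects: raw text, '####', then token=TAG pairs.
line = "I love Python####I=O love=O Python=T-POS"

pairs = line.split('####')[1].split(' ')
# rsplit('=', 1) splits on the LAST '=', so tokens that themselves contain '=' stay intact.
tokens, tags = zip(*(pair.rsplit('=', 1) for pair in pairs))
tokens = [t.lower() for t in tokens]

print(tokens)  # ['i', 'love', 'python']
print(tags)    # ('O', 'O', 'T-POS')
```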

After loading the raw tweets and tags, we have to clean the data and do a few preprocessing steps before we can start training. PyTorch's embedding layer looks tokens up by non-negative integer index, so in this step we build a dictionary that maps every token in the training set to an integer, reserving index 0 for padding. We then use it to convert raw sequences of tokens into sequences of integers.

In [6]:
def get_vocab_idx(train):
    tokens = set()
    for sent in train:
        tokens.update(sent)
    tokens = sorted(list(tokens))
    vocab2idx = dict(zip(tokens, range(1, len(tokens)+1)))
    vocab2idx["<PAD>"] = 0
    return vocab2idx

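def convert_to_idx(sents, word2idx):
    # Replaces every item with its index IN PLACE, so call this only once per list.
    # Raises KeyError for any item that is missing from word2idx.
    for sent in sents:
        for i in range(len(sent)):
            sent[i] = word2idx[sent[i]]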

In [11]:
vocab2idx = get_vocab_idx(train_sents)
convert_to_idx(train_sents, vocab2idx)

After conversion, `train_sents` holds lists of integers (output truncated):

[[4739, 2469, 9381, 9352, 6282, 6918, 7228, 10510, 9514, 6272, 7781, 4089, 7890, 615, 10767, 861, 239],
 [1067, 483, 10772, 6189, 10476, 9875, 853],
 [9535, 7225, 2447, 7117, 8684, 844, 1695, 7274, 10586, 8125, 7437, 452, 4852],
 ...
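
The vocabulary only covers tokens seen in training, so looking up a token from development or test data can raise a `KeyError`. One common workaround, sketched below, reserves an extra index for unknown words. The `<UNK>` entry and the `to_indices` helper are assumptions for illustration, not part of the starter code.

```python
def to_indices(sent, word2idx, unk_token="<UNK>"):
    """Return a new list of indices, mapping unseen tokens to the <UNK> index."""
    unk_idx = word2idx[unk_token]
    return [word2idx.get(tok, unk_idx) for tok in sent]

# Toy vocabulary built the same way as get_vocab_idx, plus a hypothetical <UNK> slot.
toy_tokens = sorted({"good", "movie", "the"})
toy_vocab = dict(zip(toy_tokens, range(1, len(toy_tokens) + 1)))
toy_vocab["<PAD>"] = 0
toy_vocab["<UNK>"] = len(toy_vocab)

print(to_indices(["the", "bad", "movie"], toy_vocab))  # [3, 4, 2]
```

If you add such an entry, the embedding layer still needs `len(vocab)` rows so that the extra index is valid.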

We also need to convert the tags into integers that can then be fed to PyTorch's categorical cross-entropy loss function.

In [9]:
tag2idx = {"<PAD>": 0, "O": 1, "T-NEG": 2, "T-NEU": 3, "T-POS": 4}
convert_to_idx(train_tags, tag2idx)

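Note: `convert_to_idx` modifies its input in place, so this cell should be run only once. Running it again on tags that have already been converted raises `KeyError: 1`, because the integer `1` is not a key of `tag2idx`.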
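
One way to make the conversion safe to re-run is a non-mutating variant that returns new lists. The sketch below uses the hypothetical name `converted` and toy tag sequences rather than the real dataset.

```python
def converted(seqs, item2idx):
    """Return index sequences as new lists instead of overwriting `seqs`."""
    return [[item2idx[item] for item in seq] for seq in seqs]

tag2idx = {"<PAD>": 0, "O": 1, "T-NEG": 2, "T-NEU": 3, "T-POS": 4}
raw_tags = [["O", "T-POS"], ["T-NEG", "O", "O"]]  # toy data, not from the dataset

tag_ids = converted(raw_tags, tag2idx)
tag_ids_again = converted(raw_tags, tag2idx)  # safe: raw_tags still holds strings

print(tag_ids)  # [[1, 4], [2, 1, 1]]
```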

## Building the model

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F

Every computational unit in PyTorch, from a single layer to an entire model, inherits from the base class `nn.Module`. To build our own model, we define a class that inherits from `nn.Module` as well. Besides the constructor, we need to implement the `forward()` method, which takes in the data and returns the output predictions. It is called implicitly when we feed data to an instance of our model.

WARNING: the example below is only meant to help you understand how to use PyTorch, and is NOT what you will be doing for this homework! It predicts the current label based only on the contextualized representation vector of the current token. In your model, you will be predicting the current label based on the previous label as well as the vector representation of the current token.

In [7]:
class EncodeSentence(nn.Module):
    def __init__(self, word_dim, num_words, num_tags):
        super(EncodeSentence, self).__init__()
        self.word_emb = nn.Embedding(num_words, word_dim, padding_idx=0)
        # Hidden size is word_dim//2 per direction; the bidirectional LSTM concatenates
        # both directions, so each token's output vector has word_dim entries
        self.contextualizer = nn.LSTM(word_dim, word_dim//2, batch_first=True, bidirectional=True)
        self.output = nn.Linear(word_dim, num_tags)
        
    def forward(self, sent):
        sent_embed = self.word_emb(sent)
        sent_embed, _ = self.contextualizer(sent_embed)
        output_scores = self.output(sent_embed)
        # Scores have shape (batch, length, num_tags): normalize over the tag dimension
        return F.log_softmax(output_scores, dim=-1)

A regular PyTorch tensor must be rectangular, so if we want to process our input data in batches (with sentences in a batch having different lengths), we have to either pad or trim all sentences to the same size. In this example, I'm taking a batch of 10 sentences and padding them so that they all have the same length as the longest sentence in the batch.

In [8]:
def pad_sents(sents, pad_idx=0):
    padded_sents = []
    maxlen = max([len(sent) for sent in sents])
    for sent in sents:
        padded_sent = sent.copy()
        padded_sent.extend([pad_idx]*(maxlen-len(sent)))
        padded_sents.append(padded_sent)
    return padded_sents

train_batch = pad_sents(train_sents[:10])
train_label = pad_sents(train_tags[:10])
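
Padded positions are not real tokens, so it is often useful to remember where they are. The sketch below uses a hypothetical helper, `pad_mask`, that records each sentence's true length and builds a 0/1 mask matching the padded batch.

```python
def pad_mask(sents):
    """Return (lengths, mask) where mask[i][j] is 1 for real tokens and 0 for padding."""
    lengths = [len(sent) for sent in sents]
    maxlen = max(lengths)
    mask = [[1] * n + [0] * (maxlen - n) for n in lengths]
    return lengths, mask

toy_batch = [[5, 2, 9], [7], [4, 4]]  # toy index sequences of different lengths
lengths, mask = pad_mask(toy_batch)

print(lengths)  # [3, 1, 2]
print(mask)     # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```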

Now we create an instance of our model and feed our batch to it. For every token of every sentence in the batch, we get a prediction score for each label type.

In [9]:
model = EncodeSentence(256, len(vocab2idx), len(tag2idx))
scores = model(torch.tensor(train_batch))
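
A quick shape check helps confirm what such a model returns. The sketch below uses random toy indices and a bare embedding-plus-linear stand-in (an assumption, to keep it self-contained) rather than the model above. The point is only the `(batch, length, num_tags)` layout and how `argmax` over the last dimension yields one predicted tag index per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size, num_tags = 2, 5, 20, 5

# Stand-in for the model: an embedding followed by a linear scoring layer.
emb = nn.Embedding(vocab_size, 8, padding_idx=0)
out = nn.Linear(8, num_tags)

toy_batch = torch.randint(1, vocab_size, (batch_size, seq_len))
toy_scores = F.log_softmax(out(emb(toy_batch)), dim=-1)

print(toy_scores.shape)                # torch.Size([2, 5, 5])
predicted = toy_scores.argmax(dim=-1)  # most likely tag index for every token
print(predicted.shape)                 # torch.Size([2, 5])
```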

## Training the model

To optimize our model parameters, we need to create an instance of one of the optimizers provided by PyTorch. Here we're using Adam. `model.parameters()` returns all the parameters of our model, which our optimizer instance will update. If you have more than one model and you want to optimize all of them end-to-end, do not forget to pass the parameters of all of them to the same optimizer!

In [15]:
import torch.optim as optim
optimizer = optim.Adam(list(model.parameters()))
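
As a minimal sketch of the multi-model case, the parameter iterators of several modules can be chained into one list. The two `nn.Linear` layers here are toy stand-ins, and the names `encoder`, `decoder`, and `joint_optimizer` are hypothetical.

```python
import itertools
import torch.nn as nn
import torch.optim as optim

# Two toy modules standing in for separately defined parts of a larger model.
encoder = nn.Linear(4, 3)
decoder = nn.Linear(3, 2)

# Chain both parameter iterators so a single optimizer updates everything.
all_params = list(itertools.chain(encoder.parameters(), decoder.parameters()))
joint_optimizer = optim.Adam(all_params)

print(len(all_params))  # 4: a weight and a bias for each layer
```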

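`loss_sum` accumulates the loss (not the gradient) over all time steps up to `maxlen`, which is the length of the longest sequence in the training batch. We loop over time steps here because `F.nll_loss()` expects the class dimension to come second, whereas our scores have shape `(batch, maxlen, num_tags)`. `F.nll_loss()` also accepts inputs of shape `(batch, num_tags, maxlen)`, so if you transpose the last two dimensions of the scores you can calculate the loss of the entire sequence in one call; feel free to do that instead! Note that padded positions (tag index 0) count towards this loss unless you exclude them, for example with `ignore_index=0`.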

In [13]:
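maxlen = scores.size(1)
labels = torch.tensor(train_label)  # build the label tensor once, outside the loop
loss_sum = torch.tensor(0.)
for i in range(maxlen):
    # scores[:, i, :] has shape (batch, num_tags); labels[:, i] has shape (batch,)
    loss_sum = loss_sum + F.nll_loss(scores[:, i, :], labels[:, i])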
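
For comparison, here is a sketch of computing the loss for the whole sequence in a single `F.nll_loss()` call, using random toy tensors rather than the real batch. It relies on `F.nll_loss()` accepting input of shape `(batch, num_tags, length)` with targets of shape `(batch, length)`. The only difference from the loop is the averaging: the single call averages over all `batch * length` positions, while the loop sums per-step batch averages.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, num_tags = 3, 4, 5

# Toy log-probabilities and labels in the (batch, length, num_tags) layout.
toy_scores = F.log_softmax(torch.randn(batch_size, seq_len, num_tags), dim=-1)
toy_labels = torch.randint(1, num_tags, (batch_size, seq_len))

# Per-time-step loop: sum of the batch-averaged loss at each step.
loop_loss = sum(F.nll_loss(toy_scores[:, i, :], toy_labels[:, i]) for i in range(seq_len))

# Single call: move the tag dimension to position 1 -> (batch, num_tags, length).
one_call = F.nll_loss(toy_scores.transpose(1, 2), toy_labels)

# Same quantity up to the factor seq_len introduced by the different averaging.
print(torch.allclose(loop_loss, one_call * seq_len))  # True

# ignore_index=0 leaves padded positions (tag index 0) out of the average.
masked = F.nll_loss(toy_scores.transpose(1, 2), toy_labels, ignore_index=0)
```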

After we've accumulated the loss at all time steps, we backpropagate it through the model to calculate the gradients of all parameters, and call `optimizer.step()` to perform a single parameter update throughout the model.

In [16]:
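optimizer.zero_grad()  # clear gradients left over from any earlier backward pass
loss_sum.backward()    # compute gradients of the loss w.r.t. all parameters
optimizer.step()       # apply one Adam update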
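
In practice these steps are repeated many times. The following is a minimal sketch of such a loop on a toy stand-in model and random data; all names and sizes are assumptions for illustration. It uses the transposed single-call form of `F.nll_loss()`, and it calls `zero_grad()` at the start of every step because PyTorch otherwise adds new gradients to the old ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

# Toy stand-in model and data; the sizes are arbitrary.
toy_model = nn.Sequential(nn.Embedding(20, 8, padding_idx=0), nn.Linear(8, 5))
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.01)
toy_batch = torch.randint(1, 20, (4, 6))
toy_labels = torch.randint(1, 5, (4, 6))

losses = []
for step in range(20):
    toy_optimizer.zero_grad()   # clear gradients from the previous step
    log_probs = F.log_softmax(toy_model(toy_batch), dim=-1)
    loss = F.nll_loss(log_probs.transpose(1, 2), toy_labels)
    loss.backward()             # compute gradients
    toy_optimizer.step()        # update parameters
    losses.append(loss.item())

# The loss should go down when repeatedly fitting the same tiny batch.
print(losses[0], losses[-1])
```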