## CS 224N Lecture 3: Word Window Classification

### Pytorch Exploration

### Author: Matthew Lamm


David note: Going to work through what is happening here

In [2]:
import pprint
import torch
import torch.nn as nn
pp = pprint.PrettyPrinter()

## Our Data

The task at hand is to assign a label of 1 to words in a sentence that correspond with a LOCATION, and a label of 0 to everything else. 

In this simplified example, we only ever see spans of length 1.

In [3]:
# list comprehension to take a list and convert elements to lowercase & split (default is space)
train_sents = [s.lower().split() for s in ["we 'll always have Paris",
                                           "I live in Germany",
                                           "He comes from Denmark",
                                           "The capital of Denmark is Copenhagen"]]

# labels for training - note we mark '1' for any locations
train_labels = [[0, 0, 0, 0, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 1, 0, 1]]

# confirm that each training sentence has exactly one label per token

assert all([len(train_sents[i]) == len(train_labels[i]) for i in range(len(train_sents))])


In [4]:
# do the same for test
test_sents = [s.lower().split() for s in ["She comes from Paris"]]
test_labels = [[0, 0, 0, 1]]

assert all([len(test_sents[i]) == len(test_labels[i]) for i in range(len(test_sents))])

## Creating a dataset of batched tensors.

PyTorch (like other deep learning frameworks) is optimized to work on __tensors__, which can be thought of as a generalization of vectors and matrices with arbitrarily large rank.
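To make "rank" concrete, here is a minimal sketch that builds tensors of rank 0 through 3 and prints their number of dimensions and shape. The variable names are only for illustration and are not used elsewhere in the notebook.

```python
import torch

# Tensors of increasing rank (number of dimensions)
scalar = torch.tensor(3.0)               # rank 0: a single number
vector = torch.tensor([1.0, 2.0, 3.0])   # rank 1: a vector
matrix = torch.zeros(2, 3)               # rank 2: a matrix
batch = torch.zeros(4, 2, 3)             # rank 3: e.g. a batch of 4 matrices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    # .dim() is the rank, .size() is the shape
    print(name, t.dim(), tuple(t.size()))
```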

Here we'll go over how to translate data to a list of vocabulary indices, and how to construct *batch tensors* out of the data for easy input to our model. 

We'll use the *torch.utils.data.DataLoader* object to handle batching and iteration.

### Converting tokenized sentence lists to vocabulary indices.

Let's assume we have the following vocabulary:

In [5]:
# this represents our vocabulary (note that "'ll" and "capital" are missing, so they will map to <unk>)
id_2_word = ["<pad>", "<unk>", "we", "always", "have", "paris",
              "i", "live", "in", "germany",
              "he", "comes", "from", "denmark",
              "the", "of", "is", "copenhagen"]

# build a dictionary that maps each word to its index in id_2_word
word_2_id = {w:i for i,w in enumerate(id_2_word)}

# display the mapping
word_2_id

{'<pad>': 0,
 '<unk>': 1,
 'we': 2,
 'always': 3,
 'have': 4,
 'paris': 5,
 'i': 6,
 'live': 7,
 'in': 8,
 'germany': 9,
 'he': 10,
 'comes': 11,
 'from': 12,
 'denmark': 13,
 'the': 14,
 'of': 15,
 'is': 16,
 'copenhagen': 17}
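The notebook simply assumes a vocabulary. As a rough sketch of how one might be derived from tokenized sentences instead, the hypothetical helper `build_vocab` below (not part of the original notebook, including its `min_count` and `specials` parameters) reserves indices 0 and 1 for `<pad>` and `<unk>` and then adds tokens in order of first appearance.

```python
from collections import Counter

def build_vocab(sentences, min_count=1, specials=("<pad>", "<unk>")):
    """Hypothetical helper: build index<->word mappings from tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    id_2_word = list(specials)            # special tokens take the first indices
    for tok, count in counts.items():     # Counter keeps first-seen order (Python 3.7+)
        if count >= min_count:
            id_2_word.append(tok)
    word_2_id = {w: i for i, w in enumerate(id_2_word)}
    return id_2_word, word_2_id

toy_sents = [["i", "live", "in", "germany"],
             ["he", "comes", "from", "denmark"]]
toy_id_2_word, toy_word_2_id = build_vocab(toy_sents)
print(toy_word_2_id)
```

Raising `min_count` would push rare tokens out of the vocabulary, so they would be looked up as `<unk>`, much like "'ll" and "capital" are in this notebook.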

In [6]:
# just seeing our work
instance = train_sents[0]
print(instance)

['we', "'ll", 'always', 'have', 'paris']


In [9]:
# looking at .get() ---> if we have a missing value, we would get the index for <unk> ---> represents unknown

# example of getting word index when we know a word
print(word_2_id.get('we', word_2_id["<unk>"]))

# how about for an unknown word? We should get "1" back 
print(word_2_id.get('david', word_2_id["<unk>"]))

2
1


In [10]:
def convert_tokens_to_inds(sentence, word_2_id):
    """
    This function takes in a sentence & our word_2_id dict
    Going to take each token in a list comp....
    For each token, applying .get() off dict to return either the words index from our corpus OR a value of "1" for unknown
    """
    return [word_2_id.get(t, word_2_id["<unk>"]) for t in sentence]

In [11]:
# we have now converted our sentence into numerical representation
token_inds = convert_tokens_to_inds(instance, word_2_id)
pp.pprint(token_inds)

[2, 1, 3, 4, 5]


Let's convince ourselves that worked:

In [12]:
# map the indices back through the id_2_word list to check the round trip
# notice that "'ll" came back as <unk> because it is not in our vocabulary
print([id_2_word[tok_idx] for tok_idx in token_inds])

['we', '<unk>', 'always', 'have', 'paris']


### Padding for windows.

In the word window classifier, for each word in the sentence we want to get the +/- n window around the word, where 0 <= n < len(sentence).

In order for such windows to be defined for words at the beginning and ends of the sentence, we actually want to insert padding around the sentence before converting to indices:

In [15]:
def pad_sentence_for_window(sentence, window_size, pad_token="<pad>"):
    """
    Generate a padding of pad * window_size 
    Concatenate padding * window + <input sentence> + padding * window
    """
    return [pad_token]*window_size + sentence + [pad_token]*window_size 

In [16]:
# original sentence
print(train_sents[0])

['we', "'ll", 'always', 'have', 'paris']


In [17]:
window_size = 2
instance = pad_sentence_for_window(train_sents[0], window_size)
print(instance)

['<pad>', '<pad>', 'we', "'ll", 'always', 'have', 'paris', '<pad>', '<pad>']
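To see why the padding is needed, here is a small pure-Python sketch that slices every +/- 2 window out of the padded sentence. The helper name `extract_windows` is hypothetical; the model later does the same thing on tensors.

```python
def extract_windows(padded_sentence, window_size):
    """Hypothetical helper: return one window of 2*window_size+1 tokens per original word."""
    width = 2 * window_size + 1
    return [padded_sentence[i:i + width]
            for i in range(len(padded_sentence) - width + 1)]

sentence = ['we', "'ll", 'always', 'have', 'paris']
padded = ['<pad>'] * 2 + sentence + ['<pad>'] * 2

wins = extract_windows(padded, window_size=2)
for center, win in zip(sentence, wins):
    # the center word always sits in the middle slot of its window
    print(center, win)
```

Thanks to the padding, there is exactly one window per original word, and even the first and last words get a full-width window.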


Let's make sure this works with our vocabulary:

In [19]:
for sent in train_sents:
    
    # nest the two helpers: first pad sent with window_size pad tokens on each side,
    # then pass the padded sentence straight to convert_tokens_to_inds along with word_2_id
    # output is a list of vocabulary indices representing the padded sentence
    # <pad> is our own vocabulary convention rather than a PyTorch requirement; giving it index 0
    # lines up with pad_sequence's default padding value and the padding_idx=0 used by the model later
    tok_idxs = convert_tokens_to_inds(pad_sentence_for_window(sent, window_size), word_2_id)
    
    # again, sanity check
    print([id_2_word[idx] for idx in tok_idxs])

['<pad>', '<pad>', 'we', '<unk>', 'always', 'have', 'paris', '<pad>', '<pad>']
['<pad>', '<pad>', 'i', 'live', 'in', 'germany', '<pad>', '<pad>']
['<pad>', '<pad>', 'he', 'comes', 'from', 'denmark', '<pad>', '<pad>']
['<pad>', '<pad>', 'the', '<unk>', 'of', 'denmark', 'is', 'copenhagen', '<pad>', '<pad>']


### Batching sentences together with a DataLoader

When we train our model, we rarely update with respect to a single training instance at a time, because a single instance provides a very noisy estimate of the global loss's gradient. We instead construct small *batches* of data, and update parameters for each batch. 

Given some batch size, we want to construct batch tensors out of the word index lists we've just created with our vocab.

For each length B list of inputs, we'll have to:

    (1) Add window padding to sentences in the batch like we just saw.
    (2) Add additional padding so that each sentence in the batch is the same length.
    (3) Make sure our labels are in the desired format.

At the level of the dataset we want:

    (4) Easy shuffling, because shuffling from one training epoch to the next gets rid of 
        pathological batches that are tough to learn from.
    (5) Making sure we shuffle inputs and their labels together!
    
PyTorch provides us with an object *torch.utils.data.DataLoader* that gets us (4) and (5). All that's required of us is to specify a *collate_fn* that tells it how to do (1), (2), and (3). 
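Step (2) is the only piece we have not seen yet. As a minimal sketch, `nn.utils.rnn.pad_sequence` takes a list of tensors of different lengths and right-pads the shorter ones (with 0 by default) so they stack into one batch tensor. The two index lists below are the window-padded versions of the first two training sentences.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# window-padded index lists of different lengths (9 and 8)
seqs = [torch.LongTensor([0, 0, 2, 1, 3, 4, 5, 0, 0]),
        torch.LongTensor([0, 0, 6, 7, 8, 9, 0, 0])]

# batch_first=True gives a (batch, max_len) tensor; padding_value defaults to 0
padded = pad_sequence(seqs, batch_first=True)
print(padded)
print(padded.size())
```

Because `<pad>` has index 0 in our vocabulary, the extra zeros added here are indistinguishable from window padding, which is what we want.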

In [21]:
print(type(train_labels[0]))

<class 'list'>


In [22]:
torch.LongTensor([0,0,1])

tensor([0, 0, 1])

In [29]:
# we are just converting target variable to a tensor
l = torch.LongTensor(train_labels[0])
pp.pprint(("raw train label instance", l))
print(l.size())


('raw train label instance', tensor([0, 0, 0, 0, 1]))
torch.Size([5])


In [30]:
# building out one-hot encoding for target: move from a length-5 vector to a 2 x 5 matrix (empty for now)
one_hots = torch.zeros((2, len(l)))
pp.pprint(("unfilled label instance", one_hots))
print(one_hots.size())

('unfilled label instance',
 tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]))
torch.Size([2, 5])


In [31]:
# we now write l into the second row (index 1)
one_hots[1] = l
pp.pprint(("one-hot labels", one_hots))

('one-hot labels', tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1.]]))


In [33]:
?l.byte

In [38]:
~l.byte()  # bitwise NOT on a uint8 tensor: ~0 = 255 and ~1 = 254, so this is not a logical "not"; 1 - l gives the intended 0/1 complement

tensor([255, 255, 255, 255, 254], dtype=torch.uint8)

In [39]:
# bitwise NOT of the uint8 version of l (0 -> 255, 1 -> 254), not a logical NOT
l_not = ~l.byte()

print(l_not)
# write it into the first row (index 0)
one_hots[0] = l_not
pp.pprint(("one-hot labels", one_hots))

tensor([255, 255, 255, 255, 254], dtype=torch.uint8)
('one-hot labels',
 tensor([[255., 255., 255., 255., 254.],
        [  0.,   0.,   0.,   0.,   1.]]))
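The 255/254 values are not a valid one-hot encoding. Below is a sketch of two ways to get a proper 0/1 encoding on current PyTorch: subtracting from 1, and `torch.nn.functional.one_hot`. It may be that older PyTorch releases treated `~` on a byte tensor as a logical NOT, which would explain the original code, but that is an assumption about the notebook's history.

```python
import torch
import torch.nn.functional as F

l = torch.LongTensor([0, 0, 0, 0, 1])

# Option 1: same (2, 5) layout as the cells above; row 0 is "not a location"
one_hots_rows = torch.stack([1 - l, l]).float()

# Option 2: built-in one-hot, giving a (5, 2) tensor with one row per token
one_hots_cols = F.one_hot(l, num_classes=2).float()

# logical NOT via a bool tensor gives the same complement as 1 - l
logical_not = (~l.bool()).long()

print(one_hots_rows)
print(one_hots_cols)
```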


In [40]:
from torch.utils.data import DataLoader
from functools import partial

In [41]:
# this seems to be doing everything we just did in one function

def my_collate(data, window_size, word_2_id):
    """
    For some chunk of sentences and labels
        -add window padding
        -pad for lengths using pad_sequence
        -convert our labels to one-hots
        -return padded inputs, one-hot labels, and lengths
    """
    
    # data assumed to be a list which can be turned into two tuples
    x_s, y_s = zip(*data)

    # handle input sentences as we've seen: window-pad each sentence, then convert its tokens to indices
    window_padded = [convert_tokens_to_inds(pad_sentence_for_window(sentence, window_size), word_2_id)
                                                                                  for sentence in x_s]
    
    # this is new: ensure everything is long enough 
    # append zeros to each list of token ids in batch so that they are all the same length
    padded = nn.utils.rnn.pad_sequence([torch.LongTensor(t) for t in window_padded], batch_first=True)
    
    # convert labels to one-hots
    labels = []
    lengths = []
    for y in y_s:
        lengths.append(len(y))
        label = torch.zeros((len(y),2 ))
        true = torch.LongTensor(y) 
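        # NOTE: ~ on a uint8 tensor is a bitwise NOT (0 -> 255, 1 -> 254), which produces the
        # 255/254 values in the label outputs below; (1 - true) would give a 0/1 complement
        false = ~true.byte()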
        label[:, 0] = false
        label[:, 1] = true
        labels.append(label)
    padded_labels = nn.utils.rnn.pad_sequence(labels, batch_first=True) # will take all sequences & pad to max 
    
    return padded.long(), padded_labels, torch.LongTensor(lengths)

In [42]:
# Shuffle True is good practice for train loaders.
# Use functools.partial to construct a partially populated collate function
# batch_size=2 means each batch holds two (sentence, labels) pairs
example_loader = DataLoader(list(zip(train_sents, 
                                                      train_labels)), 
                                             batch_size=2, 
                                             shuffle=True, 
                                             collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))

In [43]:
example_loader # just an iterable with all of our batches

<torch.utils.data.dataloader.DataLoader at 0x7f0c915f5f98>

#### How batches look: 

- Each batch has two samples & is used once per epoch:
    - In the first batch the inputs form a 2 x 9 tensor, which makes sense because the longest sentence there has 5 tokens and we asked for padding of 2 on each side (5 + 2 + 2 = 9)
    - The other batch contains the sentence with 6 tokens, which increases the tensor to 2 x 10, as we see below
    
- Each label matrix will be 5 x 2 when the longest sentence in the batch has 5 tokens
    - 5 is the number of tokens 
    - one column is meant to flag "not a location" and the other "location". The 255 & 254 values come from `~true.byte()`: on a uint8 tensor `~` is a bitwise NOT, so 0 becomes 255 and 1 becomes 254 rather than 1 and 0

In [45]:
for batched_input, batched_labels, batch_lengths in example_loader:
    pp.pprint(("inputs", batched_input, batched_input.size()))
    pp.pprint(batch_lengths)

('inputs',
 tensor([[0, 0, 2, 1, 3, 4, 5, 0, 0],
        [0, 0, 6, 7, 8, 9, 0, 0, 0]]),
 torch.Size([2, 9]))
tensor([5, 4])
('inputs',
 tensor([[ 0,  0, 10, 11, 12, 13,  0,  0,  0,  0],
        [ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0]]),
 torch.Size([2, 10]))
tensor([4, 6])


In [47]:
# iterate through example_loader:
# take batched input, batched labels, batch length
for batched_input, batched_labels, batch_lengths in example_loader:
    pp.pprint(("inputs", batched_input, batched_input.size()))
    pp.pprint(("labels", batched_labels, batched_labels.size()))
    pp.pprint(batch_lengths)
    

('inputs',
 tensor([[0, 0, 6, 7, 8, 9, 0, 0, 0],
        [0, 0, 2, 1, 3, 4, 5, 0, 0]]),
 torch.Size([2, 9]))
('labels',
 tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]]),
 torch.Size([2, 5, 2]))
tensor([4, 5])
('inputs',
 tensor([[ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0],
        [ 0,  0, 10, 11, 12, 13,  0,  0,  0,  0]]),
 torch.Size([2, 10]))
('labels',
 tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [255.,   0.],
         [254.,   1.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.],
         [  0.,   0.]]]),
 torch.Size([2, 6, 2]))
tensor([6, 4])


## Modeling

### Thinking through vectorization of word windows.
Before we go ahead and build our model, let's think about the first thing it needs to do to its inputs.

We're passed batches of sentences. For each sentence i in the batch, for each word j in the sentence, we want to construct a single tensor out of the embeddings surrounding word j in the +/- n window.

Thus, the first thing we're going to need is a (B, L, 2N+1) tensor of token indices.

A *terrible* but nevertheless informative *iterative* solution looks something like the following, where we iterate through batch elements in our (dummy) input, then through the non-padded word positions in each, and for each non-padded word position construct a window:

In [54]:
# make a 2 x 8 matrix of 0s
dummy_input = torch.zeros(2, 8).long()
print(dummy_input)
print("*************")
# arrange the integers 1 through 8 into a 2 x 4 matrix
print(torch.arange(1,9).view(2,4))
print("*************")

# overwrite all rows, and cols between (2, last_col - 2)
# this is just an example of input sentences with 2x padding on either side 
dummy_input[:,2:-2] = torch.arange(1,9).view(2,4)
pp.pprint(dummy_input)

tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]])
*************
tensor([[1, 2, 3, 4],
        [5, 6, 7, 8]])
*************
tensor([[0, 0, 1, 2, 3, 4, 0, 0],
        [0, 0, 5, 6, 7, 8, 0, 0]])


In [55]:
# horribly ugly list comprehension for building the sliding windows with a half-window of 2
[[[dummy_input[i, j-2+k].item() for k in range(2*2+1)] 
                                                     for j in range(2, 6)] 
                                                            for i in range(2)]

[[[0, 0, 1, 2, 3], [0, 1, 2, 3, 4], [1, 2, 3, 4, 0], [2, 3, 4, 0, 0]],
 [[0, 0, 5, 6, 7], [0, 5, 6, 7, 8], [5, 6, 7, 8, 0], [6, 7, 8, 0, 0]]]

In [56]:
dummy_output = [[[dummy_input[i, j-2+k].item() for k in range(2*2+1)] 
                                                     for j in range(2, 6)] 
                                                            for i in range(2)]
dummy_output = torch.LongTensor(dummy_output)
print(dummy_output.size())
pp.pprint(dummy_output)

torch.Size([2, 4, 5])
tensor([[[0, 0, 1, 2, 3],
         [0, 1, 2, 3, 4],
         [1, 2, 3, 4, 0],
         [2, 3, 4, 0, 0]],

        [[0, 0, 5, 6, 7],
         [0, 5, 6, 7, 8],
         [5, 6, 7, 8, 0],
         [6, 7, 8, 0, 0]]])


### What is this?

- each matrix represents a single sentence, with a shifted center word. 
- looks like we can accomplish the same thing using .unfold()

*Technically* it works: For each element in the batch, for each word in the original sentence and ignoring window padding, we've got the 5 token indices centered at that word. But in practice it will be crazy slow.

Instead, we ideally want to find the right tensor operation in the PyTorch arsenal. Here, that happens to be __Tensor.unfold__.

In [57]:
dummy_input.unfold(1, 2*2+1, 1)

tensor([[[0, 0, 1, 2, 3],
         [0, 1, 2, 3, 4],
         [1, 2, 3, 4, 0],
         [2, 3, 4, 0, 0]],

        [[0, 0, 5, 6, 7],
         [0, 5, 6, 7, 8],
         [5, 6, 7, 8, 0],
         [6, 7, 8, 0, 0]]])
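As a quick check that the two approaches agree, the sketch below rebuilds the dummy input, computes the windows both ways, and compares them. `unfold(dimension, size, step)` slides a window of length `size` along `dimension` in steps of `step`; the names `half` and `width` are just local conveniences.

```python
import torch

# same dummy input as above: two "sentences" with 2 pads on each side
dummy_input = torch.zeros(2, 8).long()
dummy_input[:, 2:-2] = torch.arange(1, 9).view(2, 4)

half = 2
width = 2 * half + 1

# slow, explicit version: one Python-level lookup per window element
looped = torch.LongTensor(
    [[[dummy_input[i, j - half + k].item() for k in range(width)]
      for j in range(half, dummy_input.size(1) - half)]
     for i in range(dummy_input.size(0))])

# vectorized version: slide a width-5 window along dimension 1 with step 1
unfolded = dummy_input.unfold(1, width, 1)

print(torch.equal(looped, unfolded))
print(unfolded.size())
```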

### A model in full.

In PyTorch, we implement models by extending the nn.Module class. Minimally, this requires implementing an *\_\_init\_\_* function and a *forward* function.

In *\_\_init\_\_* we want to store model parameters (weights) and hyperparameters (dimensions).


In [58]:
# class that inherits stuff from nn.Module class
# super is going to allow us to call the __init__ of the nn.Module class & use that information in our class

class SoftmaxWordWindowClassifier(nn.Module):
    """
    A one-layer, binary word-window classifier.
    
    We initialize with a specific window size, embedding dim, hidden dim, number of classes, etc. 
    We just pass in a config dict that holds all of this information; pad_idx defaults to 0
    """
    def __init__(self, config, vocab_size, pad_idx=0):
        super(SoftmaxWordWindowClassifier, self).__init__()
        """
        Instance variables; super().__init__() runs the nn.Module (parent class) constructor so that the layers assigned below are registered as submodules
        """
        self.window_size = 2*config["half_window"]+1
        self.embed_dim = config["embed_dim"]
        self.hidden_dim = config["hidden_dim"]
        self.num_classes = config["num_classes"]
        self.freeze_embeddings = config["freeze_embeddings"]
        
        """
        Embedding layer --> maps integer token indices to dense vectors
        -model holds an embedding for each word in our vocab
        -sets aside a special index in the embedding matrix for padding vector (of zeros)
        -by default, embeddings are parameters (so gradients pass through them)
        """
        self.embed_layer = nn.Embedding(vocab_size, self.embed_dim, padding_idx=pad_idx)
        if self.freeze_embeddings:
            self.embed_layer.weight.requires_grad = False
        
        """
        Hidden layer---> take embedded word windows and pass them through a linear map plus nonlinearity
        -we want to map embedded word windows of dim self.window_size*self.embed_dim to a hidden layer
         (self.window_size is already 2*half_window+1).
        -nn.Sequential allows you to efficiently specify sequentially structured models
            -first the linear transformation is applied to the embedded word windows
            -next the nonlinear transformation tanh is applied.
            
        - the weights of this layer are updated during training
        """
        self.hidden_layer = nn.Sequential(nn.Linear(self.window_size*self.embed_dim, 
                                                    self.hidden_dim), 
                                          nn.Tanh())
        
        """
        Output layer
        -we want to map elements of the output layer (of size self.hidden dim) to a number of classes.
        """
        self.output_layer = nn.Linear(self.hidden_dim, self.num_classes)
        
        """
        Softmax ---> convert class scores to log-probabilities 
        -The final step of the softmax classifier: mapping final hidden layer to class scores.
        -pytorch has both logsoftmax and softmax functions (and many others)
        -since our loss is the negative LOG likelihood, we use logsoftmax
        -technically you can take the softmax, and take the log but PyTorch's implementation
         is optimized to avoid numerical underflow issues.
        """
        self.log_softmax = nn.LogSoftmax(dim=2)
        
    def forward(self, inputs):
        """
        Let B:= batch_size
            L:= window-padded sentence length
            D:= self.embed_dim
            S:= self.window_size
            H:= self.hidden_dim
            
        inputs: a (B, L) tensor of token indices
        """
        B, L = inputs.size()
        
        """
        Reshaping.
        Takes in a (B, L) LongTensor
        Outputs a (B, L~, S) LongTensor
        """
        # First, get our word windows for each word in our input.
        token_windows = inputs.unfold(1, self.window_size, 1)
        _, adjusted_length, _ = token_windows.size()
        
        # Good idea to do internal tensor-size sanity checks, at the least in comments!
        assert token_windows.size() == (B, adjusted_length, self.window_size)
        
        """
        Embedding.
        Takes in a torch.LongTensor of size (B, L~, S) 
        Outputs a (B, L~, S, D) FloatTensor.
        """
        embedded_windows = self.embed_layer(token_windows)
        
        """
        Reshaping.
        Takes in a (B, L~, S, D) FloatTensor.
        Resizes it into a (B, L~, S*D) FloatTensor.
        -1 argument "infers" what the last dimension should be based on leftover axes.
        """
        embedded_windows = embedded_windows.view(B, adjusted_length, -1)
        
        """
        Layer 1.
        Takes in a (B, L~, S*D) FloatTensor.
        Maps it to a (B, L~, H) FloatTensor
        """
        layer_1 = self.hidden_layer(embedded_windows)
        
        """
        Layer 2
        Takes in a (B, L~, H) FloatTensor.
        Maps it to a (B, L~, 2) FloatTensor.
        """
        output = self.output_layer(layer_1)
        
        """
        Softmax.
        Takes in a (B, L~, 2) FloatTensor of unnormalized class scores.
        Outputs a (B, L~, 2) FloatTensor of (log-)normalized class scores.
        """
        output = self.log_softmax(output)
        
        return output
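To trace the shapes through `forward` without the class, here is a standalone sketch that chains the same operations on a random batch. The sizes (B=2, L=9, vocabulary of 18, and so on) are chosen to mirror this notebook and are otherwise arbitrary; the layers are freshly initialized, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, L = 2, 9                  # batch size, window-padded sentence length
half_window, D, H, C = 2, 25, 25, 2
S = 2 * half_window + 1      # full window size
vocab_size = 18

inputs = torch.randint(0, vocab_size, (B, L))           # (B, L) token indices

embed = nn.Embedding(vocab_size, D, padding_idx=0)
hidden = nn.Sequential(nn.Linear(S * D, H), nn.Tanh())
out = nn.Linear(H, C)

windows = inputs.unfold(1, S, 1)                        # (B, L~, S) with L~ = L - S + 1
embedded = embed(windows)                               # (B, L~, S, D)
flat = embedded.view(B, -1, S * D)                      # (B, L~, S*D)
log_probs = torch.log_softmax(out(hidden(flat)), dim=2) # (B, L~, C)

for name, t in [("windows", windows), ("embedded", embedded),
                ("flat", flat), ("log_probs", log_probs)]:
    print(name, tuple(t.size()))
```

Note that L~ equals the unpadded sentence length (9 - 5 + 1 = 5), so the model emits one pair of class scores per original word.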

### Training.

Now that we've got a model, we have to train it.

In [59]:
def loss_function(outputs, labels, lengths):
    """Computes negative LL loss on a batch of model predictions."""
    B, L, num_classes = outputs.size()
    num_elems = lengths.sum().float()
        
    # get only the values with non-zero labels
    loss = outputs*labels
    
    # rescale average
    return -loss.sum() / num_elems
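Multiplying log-probabilities by one-hot labels picks out the log-probability of the correct class at each position. As a sanity-check sketch, assuming proper 0/1 one-hot labels (not the 255/254 values produced earlier), the manual computation matches PyTorch's built-in `F.nll_loss` on a random example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# fake model output: log-probabilities for 1 sentence, 4 tokens, 2 classes
log_probs = torch.log_softmax(torch.randn(1, 4, 2), dim=2)
targets = torch.LongTensor([[0, 0, 0, 1]])

# manual version, as in loss_function above, with true 0/1 one-hot labels
one_hot = F.one_hot(targets, num_classes=2).float()
manual = -(log_probs * one_hot).sum() / targets.numel()

# built-in negative log likelihood, averaged over the 4 tokens
builtin = F.nll_loss(log_probs.view(-1, 2), targets.view(-1), reduction="mean")

print(manual.item(), builtin.item())
```

Positions added by `pad_sequence` have all-zero label rows, so they contribute nothing to the sum, and dividing by `lengths.sum()` averages over real tokens only.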

In [60]:
def train_epoch(loss_function, optimizer, model, train_data):
    
    ## For each batch, we must reset the gradients
    ## stored by the model.   
    total_loss = 0
    for batch, labels, lengths in train_data:
        # clear gradients
        optimizer.zero_grad()
        # invoke the model on the batch (calling the module runs forward)
        outputs = model(batch)
        # compute loss w.r.t batch
        loss = loss_function(outputs, labels, lengths)
        # pass gradients back, starting from the loss value
        loss.backward()
        # update parameters
        optimizer.step()
        total_loss += loss.item()
    
    # return the total to keep track of how you did this time around
    return total_loss
    

In [69]:
config = {"batch_size": 4,
          "half_window": 2,
          "embed_dim": 25,
          "hidden_dim": 25,
          "num_classes": 2,
          "freeze_embeddings": False,
         }
learning_rate = .01 # increased slightly 
num_epochs = 10000
model = SoftmaxWordWindowClassifier(config, len(word_2_id))
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

In [70]:
train_loader = torch.utils.data.DataLoader(list(zip(train_sents, train_labels)), 
                                           batch_size=2, 
                                           shuffle=True, 
                                           collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))

In [71]:
losses = []
for epoch in range(num_epochs):
    epoch_loss = train_epoch(loss_function, optimizer, model, train_loader)
    if epoch % 100 == 0:
        losses.append(epoch_loss)
print(losses)

[233.4204339981079, 3.4482436180114746, 3.4304192066192627, 3.4251338243484497, 3.42595112323761, 3.4210832118988037, 3.4200522899627686, 3.4193774461746216, 3.4220848083496094, 3.4184032678604126, 3.421312093734741, 3.417765736579895, 3.4207944869995117, 3.4173309803009033, 3.4204272031784058, 3.4202773571014404, 3.416868805885315, 3.416768193244934, 3.4166417121887207, 3.416563630104065, 3.419771432876587, 3.419699549674988, 3.419638991355896, 3.419572114944458, 3.4162217378616333, 3.4161837100982666, 3.416123151779175, 3.41937792301178, 3.41605281829834, 3.419306516647339, 3.4159823656082153, 3.4159417152404785, 3.41591477394104, 3.415880560874939, 3.4158592224121094, 3.4191243648529053, 3.419097900390625, 3.415781617164612, 3.419055461883545, 3.4157304763793945, 3.415719747543335, 3.4156914949417114, 3.4156795740127563, 3.415666699409485, 3.4189419746398926, 3.4189257621765137, 3.4156134128570557, 3.4188947677612305, 3.4155913591384888, 3.418869376182556, 3.4155662059783936, 3.4155

### Prediction.

In [72]:
test_loader = torch.utils.data.DataLoader(list(zip(test_sents, test_labels)), 
                                           batch_size=1, 
                                           shuffle=False, 
                                           collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))

In [73]:
for test_instance, labs, _ in test_loader:
    outputs = model.forward(test_instance)
    print(torch.argmax(outputs, dim=2))
    print(torch.argmax(labs, dim=2))

tensor([[0, 0, 0, 0]])
tensor([[0, 0, 0, 0]])
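The test labels for "she comes from paris" are [0, 0, 0, 1], yet the argmax of the batched labels printed above is all zeros. A small sketch suggests why: with the 255/254 values in the first column, column 0 always wins the argmax, whereas a proper 0/1 one-hot encoding recovers the original labels.

```python
import torch

# labels as produced by my_collate for [0, 0, 0, 1] (bitwise NOT in column 0)
corrupted = torch.tensor([[255., 0.], [255., 0.], [255., 0.], [254., 1.]])

# what a proper one-hot encoding of [0, 0, 0, 1] looks like
proper = torch.tensor([[1., 0.], [1., 0.], [1., 0.], [0., 1.]])

print(torch.argmax(corrupted, dim=1))  # tensor([0, 0, 0, 0])
print(torch.argmax(proper, dim=1))     # tensor([0, 0, 0, 1])
```

So the matching all-zero rows above do not by themselves show that the model found "paris"; the same label values also feed the training loss, which is worth keeping in mind when reading the loss curve.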
