## Word Window Classification

In this lab we will attempt to solve a simple toy NLP task. Here are the things we will learn:

1. Data: Creating a Dataset of Batched Tensors
2. Modeling
3. Training
4. Prediction 

In this section, our goal is to train a model that finds the words in a sentence that correspond to a `LOCATION`. Each location will always have a span of `1` (meaning that `San Francisco` won't be recognized as a `LOCATION`). Our task is called `Word Window Classification` for a reason. Instead of letting our model look at only one word in each forward pass, we would like it to consider the context of the word in question. That is, for each word, we want our model to be aware of the surrounding words. Let's dive in!

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

#### Data

The very first task of any machine learning project is to set up our training set. Usually, there will be a training corpus we will be utilizing. In NLP tasks, the corpus would generally be a `.txt` or `.csv` file where each row corresponds to a sentence or a tabular datapoint. In our toy task, we will assume that we have already read our data and the corresponding labels into a `Python` list.

In [2]:
### This is our training dataset :)
# Our raw data, which consists of sentences
corpus = [
          "We always come to Paris",
          "The professor is from Australia",
          "I live in Stanford",
          "He comes from Taiwan",
          "The capital of Turkey is Ankara"
         ]

#### Task-1: Preprocessing

To make it easier for our models to learn, we usually apply a few preprocessing steps to our data. This is especially important when dealing with text data. Here are some examples of text preprocessing:
* **Tokenization**: Tokenizing the sentences into words.
* **Lowercasing**: Changing all the letters to be lowercase.
* **Noise removal**: Removing special characters (such as punctuation).
* **Stop words removal**: Removing commonly used words.

Which preprocessing steps are necessary is determined by the task at hand. For example, although it is useful to remove special characters in some tasks, for others they may be important (for example, if we are dealing with multiple languages). For our task, we will lowercase our words and tokenize.
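
As a rough illustration of the two steps we are *not* using in this lab, the sketch below adds noise removal and stop word removal on top of lowercasing and tokenization. The function name `preprocess_full` and the tiny `STOP_WORDS` set are hypothetical and exist only for this example; real projects typically rely on a curated stop word list.

```python
import string

# Hypothetical, tiny stop word list used only for illustration
STOP_WORDS = {"the", "is", "of", "to", "in", "from"}

def preprocess_full(sentence, remove_stop_words=False):
  # Lowercasing
  sentence = sentence.lower()
  # Noise removal: strip ASCII punctuation characters
  sentence = sentence.translate(str.maketrans("", "", string.punctuation))
  # Tokenization: split on whitespace
  tokens = sentence.split()
  # Stop word removal (optional)
  if remove_stop_words:
    tokens = [t for t in tokens if t not in STOP_WORDS]
  return tokens

kept = preprocess_full("The capital of Turkey is Ankara!")
dropped = preprocess_full("The capital of Turkey is Ankara!", remove_stop_words=True)
print(kept)     # ['the', 'capital', 'of', 'turkey', 'is', 'ankara']
print(dropped)  # ['capital', 'turkey', 'ankara']
```

Note that stop word removal would also drop words such as `to` and `from`, which are useful context for spotting a `LOCATION`. This is one reason we only lowercase and tokenize in this task.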


In [3]:
# The preprocessing function we will use to generate our training examples
# Our function is a simple one, we lowercase the letters
# and then tokenize the words.
def preprocess_sentence(sentence):
  return sentence.lower().split()
  # return None #To Do: task-1a lowercase and tokenize

# Create our training set
train_sentences = [preprocess_sentence(sent) for sent in corpus]
train_sentences ### lowercase and tokenize

[['we', 'always', 'come', 'to', 'paris'],
 ['the', 'professor', 'is', 'from', 'australia'],
 ['i', 'live', 'in', 'stanford'],
 ['he', 'comes', 'from', 'taiwan'],
 ['the', 'capital', 'of', 'turkey', 'is', 'ankara']]

#### Simple check task-1a

In [4]:
test_train_sentences = [['we', 'always', 'come', 'to', 'paris'],
                    ['the', 'professor', 'is', 'from', 'australia'],
                    ['i', 'live', 'in', 'stanford'],
                    ['he', 'comes', 'from', 'taiwan'],
                    ['the', 'capital', 'of', 'turkey', 'is', 'ankara']]
assert test_train_sentences == train_sentences, 'test failed!'
print('Pass!')

Pass!


For each training example we have, we should also have a corresponding label. Recall that the goal of our model was to determine which words correspond to a `LOCATION`. That is, we want our model to output `0` for all the words that are not `LOCATION`s and `1` for the ones that are `LOCATION`s.

In [5]:
# Set of locations that appear in our corpus
locations = set(["australia", "ankara", "paris", "stanford", "taiwan", "turkey"])

# Our train labels
train_labels = [[1 if word in locations else 0 for word in sent] for sent in train_sentences]
# train_labels = None #To Do: task-1b produce the correct training labels
train_labels  ### Construct the true labels of the dataset

[[0, 0, 0, 0, 1],
 [0, 0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1, 0, 1]]

#### Simple check task-1b

In [6]:
test_train_labels = [[0, 0, 0, 0, 1],
                        [0, 0, 0, 0, 1],
                        [0, 0, 0, 1],
                        [0, 0, 0, 1],
                        [0, 0, 0, 1, 0, 1]]
assert test_train_labels == train_labels, 'test failed!'
print('Pass!')

Pass!


#### Converting Words to Embeddings

Let's look at our training data a little more closely. Each datapoint we have is a sequence of words. On the other hand, we know that machine learning models work with numbers in vectors. How are we going to turn words into numbers? You may be thinking embeddings and you are right!

Imagine that we have an embedding lookup table `E`, where each row corresponds to an embedding. That is, each word in our vocabulary would have a corresponding embedding row `i` in this table. Whenever we want to find an embedding for a word, we will follow these steps:
1. Find the corresponding index `i` of the word in the embedding table: `word->index`.
2. Index into the embedding table and get the embedding: `index->embedding`.

Let's look at the first step. We should assign all the words in our vocabulary to a corresponding index. We can do it as follows:
1. Find all the unique words in our corpus.
2. Assign an index to each.

In [7]:
# Find all the unique words in our corpus
vocabulary = set(w for s in train_sentences for w in s)
# vocabulary = None #To Do: task-1c find the unique words from the training set
vocabulary   ### construct our unique vocabulary

{'always',
 'ankara',
 'australia',
 'capital',
 'come',
 'comes',
 'from',
 'he',
 'i',
 'in',
 'is',
 'live',
 'of',
 'paris',
 'professor',
 'stanford',
 'taiwan',
 'the',
 'to',
 'turkey',
 'we'}

#### Simple check task-1c

In [8]:
test_vocabulary = {'always',
                    'ankara',
                    'australia',
                    'capital',
                    'come',
                    'comes',
                    'from',
                    'he',
                    'i',
                    'in',
                    'is',
                    'live',
                    'of',
                    'paris',
                    'professor',
                    'stanford',
                    'taiwan',
                    'the',
                    'to',
                    'turkey',
                    'we'}
assert test_vocabulary == vocabulary, 'test failed!'
print('Pass!')

Pass!


`vocabulary` now contains all the words in our corpus. At test time, however, we may encounter words that are not in our vocabulary. If we can figure out a way to represent these unknown words, our model can still reason about whether they are a `LOCATION` or not, since we also look at the neighboring words for each prediction.

We introduce a special token, `<unk>`, to tackle the words that are out of vocabulary. We could pick another string for our unknown token if we wanted. The only requirement here is that our token should be unique: we should only be using this token for unknown words. We will also add this special token to our vocabulary.

In [9]:
# Add the unknown token to our vocabulary
vocabulary.add("<unk>")

Earlier we mentioned that our task is called `Word Window Classification` because our model looks at the surrounding words in addition to the given word when it makes a prediction.

For example, let's take the sentence "We always come to Paris". The corresponding training label for this sentence is `0, 0, 0, 0, 1` since only Paris, the last word, is a `LOCATION`. In one pass (meaning a call to `forward()`), our model will try to generate the correct label for one word. Let's say our model is trying to generate the correct label `1` for `Paris`. If we only allow our model to see `Paris`, but nothing else, we will miss out on the important information that the word `to` often appears with `LOCATION`s.

Word windows allow our model to consider the `N` words before and the `N` words after each word when making a prediction. In our earlier example for `Paris`, if we have a window size of `1`, our model will look at the words that come immediately before and after `Paris`, which are `to`, and, well, nothing. This raises another issue. `Paris` is at the end of our sentence, so there isn't another word following it. Remember that we define the input dimensions of our `PyTorch` models when we initialize them. If we set the window size to `1`, our model will accept `3` words in every pass. We cannot have our model expect `2` words from time to time.

The solution is to introduce a special token, such as `<pad>`, that is added to our sentences to make sure that every word has a valid window around it. Similar to the `<unk>` token, we could pick another string for our pad token if we wanted, as long as we make sure it is used for a unique purpose.

In [10]:
# Add the <pad> token to our vocabulary
vocabulary.add("<pad>")

# Function that pads the given sentence
# We are introducing this function here as an example
# We will be utilizing it later in the tutorial
def pad_window(sentence, window_size, pad_token="<pad>"):
  window = [pad_token] * window_size
  # window = None #To Do: task-1d add pad_token to the sentence considering the window size
  return window + sentence + window

# Show padding example
window_size = 2
test_pading = pad_window(train_sentences[0], window_size=window_size)
test_pading   ### Do not forget to PAD your sentences

['<pad>', '<pad>', 'we', 'always', 'come', 'to', 'paris', '<pad>', '<pad>']

#### Simple check task-1d

In [11]:
crct_pad_vec = ['<pad>', '<pad>', 'we', 'always', 'come', 'to', 'paris', '<pad>', '<pad>']
assert test_pading == crct_pad_vec, 'test failed!'
print('Pass!')

Pass!
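
To make the idea of a word window concrete, here is a minimal sketch that slices a padded sentence into one window per word. `pad_window` is repeated from the cell above so that the block is self-contained; `extract_windows` is a hypothetical helper introduced only for this illustration and is not necessarily how the model will build its windows later in the lab.

```python
def pad_window(sentence, window_size, pad_token="<pad>"):
  window = [pad_token] * window_size
  return window + sentence + window

# Hypothetical helper: one window of 2 * window_size + 1 tokens per word
def extract_windows(sentence, window_size):
  padded = pad_window(sentence, window_size)
  width = 2 * window_size + 1
  # The window starting at padded position i is centered on sentence[i]
  return [padded[i:i + width] for i in range(len(sentence))]

sentence = ["we", "always", "come", "to", "paris"]
windows = extract_windows(sentence, window_size=1)
for word, window in zip(sentence, windows):
  print(word, "->", window)
# we -> ['<pad>', 'we', 'always']
# always -> ['we', 'always', 'come']
# come -> ['always', 'come', 'to']
# to -> ['come', 'to', 'paris']
# paris -> ['to', 'paris', '<pad>']
```

Every word, including the first and the last, now gets a window of exactly `3` tokens, which is what lets the model keep a fixed input size.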


Now that our vocabulary is ready, let's assign an index to each of our words.

In [12]:
# We are just converting our vocabulary to a list to be able to index into it
# Sorting is not necessary, we sort to show an ordered word_to_ix dictionary
# That being said, we will see that having the index for the padding token
# be 0 is convenient, as some PyTorch functions use it as a default value,
# such as nn.utils.rnn.pad_sequence, which we will cover in a bit
ix_to_word = sorted(list(vocabulary))

# Creating a dictionary to find the index of a given word
word_to_ix = {word: ind for ind, word in enumerate(ix_to_word)}
# word_to_ix = None #To Do: task-1e construct the needed dict
word_to_ix

{'<pad>': 0,
 '<unk>': 1,
 'always': 2,
 'ankara': 3,
 'australia': 4,
 'capital': 5,
 'come': 6,
 'comes': 7,
 'from': 8,
 'he': 9,
 'i': 10,
 'in': 11,
 'is': 12,
 'live': 13,
 'of': 14,
 'paris': 15,
 'professor': 16,
 'stanford': 17,
 'taiwan': 18,
 'the': 19,
 'to': 20,
 'turkey': 21,
 'we': 22}

#### Simple check task-1e

In [13]:
test_word_to_ix = {'<pad>': 0,
                    '<unk>': 1,
                    'always': 2,
                    'ankara': 3,
                    'australia': 4,
                    'capital': 5,
                    'come': 6,
                    'comes': 7,
                    'from': 8,
                    'he': 9,
                    'i': 10,
                    'in': 11,
                    'is': 12,
                    'live': 13,
                    'of': 14,
                    'paris': 15,
                    'professor': 16,
                    'stanford': 17,
                    'taiwan': 18,
                    'the': 19,
                    'to': 20,
                    'turkey': 21,
                    'we': 22}
assert test_word_to_ix == word_to_ix, 'test failed!'
print('Pass!')

Pass!


Great! We are ready to convert our training sentences into a sequence of indices corresponding to each token.

In [14]:
# Given a sentence of tokens, return the corresponding indices
def convert_token_to_indices(sentence, word_to_ix): ### convert words into indices so embeddings can be found later using nn.Embedding
  indices = []
  for token in sentence:
    # Check if the token is in our vocabulary. If it is, get its index.
    # If not, get the index for the unknown token.
    # To Do: task-1f
    # pass
    if token in word_to_ix:
      index = word_to_ix[token]
    else:
      index = word_to_ix["<unk>"]
    indices.append(index)
  return indices

# More compact version of the same function
def _convert_token_to_indices(sentence, word_to_ix):
  return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]

# Show an example
example_sentence = ["we", "always", "come", "to", "kuwait"]
example_indices = convert_token_to_indices(example_sentence, word_to_ix)
restored_example = [ix_to_word[ind] for ind in example_indices]

print(f"Original sentence is: {example_sentence}")
print(f"Going from words to indices: {example_indices}")
print(f"Going from indices to words: {restored_example}")

Original sentence is: ['we', 'always', 'come', 'to', 'kuwait']
Going from words to indices: [22, 2, 6, 20, 1]
Going from indices to words: ['we', 'always', 'come', 'to', '<unk>']


#### Simple check task-1f

In [15]:
assert ['we', 'always', 'come', 'to', 'kuwait'] == example_sentence, 'test failed!'
assert [22, 2, 6, 20, 1] == example_indices, 'test failed!'
assert ['we', 'always', 'come', 'to', '<unk>'] == restored_example, 'test failed!'
print('Pass!')

Pass!


In the example above, `kuwait` shows up as `<unk>`, because it is not included in our vocabulary. Let's convert our `train_sentences` to `example_padded_indices`.

In [16]:
# Converting our sentences to indices
example_padded_indices = [convert_token_to_indices(s, word_to_ix) for s in train_sentences]
example_padded_indices

[[22, 2, 6, 20, 15],
 [19, 16, 12, 8, 4],
 [10, 13, 11, 17],
 [9, 7, 8, 18],
 [19, 5, 14, 21, 12, 3]]
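
The index lists above have different lengths, so they cannot be stacked into a single tensor as they are. As a minimal sketch of why index `0` for `<pad>` is convenient, the block below batches them with `nn.utils.rnn.pad_sequence`, whose default `padding_value` is `0`. The indices are copied from the output above; the variable names `sentence_indices` and `batch` are assumptions made for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Indices copied from the output above (one list per training sentence)
sentence_indices = [
  [22, 2, 6, 20, 15],
  [19, 16, 12, 8, 4],
  [10, 13, 11, 17],
  [9, 7, 8, 18],
  [19, 5, 14, 21, 12, 3],
]

# pad_sequence expects a list of tensors
tensors = [torch.tensor(s, dtype=torch.long) for s in sentence_indices]

# Shorter sentences are filled with 0, which is the index of <pad>
batch = pad_sequence(tensors, batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([5, 6])
print(batch[2])     # tensor([10, 13, 11, 17,  0,  0])
```

With `batch_first=True`, each row of `batch` is one sentence, padded to the length of the longest sentence in the batch.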

Now that we have an index for each word in our vocabulary, we can create an embedding table with the `nn.Embedding` class in `PyTorch`. It is constructed as `nn.Embedding(num_words, embedding_dimension)`, where `num_words` is the number of words in our vocabulary and `embedding_dimension` is the dimension of the embeddings we want to have. There is nothing fancy about `nn.Embedding`: it is just a wrapper class around a trainable `NxE` dimensional tensor, where `N` is the number of words in our vocabulary and `E` is the number of embedding dimensions. This table is initially random, but it will change over time. As we train our network, the gradients will be backpropagated all the way to the embedding layer, and hence our word embeddings will be updated. We will initialize the embedding layer for our model inside the model itself, but we are showing an example here.

In [17]:
# Creating an embedding table for our words
embedding_dim = 5
embeds = nn.Embedding(len(vocabulary), embedding_dim)
# embeds = None #To Do: task-1g use nn.Embedding

# Printing the parameters in our embedding table
assert len(list(embeds.parameters())[0]) == 23, 'simple test failed!'
list(embeds.parameters())

[Parameter containing:
 tensor([[-6.4225e-01, -1.1382e+00,  2.4819e-01,  4.2047e-01,  5.3647e-01],
         [ 7.3446e-01,  1.5653e+00, -4.2861e-01,  5.3950e-01, -2.4080e+00],
         [-1.0467e+00,  4.5272e-04,  5.1691e-01,  9.7846e-01,  3.4240e-01],
         [ 9.7249e-01, -1.8092e+00, -9.6482e-01,  8.0514e-01, -1.2050e+00],
         [-4.2263e-01, -1.4729e+00, -6.5408e-01, -7.5241e-01,  1.5103e+00],
         [ 4.2020e-01, -2.7646e-01,  4.1917e-01, -9.5254e-01, -9.9322e-01],
         [ 1.2536e+00,  1.2376e+00,  8.3681e-01, -1.5655e+00, -3.7557e-01],
         [-5.6801e-01, -1.0017e+00, -4.0199e-01,  1.5494e+00,  1.6790e-01],
         [ 1.0326e+00,  4.4658e-01, -1.2431e+00, -1.1461e+00,  8.2310e-02],
         [-2.5752e-01, -9.1020e-01,  5.7974e-01,  1.6198e+00, -1.1124e-03],
         [ 3.5792e-01,  1.4910e-02, -1.0764e+00, -2.2492e-01, -3.8519e-01],
         [ 7.4097e-01,  1.4309e+00, -1.5491e+00, -5.7491e-01,  8.5422e-01],
         [ 4.4444e-01,  1.7229e+00,  4.9764e-01,  6.2893e-01, -6.

To get the word embedding for a word in our vocabulary, all we need to do is create a lookup tensor. The lookup tensor is just a tensor containing the index we want to look up. The `nn.Embedding` class expects an index tensor of an integer type (we use `torch.long` here), so we should create our tensor accordingly.

In [18]:
# Get the embedding for the word Paris
index = word_to_ix["paris"]
index_tensor = torch.tensor(index, dtype=torch.long)
paris_embed = embeds(index_tensor)
paris_embed

tensor([ 2.7388,  0.5130,  0.0722, -0.1011, -0.8472],
       grad_fn=<EmbeddingBackward0>)

In [19]:
# We can also get multiple embeddings at once
index_paris = word_to_ix["paris"]
index_ankara = word_to_ix["ankara"]
indices = [index_paris, index_ankara]
indices_tensor = torch.tensor(indices, dtype=torch.long)
embeddings = embeds(indices_tensor)
embeddings

tensor([[ 2.7388,  0.5130,  0.0722, -0.1011, -0.8472],
        [ 0.9725, -1.8092, -0.9648,  0.8051, -1.2050]],
       grad_fn=<EmbeddingBackward0>)
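
To illustrate the earlier claim that `nn.Embedding` is just a wrapper around a trainable `NxE` tensor, the sketch below checks that a lookup returns the same values as indexing into `embeds.weight` directly, and that gradients reach only the rows that were looked up. It builds a fresh table with an arbitrary seed, so its values will differ from the ones printed above; the indices `15` and `3` are the ones assigned to `paris` and `ankara` earlier.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility of this sketch

# 23 words in the vocabulary, 5 embedding dimensions
embeds = nn.Embedding(23, 5)
print(embeds.weight.shape)  # torch.Size([23, 5])

# Indices of "paris" (15) and "ankara" (3) in word_to_ix
indices = torch.tensor([15, 3], dtype=torch.long)
looked_up = embeds(indices)

# The lookup is equivalent to selecting rows of the weight matrix
same = torch.equal(looked_up, embeds.weight[indices])
print(same)  # True

# Backpropagate a dummy scalar and see which rows received a gradient
looked_up.sum().backward()
touched_rows = embeds.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(touched_rows)  # [3, 15]
```

Only the rows for the words that appeared in the forward pass receive a nonzero gradient, which is how the word embeddings get updated during training.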

Usually, we define the embedding layer as part of our model, which you will see in the later sections of the notebook.