<a href="https://colab.research.google.com/github/akaver/NLP2019/blob/master/Lab10_2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this lab, we will use PyTorch (and torchtext) to classify surnames by nationality using a recurrent neural network.

Use a Colab runtime with GPU so that training is faster.



In [0]:
import torch 

device = 'cpu'
if torch.cuda.is_available():
  device = torch.device('cuda')

print(device)

cuda


First, we'll download the datafiles:

In [0]:
!rm -f names_train.csv names_test.csv
!wget -q --no-check-certificate https://phon.ioc.ee/~tanela/tmp/names_train.csv
!wget -q --no-check-certificate https://phon.ioc.ee/~tanela/tmp/names_test.csv

The data files are CSV (actually, TSV -- tab-separated values) files, containing the name and the corresponding nationality.

In [0]:
!head names_train.csv

Tadhgan	Irish
Kingsley	English
Fernández	Spanish
Paterson	English
Friel	English
Bahar	Arabic
Mifsud	Arabic
Vedischev	Russian
Suchanka	Czech
Jindra	Czech
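
If you want to inspect the files without torchtext, the tab-separated layout can be read with the standard `csv` module. This is only a minimal sketch: the helper name `read_name_rows` is made up for illustration, and the sample rows are copied from the `head` output above.

```python
import csv
import io

# Sample rows copied from the `head` output above (tab-separated, no header row).
sample = "Tadhgan\tIrish\nKingsley\tEnglish\nFernández\tSpanish\n"

def read_name_rows(handle):
    """Yield (name, nationality) pairs from a tab-separated file object."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) == 2:  # skip blank or malformed lines
            yield row[0], row[1]

rows = list(read_name_rows(io.StringIO(sample)))
print(rows)
```

For the real data you would pass `open("names_train.csv", encoding="utf-8", newline="")` instead of the in-memory sample.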


Let's load the data into a torchtext dataset:

In [0]:
from torchtext import data
from torchtext import datasets

First, we define the fields. For the NAME field we use `list` as the tokenizer, so that each character of the name is treated as an individual token:

In [0]:
NAME = data.Field(tokenize=list, init_token="<bos>", eos_token="<eos>", include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)

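# Note: skip_header=True drops the first line of each file. The training file has no header row
# (see the `head` output above), so its first example (Tadhgan) is skipped.
train_dataset, test_dataset = data.TabularDataset.splits(path=".", train='names_train.csv', test='names_test.csv', format='tsv', skip_header=True, fields=[('name', NAME), ('label', LABEL)])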

In [0]:
print(len(train_dataset))
print(train_dataset[0].name, train_dataset[0].label)

8232
['K', 'i', 'n', 'g', 's', 'l', 'e', 'y'] English
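
The printout shows only the raw characters: torchtext adds the `<bos>` and `<eos>` tokens later, when examples are turned into tensors. A rough plain-Python sketch of what the NAME field does to one name (the function `tokenize_name` is a hypothetical stand-in, not part of torchtext):

```python
def tokenize_name(name, init_token="<bos>", eos_token="<eos>"):
    """Split a name into characters and wrap it in boundary tokens."""
    return [init_token] + list(name) + [eos_token]

tokens = tokenize_name("Kingsley")
print(tokens)
```

Because `list` splits a string into Unicode code points, accented letters such as the `á` in `Fernández` usually end up as single tokens, provided they are stored in precomposed form.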


Next, we can build the vocabularies for both our fields:

In [0]:
NAME.build_vocab(train_dataset)
LABEL.build_vocab(train_dataset)

Here is the mapping from nationalities to IDs. Again, the vocabulary reserves a special ID (0) for unknown labels (`<unk>`), which we don't need. Therefore, we'll subtract 1 from each label ID later when we train the model.

In [0]:
LABEL.vocab.stoi

defaultdict(<function torchtext.vocab._default_unk_index>,
            {'<unk>': 0,
             'Arabic': 3,
             'Chinese': 11,
             'Czech': 7,
             'Dutch': 9,
             'English': 1,
             'French': 10,
             'German': 5,
             'Greek': 13,
             'Irish': 12,
             'Italian': 6,
             'Japanese': 4,
             'Korean': 15,
             'Polish': 14,
             'Portuguese': 18,
             'Russian': 2,
             'Scottish': 16,
             'Spanish': 8,
             'Vietnamese': 17})
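
The IDs above appear to be assigned in order of decreasing label frequency, with `<unk>` at position 0. The sketch below imitates that with a `Counter` on a toy label list and shows the shift by one. The function `build_label_vocab` is hypothetical, and the alphabetical tie-breaking is an assumption made for this example.

```python
from collections import Counter

def build_label_vocab(labels, unk_token="<unk>"):
    """Return (itos, stoi): labels ordered by descending frequency, <unk> first."""
    counts = Counter(labels)
    # Ties are broken alphabetically here; that tie-breaking rule is an assumption.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    itos = [unk_token] + ordered
    stoi = {label: idx for idx, label in enumerate(itos)}
    return itos, stoi

toy_labels = ["English", "Russian", "English", "Arabic", "Russian", "English"]
itos, stoi = build_label_vocab(toy_labels)

# Shift by one so that class IDs start at 0 and <unk> is never a target.
targets = [stoi[label] - 1 for label in toy_labels]
print(stoi)
print(targets)
```

After the shift, the classifier needs exactly `len(itos) - 1` output units, one per real nationality.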

Next, we define the batch iterators for both our train and test data.

In [0]:
train_iter, test_iter = data.BucketIterator.splits((train_dataset, test_dataset), batch_size=32,  device=device, repeat=False)
# The following two lines are needed to work around a bug in Torchtext. It will be fixed soon (I hope)
test_iter.sort_within_batch = train_iter.sort_within_batch
test_iter.sort = train_iter.sort


The iterators return batches of data and corresponding labels, all turned into matrices.

Note that we used `include_lengths=True` when constructing the NAME field. Because of that, the `name` field of each batch includes two tensors: a 32 x `max_length` tensor of character IDs in the batch, and a vector of 32 elements whose values correspond to actual name lengths in this batch. We will need the lengths later.

In [0]:
batch = next(iter(test_iter))
print(batch)


[torchtext.data.batch.Batch of size 32]
	[.name]:('[torch.cuda.LongTensor of size 32x15 (GPU 0)]', '[torch.cuda.LongTensor of size 32 (GPU 0)]')
	[.label]:[torch.cuda.LongTensor of size 32 (GPU 0)]
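
To make the "padded matrix plus lengths" layout concrete, here is a small plain-Python sketch of the same idea. The special-token IDs (0 to 3) follow what the batches in this lab show (2 at the start, 3 at the end, 1 for padding). The helper names are hypothetical, and the character IDs are assigned alphabetically for simplicity, so they will not match the real vocabulary.

```python
SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]  # IDs 0..3

def build_char_vocab(names):
    """Map the special tokens and every character seen in `names` to an integer ID."""
    chars = sorted({ch for name in names for ch in name})
    return {tok: idx for idx, tok in enumerate(SPECIALS + chars)}

def make_batch(names, char_stoi):
    """Return (padded ID rows, true lengths) for a list of names."""
    unk = char_stoi["<unk>"]
    rows = [[char_stoi["<bos>"]] + [char_stoi.get(ch, unk) for ch in name] + [char_stoi["<eos>"]]
            for name in names]
    lengths = [len(row) for row in rows]  # lengths include <bos> and <eos>
    max_len = max(lengths)
    padded = [row + [char_stoi["<pad>"]] * (max_len - len(row)) for row in rows]
    return padded, lengths

names = ["Friel", "Bahar", "Vedischev"]
char_stoi = build_char_vocab(names)
padded, lengths = make_batch(names, char_stoi)
for row in padded:
    print(row)
print(lengths)
```

Every row has the length of the longest name in the batch, and `lengths` records where the real content of each row ends.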


In [0]:
print(batch.name)

(tensor([[ 2, 38,  6, 13, 57,  3,  1,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2, 27,  6,  9,  5,  3,  1,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2, 27,  6, 14,  9, 19,  3,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2, 40,  5, 22, 31,  4, 15,  3,  1,  1,  1,  1,  1,  1,  1],
        [ 2, 48, 22,  7,  8,  4, 13,  4,  3,  1,  1,  1,  1,  1,  1],
        [ 2, 36,  4, 12,  4, 11,  3,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2, 47, 12, 14,  4,  8,  3,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2, 24, 21, 21,  6,  8, 21,  7,  3,  1,  1,  1,  1,  1,  1],
        [ 2, 25,  7, 30, 10, 14, 16,  3,  1,  1,  1,  1,  1,  1,  1],
        [ 2, 38, 12,  5, 15, 12,  6,  5, 18,  3,  1,  1,  1,  1,  1],
        [ 2, 43, 10, 10,  6, 37,  3,  1,  1,  1,  1,  1,  1,  1,  1],
        [ 2, 36,  6, 10,  5,  8, 22, 11,  6,  6, 17,  3,  1,  1,  1],
        [ 2, 25,  7, 21, 12,  4, 11,  6, 18,  7, 21,  6, 18,  4,  3],
        [ 2, 44,  4, 11,  5,  8, 16,  7, 15,  3,  1,  1,  1,  1,  1],
        [ 2, 50, 10

Now, we can define our recurrent neural net for name classification.

In [0]:
import sys
import torch.nn as nn
import torch.nn.functional as F

In [0]:
class NamesClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, input_vocab_size, output_vocab_size, num_rnn_layers=1):
        super(NamesClassifier, self).__init__()
        
        # We start with a character embedding layer
        self.emb = nn.Embedding(input_vocab_size, embedding_dim, padding_idx=NAME.vocab.stoi["<pad>"])
        # We use the GRU recurrent layer
        # Try with LSTM instead. 
        # Also, you can experiment with bidirectional=True, but you have to multiply the input 
        # dim of the next layer with 2 then
        self.rnn = nn.GRU(embedding_dim, hidden_dim, batch_first=True, bidirectional=False, num_layers=num_rnn_layers)
        # Finally, the classification layer
        self.affine = nn.Linear(hidden_dim, output_vocab_size)
        
    
    def forward(self, x_in, sequence_lengths):
        x_embedded = self.emb(x_in)
        
        x_post_rnn, _ = self.rnn(x_embedded)

        # We take the RNN output not from the last time step of the batch (which corresponds to the
        # longest name in the batch) but from the actual last time step of each individual name.
        # It's a bit tricky :)
        last_item_indices = sequence_lengths - 1
        last_item_indices += torch.arange(0, x_in.size(0)).long().to(device) * x_in.size(1)
        x_post_rnn = x_post_rnn.contiguous().to(device)
        x_post_rnn = x_post_rnn.view(x_in.size(0) * x_in.size(1), -1)[last_item_indices]

        out = self.affine(x_post_rnn)
        return F.log_softmax(out, dim=1)
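
The index arithmetic in `forward()` can be checked on plain nested lists. The sketch below is not the model code: `last_step_outputs` is a hypothetical helper that reproduces the same "flatten, then pick row `b * max_len + length - 1`" idea without PyTorch.

```python
def last_step_outputs(outputs, lengths):
    """Pick, for each sequence, the output at its last real (non-padded) time step.

    `outputs` is a nested list shaped [batch][time][hidden]; `lengths` holds the
    true length of each sequence.
    """
    batch_size, max_len = len(outputs), len(outputs[0])
    # Flatten [batch][time] into one list of batch_size * max_len rows.
    flat = [step for sequence in outputs for step in sequence]
    # Row b, time t lives at flat index b * max_len + t.
    indices = [b * max_len + (lengths[b] - 1) for b in range(batch_size)]
    return [flat[i] for i in indices]

# Toy "RNN output": 2 sequences, 4 time steps, hidden size 1; value = 10*b + t.
outputs = [[[10 * b + t] for t in range(4)] for b in range(2)]
picked = last_step_outputs(outputs, lengths=[2, 4])
print(picked)  # [[1], [13]]
```

In PyTorch the same selection can usually be written more directly with advanced indexing, for example `x_post_rnn[torch.arange(batch_size), sequence_lengths - 1]`. Since the lengths include `<bos>` and `<eos>`, the selected step is the one where the RNN has just read `<eos>`.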


Next, the training and evaluation loops. They are almost identical to the ones used in the last lab. The only difference is that here we also take care of sending the name lengths to the `forward()` function.

In [0]:
def train(model, num_epochs, train_iter, dev_iter):

  optimizer = torch.optim.Adam(model.parameters())

  best_acc = 0
  last_step = 0
  for epoch in range(1, num_epochs+1):
    print("Epoch %d" % epoch)
    model.train()
    for batch in train_iter:
      (text, text_length), target = batch.name, batch.label

      # subtract one from label ID because we don't have <unk> labels
      target -= 1

      optimizer.zero_grad()
      output = model(text, sequence_lengths=text_length)

      loss = F.nll_loss(output, target)
      loss.backward()
      optimizer.step()


    train_acc = evaluate("train", train_iter, model)                
    dev_acc = evaluate("dev", dev_iter, model)

def evaluate(dataset_name, data_iter, model):
  
  model.eval()
  total_corrects, avg_loss = 0, 0
  with torch.no_grad():
    for batch in data_iter:
      (text, text_length), target = batch.name, batch.label


      # subtract one from label ID because we don't have <unk> labels
      target -= 1
      output = model(text, sequence_lengths=text_length)
      # Sum the loss over the batch (size_average=False is deprecated in favour of reduction='sum')
      loss = F.nll_loss(output, target, reduction='sum')
      pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
      correct = pred.eq(target.view_as(pred)).sum().item()

      # Accumulate a plain Python float rather than a tensor
      avg_loss += loss.item()
      total_corrects += correct


  size = len(data_iter.dataset)
  avg_loss /= size
  accuracy = 100.0 * total_corrects/size
  print('  Evaluation on {} - loss: {:.6f}  acc: {:.4f}%({}/{})'.format(dataset_name,
                                                                     avg_loss, 
                                                                     accuracy, 
                                                                     total_corrects, 
                                                                     size))
  return accuracy                
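
To see what the numbers printed by `evaluate()` mean, here is a small plain-Python sketch that computes log-softmax, the negative log-likelihood loss, and accuracy for three made-up score vectors. The scores and targets are invented for illustration, and `evaluate_rows` is a hypothetical helper, not the function defined above.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def evaluate_rows(score_rows, targets):
    """Return (mean NLL loss, accuracy in percent) over all examples."""
    total_loss, correct = 0.0, 0
    for scores, target in zip(score_rows, targets):
        log_probs = log_softmax(scores)
        total_loss += -log_probs[target]  # summed over examples, divided once at the end
        predicted = max(range(len(log_probs)), key=log_probs.__getitem__)
        correct += int(predicted == target)
    size = len(targets)
    return total_loss / size, 100.0 * correct / size

# Made-up scores for 3 examples and 3 classes.
score_rows = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5], [1.0, 1.2, 0.3]]
targets = [0, 2, 0]
mean_loss, accuracy = evaluate_rows(score_rows, targets)
print(round(accuracy, 2))  # 66.67
```

Summing the loss and dividing by the dataset size once gives the exact per-example average even when the last batch is smaller than the others.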



Now we can train our model:

In [0]:
model = NamesClassifier(100, 100, len(NAME.vocab), len(LABEL.vocab) - 1, num_rnn_layers=1).to(device)
train(model, 10, train_iter, test_iter)

Epoch 1




  Evaluation on train - loss: 1.171059  acc: 67.1526%(5528/8232)
  Evaluation on dev - loss: 1.208234  acc: 66.2782%(1820/2746)
Epoch 2
  Evaluation on train - loss: 0.930593  acc: 72.4733%(5966/8232)
  Evaluation on dev - loss: 1.013790  acc: 70.5390%(1937/2746)
Epoch 3
  Evaluation on train - loss: 0.777180  acc: 76.6642%(6311/8232)
  Evaluation on dev - loss: 0.897149  acc: 73.7800%(2026/2746)
Epoch 4
  Evaluation on train - loss: 0.665004  acc: 79.6890%(6560/8232)
  Evaluation on dev - loss: 0.841849  acc: 75.4552%(2072/2746)
Epoch 5
  Evaluation on train - loss: 0.590366  acc: 82.3494%(6779/8232)
  Evaluation on dev - loss: 0.812068  acc: 74.9454%(2058/2746)
Epoch 6
  Evaluation on train - loss: 0.505958  acc: 84.8518%(6985/8232)
  Evaluation on dev - loss: 0.780005  acc: 76.8390%(2110/2746)
Epoch 7
  Evaluation on train - loss: 0.460340  acc: 85.8722%(7069/8232)
  Evaluation on dev - loss: 0.787504  acc: 76.7662%(2108/2746)
Epoch 8
  Evaluation on train - loss: 0.405871  acc: 87.

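OK, so after a handful of epochs the model reaches roughly 76-77% accuracy on the test data (the "dev" lines above are computed on the test set), while the training accuracy keeps climbing, which is a first sign of overfitting.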

Experiment with the hyperparameters (embedding size, number of units in the hidden RNN layer, number of RNN layers, etc.), try adding dropout, and see if you can increase the accuracy.

It would also be nice if we could apply the trained model to new data, one name at a time, so that we could play with the model. Let's do that.

First, we have to convert the user-given name to a tensor. We can use the `process()` method of the corresponding field for that:

In [0]:
print(NAME.process(["Alumäe"], device=device))

(tensor([[ 2, 24, 11, 14, 17, 59,  5,  3]], device='cuda:0'), tensor([8], device='cuda:0'))


Now, creating a nice classify function is easy:

In [0]:
def classify(model, name):
  model.eval()
  name_var, length_var = NAME.process([name], device=device)
  # The model returns log-probabilities (log_softmax); no gradients are needed for inference
  with torch.no_grad():
    log_probs = model(name_var, sequence_lengths=length_var)
  # Select the index with the maximum value
  predicted_label_id = torch.argmax(log_probs, dim=1).item()
  # Add 1 to the index before looking up the corresponding label (because of the <unk> thing)
  return LABEL.vocab.itos[predicted_label_id + 1]
  

Remember, our model only recognizes the nationalities that are in its label vocabulary. All names will be mapped to one of them. And of course, the model makes mistakes.

In [0]:
LABEL.vocab.itos

['<unk>',
 'English',
 'Russian',
 'Arabic',
 'Japanese',
 'German',
 'Italian',
 'Czech',
 'Spanish',
 'Dutch',
 'French',
 'Chinese',
 'Irish',
 'Greek',
 'Polish',
 'Korean',
 'Scottish',
 'Vietnamese',
 'Portuguese']

In [0]:
classify(model, "Trump")

'English'

In [0]:
classify(model, "Putin")

'Russian'

In [0]:
classify(model, "Merkel")

'German'

In [0]:
classify(model, "Macron")

'English'

In [0]:
classify(model, "Berlusconi")

'Italian'

In [0]:
classify(model, "Ronaldo")

'Italian'

In [0]:
classify(model, "Neumannova")

'Czech'

In [0]:
classify(model, "Panathinaikos")

'Greek'

In [0]:
classify(model, "Minh")

'Vietnamese'
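
A single predicted label hides how confident the model is. Since the model outputs log-probabilities, one possible extension is to list the few most likely nationalities with their probabilities. The sketch below works on a plain list of log-probabilities. The function `top_k` and the numbers are hypothetical and are not real model output.

```python
import math

def top_k(log_probs, itos, k=3):
    """Return the k most probable (label, probability) pairs.

    `log_probs` holds one log-probability per class; class i corresponds to
    itos[i + 1] because itos[0] is the unused <unk> entry.
    """
    ranked = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)
    return [(itos[i + 1], math.exp(log_probs[i])) for i in ranked[:k]]

# Hypothetical values for illustration only, not real model output.
itos = ["<unk>", "English", "Russian", "Arabic"]
log_probs = [math.log(0.6), math.log(0.3), math.log(0.1)]
result = top_k(log_probs, itos, k=2)
for label, prob in result:
    print(f"{label}: {prob:.2f}")
```

With the trained model, the input would be the first row of the model output converted to a list, and `itos` would be `LABEL.vocab.itos`.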