In [None]:
%pip install wandb

Note: you may need to restart the kernel to use updated packages.


In [None]:
import wandb

wandb.login()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33martgoldman[0m (use `wandb login --relogin` to force relogin)


True

![](https://i.imgur.com/eBRPvWB.png)

# Practical PyTorch: Classifying Names with a Character-Level RNN

We will be building and training a basic character-level RNN to classify words. A character-level RNN reads words as a series of characters - outputting a prediction and "hidden state" at each step, feeding its previous hidden state into each next step. We take the final prediction to be the output, i.e. which class the word belongs to.

Specifically, we'll train on a few thousand surnames from 18 languages of origin, and predict which language a name is from based on the spelling:

```
$ python predict.py Hinton
(-0.47) Scottish
(-1.52) English
(-3.57) Irish

$ python predict.py Schmidhuber
(-0.19) German
(-2.48) Czech
(-2.68) Dutch
```

# Preparing the Data

Included in the `data/names` directory are 18 text files named as "[Language].txt". Each file contains a bunch of names, one name per line, mostly romanized (but we still need to convert from Unicode to ASCII).

We'll end up with a dictionary of lists of names per language, `{language: [names ...]}`. The generic variables "category" and "line" (for language and name in our case) are used for later extensibility.

In [None]:
!git clone https://github.com/spro/practical-pytorch

fatal: destination path 'practical-pytorch' already exists and is not an empty directory.


In [None]:
import glob

all_filenames = glob.glob('./practical-pytorch/data/names/*.txt')
print(all_filenames)

['./practical-pytorch/data/names/Scottish.txt', './practical-pytorch/data/names/Dutch.txt', './practical-pytorch/data/names/Czech.txt', './practical-pytorch/data/names/Vietnamese.txt', './practical-pytorch/data/names/Polish.txt', './practical-pytorch/data/names/Spanish.txt', './practical-pytorch/data/names/English.txt', './practical-pytorch/data/names/Arabic.txt', './practical-pytorch/data/names/Korean.txt', './practical-pytorch/data/names/Japanese.txt', './practical-pytorch/data/names/Irish.txt', './practical-pytorch/data/names/Greek.txt', './practical-pytorch/data/names/Italian.txt', './practical-pytorch/data/names/Portuguese.txt', './practical-pytorch/data/names/Chinese.txt', './practical-pytorch/data/names/French.txt', './practical-pytorch/data/names/Russian.txt', './practical-pytorch/data/names/German.txt']


In [None]:
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicode_to_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicode_to_ascii('Ślusàrski'))

Slusarski


In [None]:
# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    # Read as UTF-8 explicitly and close the file when done
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicode_to_ascii(line) for line in lines]

for filename in all_filenames:
    category = filename.split('/')[-1].split('.')[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)
print('n_categories =', n_categories)

n_categories = 18


Now we have `category_lines`, a dictionary mapping each category (language) to a list of lines (names). We also kept track of `all_categories` (just a list of languages) and `n_categories` for later reference.

In [None]:
print(category_lines['Italian'][:5])

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']


# Turning Names into Tensors

Now that we have all the names organized, we need to turn them into Tensors to make any use of them.

To represent a single letter, we use a "one-hot vector" of size `<1 x n_letters>`. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. `"b" = <0 1 0 0 0 ...>`.

To make a word we join a bunch of those into a 3D tensor of size `<line_length x 1 x n_letters>`.

That extra 1 dimension is because PyTorch assumes everything is in batches - we're just using a batch size of 1 here.
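
As a cross-check on these shapes, here is a minimal sketch that builds the same `<line_length x 1 x n_letters>` layout with `torch.nn.functional.one_hot`. The helper name `line_to_one_hot` is hypothetical and not part of the original notebook, and the sketch assumes every character of the input is in the alphabet.

```python
import string

import torch
import torch.nn.functional as F

all_letters = string.ascii_letters + " .,;'"  # same alphabet as in the notebook
n_letters = len(all_letters)                   # 57

def line_to_one_hot(line):
    # Map each character to its index in the alphabet.
    # Assumes every character is in all_letters (str.index raises ValueError otherwise).
    indices = torch.tensor([all_letters.index(ch) for ch in line], dtype=torch.long)
    # one_hot gives <line_length x n_letters>; unsqueeze(1) adds the batch dimension of 1.
    return F.one_hot(indices, num_classes=n_letters).float().unsqueeze(1)

tensor = line_to_one_hot("Jones")
print(tensor.shape)  # torch.Size([5, 1, 57])
```

Each row holds exactly one 1, so `tensor.sum()` equals the length of the name.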

In [None]:
import torch

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letter_to_tensor(letter):
    tensor = torch.zeros(1, n_letters)
    letter_index = all_letters.find(letter)
    tensor[0][letter_index] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def line_to_tensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        letter_index = all_letters.find(letter)
        tensor[li][0][letter_index] = 1
    return tensor

In [None]:
print(letter_to_tensor('J'))

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])


In [None]:
print(line_to_tensor('Jones').size())

torch.Size([5, 1, 57])


# Creating the Network

Before autograd, creating a recurrent neural network in Torch involved cloning the parameters of a layer over several timesteps. The layers held hidden state and gradients which are now entirely handled by the graph itself. This means you can implement an RNN in a very "pure" way, as regular feed-forward layers.

This RNN module (mostly copied from [the PyTorch for Torch users tutorial](https://github.com/pytorch/tutorials/blob/master/Introduction%20to%20PyTorch%20for%20former%20Torchies.ipynb)) is just 2 linear layers which operate on an input and hidden state, with a LogSoftmax layer after the output.
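
Written out, one step of this module computes the following. The symbols are notation introduced here, not taken from the original tutorial: $x_t$ is the one-hot input letter, $h_{t-1}$ the previous hidden state, $c_t$ their concatenation, and $W_{ih}, b_{ih}$ and $W_{io}, b_{io}$ the weights and biases of the two linear layers (`i2h` and `i2o` in the code). Note that, as written, the hidden state passes through no nonlinearity.

$$
\begin{aligned}
c_t &= [\,x_t \,;\, h_{t-1}\,] \\
h_t &= W_{ih}\, c_t + b_{ih} \\
o_t &= \log \operatorname{softmax}\left(W_{io}\, c_t + b_{io}\right)
\end{aligned}
$$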

![](https://i.imgur.com/Z2xbySO.png)

In [None]:
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=-1)
    
    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(self):
        return Variable(torch.zeros(1, self.hidden_size))

## Manually testing the network

With our custom `RNN` class defined, we can create a new instance:

In [None]:
n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

In [None]:
sum([p.numel() for p in rnn.parameters()])

27156
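
The count above can be reproduced by hand. This small sketch assumes only the sizes already printed in the notebook (57 letters, 128 hidden units, 18 categories) and the fact that an `nn.Linear(in_features, out_features)` layer holds `in_features * out_features` weights plus `out_features` biases; `nn.LogSoftmax` has no parameters.

```python
n_letters, n_hidden, n_categories = 57, 128, 18  # sizes printed earlier in the notebook

fan_in = n_letters + n_hidden                        # 185: input concatenated with hidden state
i2h_params = fan_in * n_hidden + n_hidden            # 23808 weights + biases
i2o_params = fan_in * n_categories + n_categories    # 3348 weights + biases

total = i2h_params + i2o_params
print(total)  # 27156
```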

To run a step of this network we need to pass an input (in our case, the Tensor for the current letter) and a previous hidden state (which we initialize as zeros at first). We'll get back the output (the log-probability of each language) and a next hidden state (which we keep for the next step).

Note that since PyTorch 0.4 `Variable` and `Tensor` are merged: modules operate directly on Tensors, and wrapping a Tensor in `Variable` (as the code here still does) simply returns a Tensor.

In [None]:
input = Variable(letter_to_tensor('A'))
hidden = rnn.init_hidden()

output, next_hidden = rnn(input, hidden)
print('output.size =', output.size())

output.size = torch.Size([1, 18])


For the sake of efficiency we don't want to be creating a new Tensor for every step, so we will use `line_to_tensor` instead of `letter_to_tensor` and use slices. This could be further optimized by pre-computing batches of Tensors.

In [None]:
input = Variable(line_to_tensor('Albert'))
hidden = Variable(torch.zeros(1, n_hidden))

output, next_hidden = rnn(input[0], hidden)
print(output)

tensor([[-2.8821, -2.8770, -2.8620, -2.8234, -3.0120, -2.8455, -2.8265, -3.0059,
         -2.8283, -2.9411, -2.9649, -2.9221, -2.7667, -2.8600, -2.8700, -2.9021,
         -2.9036, -2.9715]], grad_fn=<LogSoftmaxBackward0>)


As you can see, the output is a `<1 x n_categories>` Tensor, where every item is the log-probability of that category (higher is more likely).
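
To read these numbers as probabilities, exponentiate them. The sketch below copies the values from the printed output above (rounded to four decimals, so the sum is only approximately 1). For an untrained network they sit near the uniform value log(1/18) ≈ -2.89.

```python
import torch

# Values copied from the printed output above (rounded to 4 decimals).
log_probs = torch.tensor([[
    -2.8821, -2.8770, -2.8620, -2.8234, -3.0120, -2.8455, -2.8265, -3.0059,
    -2.8283, -2.9411, -2.9649, -2.9221, -2.7667, -2.8600, -2.8700, -2.9021,
    -2.9036, -2.9715,
]])

probs = log_probs.exp()            # LogSoftmax output -> probabilities
print(probs.sum().item())          # roughly 1.0
print(probs.argmax(dim=1).item())  # 12, the index of the largest log-probability
```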

# Preparing for Training

Before going into training we should make a few helper functions. The first interprets the output of the network, which we know to be a log-probability for each category. We can use `Tensor.topk` to get the index of the greatest value:

In [None]:
def category_from_output(output):
    top_n, top_i = output.data.topk(1) # Tensor out of Variable with .data
    category_i = top_i[0][0]
    return all_categories[category_i], category_i

print(category_from_output(output))

('Italian', tensor(12))
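
The same `topk` call can return several candidates at once, which is how a listing like the `predict.py` output at the top could be produced. This is a sketch only: `top_predictions`, the category list, and the scores are hypothetical and chosen for illustration.

```python
import torch

# Hypothetical categories and scores, for illustration only.
categories = ["Czech", "German", "Dutch", "Irish"]
log_probs = torch.tensor([[-2.48, -0.19, -2.68, -4.10]])

def top_predictions(output, categories, n=3):
    # topk returns the n largest values and their indices along the given dimension.
    top_values, top_indices = output.topk(n, dim=1)
    return [
        (top_values[0][i].item(), categories[top_indices[0][i].item()])
        for i in range(n)
    ]

predictions = top_predictions(log_probs, categories)
for value, name in predictions:
    print("(%.2f) %s" % (value, name))
```

Calling `.item()` on the index converts the zero-dimensional tensor to a plain Python `int` before it is used to index the list.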


We will also want a quick way to get a training example (a name and its language):

In [None]:
import random

def random_training_pair():                                                                                                               
    category = random.choice(all_categories)
    line = random.choice(category_lines[category])
    category_tensor = Variable(torch.LongTensor([all_categories.index(category)]))
    line_tensor = Variable(line_to_tensor(line))
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = random_training_pair()
    print("~~~")
    print('category =', category, '/ line =', line)
    print('category tensor =', category_tensor, '/ line tensor =', line_tensor.size())

~~~
category = Italian / line = Taverna
category tensor = tensor([12]) / line tensor = torch.Size([7, 1, 57])
~~~
category = Russian / line = Gribkov
category tensor = tensor([16]) / line tensor = torch.Size([7, 1, 57])
~~~
category = Irish / line = Cathasach
category tensor = tensor([10]) / line tensor = torch.Size([9, 1, 57])
~~~
category = Japanese / line = Takagaki
category tensor = tensor([9]) / line tensor = torch.Size([8, 1, 57])
~~~
category = Spanish / line = Rojas
category tensor = tensor([5]) / line tensor = torch.Size([5, 1, 57])
~~~
category = Russian / line = Evdokimov
category tensor = tensor([16]) / line tensor = torch.Size([9, 1, 57])
~~~
category = Japanese / line = Godo
category tensor = tensor([9]) / line tensor = torch.Size([4, 1, 57])
~~~
category = Russian / line = Verstakov
category tensor = tensor([16]) / line tensor = torch.Size([9, 1, 57])
~~~
category = Korean / line = Ryom
category tensor = tensor([8]) / line tensor = torch.Size([4, 1, 57])
~~~
category = V

In [None]:
import random

class NamesDataset(torch.utils.data.Dataset):
    def __init__(self, split, percent=0.8, shuffle=False):
        self.all_items = []
        if split == "train":
            for cat in all_categories:
                cat_len = len(category_lines[cat])
                for names in category_lines[cat][:int(cat_len*percent)]:
                    self.all_items.append([names, cat])
        else:
            for cat in all_categories:
                cat_len = len(category_lines[cat])
                for names in category_lines[cat][int(cat_len*percent):]:
                    self.all_items.append([names, cat])
        self.idxs = list(range(0, len(self.all_items)))
        if shuffle:
            random.seed(42)
            random.shuffle(self.idxs)
    
    def __len__(self):
        return len(self.all_items)

    def __getitem__(self, idx):
        # Returns (name, category, one-hot name tensor, category index tensor)
        name, category = self.all_items[self.idxs[idx]]
        line_tensor = line_to_tensor(name)
        category_tensor = torch.LongTensor([all_categories.index(category)])
        return name, category, line_tensor, category_tensor


In [None]:
train_dset = NamesDataset("train", 0.8, True)
val_dset = NamesDataset("val", 0.8, True)
print(len(train_dset), len(val_dset))

16034 4016
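
One caveat: `NamesDataset` takes the first `percent` of each category's list for training and the remainder for validation, in file order. If the name files are sorted alphabetically (the Italian sample printed earlier suggests so, but this is an assumption worth checking), the validation set contains only names from the end of the alphabet. A possible alternative, sketched here with a hypothetical helper `split_category_lines` and toy data, shuffles each category before slicing:

```python
import random

def split_category_lines(category_lines, percent=0.8, seed=42):
    # Shuffle each category's names with a fixed seed before slicing,
    # so both splits draw from the whole list rather than a prefix and a suffix.
    rng = random.Random(seed)
    train_items, val_items = [], []
    for category, lines in category_lines.items():
        shuffled = list(lines)  # copy, leaving the original list untouched
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * percent)
        train_items += [(name, category) for name in shuffled[:cut]]
        val_items += [(name, category) for name in shuffled[cut:]]
    return train_items, val_items

# Toy data, for illustration only.
toy = {"A": ["a%d" % i for i in range(10)], "B": ["b%d" % i for i in range(5)]}
train_items, val_items = split_category_lines(toy)
print(len(train_items), len(val_items))  # 12 3
```

The split stays per category, so every language keeps the same train/validation ratio as in the original class.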


In [None]:
"""
from torch.utils.data import DataLoader

def collate_batch(batch):
    label_list, text_list = [], []
    for (_text, category) in batch:
         label_list.append(all_categories.index(category))
         text_list.append(line_to_tensor(_text))
    label_list = torch.LongTensor(label_list)
    return label_list, text_list

train_loader = DataLoader(train_dset, batch_size=8, shuffle=True, collate_fn=collate_batch)

lab, txt = next(iter(train_loader))
print(lab)
print(lab.size())
print(txt.size())
""";

# Training the Network

Now all it takes to train this network is to show it a bunch of examples, have it make guesses, and tell it if it's wrong.

For the loss function, [`nn.NLLLoss`](http://pytorch.org/docs/nn.html#nllloss) is appropriate, since the last layer of the RNN is `nn.LogSoftmax`.
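
To see why the two go together: `nn.NLLLoss` applied to `nn.LogSoftmax` output gives the same value as `nn.CrossEntropyLoss` applied to the raw scores. The sketch below checks this on hypothetical random scores (the tensor shapes and target indices are made up for illustration).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 18)              # hypothetical raw scores: 4 names, 18 categories
targets = torch.tensor([0, 5, 12, 17])   # hypothetical class indices

nll = nn.NLLLoss()(nn.LogSoftmax(dim=-1)(logits), targets)
ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(nll, ce))  # True
```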

We will also create an "optimizer" which updates the parameters of our model according to its gradients. We will use the vanilla SGD algorithm with a low learning rate.
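
For intuition, plain SGD (no momentum or weight decay) replaces each parameter `p` with `p - lr * p.grad`. The following sketch compares one `torch.optim.SGD` step with that manual update on a toy parameter and loss, both invented for illustration.

```python
import torch

torch.manual_seed(0)
lr = 0.005
w_auto = torch.randn(3, requires_grad=True)              # updated by the optimizer
w_manual = w_auto.detach().clone().requires_grad_(True)  # updated by hand

opt = torch.optim.SGD([w_auto], lr=lr)

# Same toy loss on both copies, so both receive the same gradient.
for w in (w_auto, w_manual):
    loss = (w ** 2).sum()
    loss.backward()

opt.step()
with torch.no_grad():
    w_manual -= lr * w_manual.grad  # the vanilla SGD update rule

print(torch.allclose(w_auto, w_manual))  # True
```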

Each loop of training will:

* Fetch the input and target tensors
* Zero the gradients and create a zeroed initial hidden state
* Read each letter in, keeping the hidden state for the next letter
* Compare the final output to the target
* Back-propagate
* Update the parameters with the optimizer
* Record the loss and whether the guess was correct

In [None]:
def train(rnn, dset, opt, to_shuffle=False):
    rnn.train()
    tot_loss, tot_acc = [], []
    idx_map = list(range(0, len(dset)))
    if to_shuffle:
        random.shuffle(idx_map)
    for i in range(len(dset)):
        _, true_label, line_tensor, category_tensor = dset[idx_map[i]]
        rnn.zero_grad()
        hidden = rnn.init_hidden()
        
        # Use a separate loop variable so the outer index i is not shadowed
        for t in range(line_tensor.size()[0]):
            output, hidden = rnn(line_tensor[t], hidden)

        loss = criterion(output, category_tensor)
        loss.backward()
        tot_loss.append(loss.item())

        opt.step()

        guess, guess_i = category_from_output(output)
        tot_acc.append(guess == true_label)

    return sum(tot_acc)/len(tot_acc), sum(tot_loss)/len(tot_loss)

def validate(rnn, dset):
    rnn.eval()
    tot_loss, tot_acc = [], []
    with torch.no_grad():
        for i in range(len(dset)):
            _, true_label, line_tensor, category_tensor = dset[i]
            hidden = rnn.init_hidden()
            
            # Use a separate loop variable so the outer index i is not shadowed
            for t in range(line_tensor.size()[0]):
                output, hidden = rnn(line_tensor[t], hidden)

            loss = criterion(output, category_tensor)
            tot_loss.append(loss.item())

            guess, guess_i = category_from_output(output)
            tot_acc.append(guess == true_label)

    return sum(tot_acc)/len(tot_acc), sum(tot_loss)/len(tot_loss)

@torch.no_grad()
def get_grad_norm(model, norm_type=2):
    parameters = model.parameters()
    if isinstance(parameters, torch.Tensor):
        parameters = [parameters]
    parameters = [p for p in parameters if p.grad is not None]
    total_norm = torch.norm(
        torch.stack(
            [torch.norm(p.grad.detach(), norm_type).cpu() for p in parameters]
        ),
        norm_type,
    )
    return total_norm.item()
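
`get_grad_norm` takes the norm of the per-parameter gradient norms. For the 2-norm this equals the norm of all gradients flattened into a single vector, which the sketch below checks on a hypothetical tiny model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                      # hypothetical tiny model
model(torch.randn(2, 4)).sum().backward()    # populate .grad with a toy loss

grads = [p.grad for p in model.parameters() if p.grad is not None]

# Norm of per-parameter norms, as in get_grad_norm.
norm_of_norms = torch.norm(torch.stack([torch.norm(g, 2) for g in grads]), 2)
# Norm of every gradient entry concatenated into one vector.
flat_norm = torch.cat([g.flatten() for g in grads]).norm(2)

print(torch.allclose(norm_of_norms, flat_norm))  # True
```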

Now we just have to run that for a number of epochs. `train` and `validate` each make one full pass over their dataset and return the mean accuracy and mean loss, which we log to Weights & Biases once per epoch, together with the gradient norm left over from the last training example.

In [None]:
import time
import math
from tqdm import tqdm



def time_since(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


In [None]:

n_epochs = 15
# Keep track of losses for plotting
current_loss = 0
all_losses = []

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

criterion = nn.NLLLoss()
learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)


start = time.time()
random.seed(74)
wandb.init(project="FullGD-HSE-RS", name = 'SGD_name_clf')
for epoch in tqdm(range(1, n_epochs + 1)):
    # One pass over the training set, with a parameter update after every name
    train_acc, train_loss = train(rnn, train_dset, optimizer, to_shuffle=True)
    #print("Train. Epoch:{}".format(epoch), train_acc, train_loss)
    gr_norm = get_grad_norm(rnn)
    val_acc, val_loss = validate(rnn, val_dset)
    #print("Val. Epoch:{}".format(epoch), val_acc, val_loss)
    wandb.log({"grad_norm": gr_norm, "train_loss": train_loss, "train_acc": train_acc,
               "val_loss": val_loss, "val_acc": val_acc})
wandb.finish()

[34m[1mwandb[0m: Currently logged in as: [33martgoldman[0m (use `wandb login --relogin` to force relogin)


Train. Epoch:1 0.5817637520269427 1.4325029731270245
Val. Epoch:1 0.6446713147410359 1.2376231070510733
Train. Epoch:2 0.67718597979294 1.1163320765908649
Val. Epoch:2 0.6469123505976095 1.172873374953726
Train. Epoch:3 0.7059997505301235 1.0077862262412798
Val. Epoch:3 0.6568725099601593 1.108074760873084
Train. Epoch:4 0.7215292503430211 0.945529555651628
Val. Epoch:4 0.6536354581673307 1.1424193425953715
Train. Epoch:5 0.7333790694773606 0.9069633339827332
Val. Epoch:5 0.6685756972111554 1.1032542189299768
Train. Epoch:6 0.7356866658351005 0.8796857736479473
Val. Epoch:6 0.6700697211155379 1.0913668926372324
Train. Epoch:7 0.7394287139827865 0.8589963031818836
Val. Epoch:7 0.6750498007968128 1.0379622985193777
Train. Epoch:8 0.7462267681177498 0.844160728951868
Val. Epoch:8 0.6807768924302788 1.06311922014982
Train. Epoch:9 0.7482848945989772 0.8314768199335949
Val. Epoch:9 0.6835159362549801 1.075138739001542
Train. Epoch:10 0.751216165647998 0.8150038480760797
Val. Epoch:10 0.6862


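Run history:

| metric | history (sparkline) |
|---|---|
| grad_norm | ▃▁█▇▄▄▂▃▁▅▅█▁▃█ |
| train_acc | ▁▅▆▇▇▇▇████████ |
| train_loss | █▅▃▃▂▂▂▂▂▁▁▁▁▁▁ |
| val_acc | ▁▁▃▂▄▄▅▆▆▆▃▄▆██ |
| val_loss | █▆▄▅▄▄▂▃▃▂▇▅▂▁▂ |

Run summary:

| metric | final value |
|---|---|
| grad_norm | 13.85876 |
| train_acc | 0.75776 |
| train_loss | 0.77932 |
| val_acc | 0.69397 |
| val_loss | 1.04287 |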


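# Full GD

Instead of updating after every name, `train_full` accumulates gradients over the whole training set and takes a single optimizer step per epoch (full-batch gradient descent). The per-name losses are not averaged, so the accumulated gradient is a sum over all training names; `grad_clip` rescales it before the step.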

In [None]:
def train_full(rnn, dset, opt, to_shuffle=False, grad_clip=None):
    rnn.train()
    tot_loss, tot_acc = [], []
    idx_map = list(range(0, len(dset)))
    if to_shuffle:
        random.shuffle(idx_map)
    rnn.zero_grad()
    for i in range(len(dset)):
        _, true_label, line_tensor, category_tensor = dset[idx_map[i]]
        
        hidden = rnn.init_hidden()
        
        # Use a separate loop variable so the outer index i is not shadowed
        for t in range(line_tensor.size()[0]):
            output, hidden = rnn(line_tensor[t], hidden)

        loss = criterion(output, category_tensor)
        loss.backward()
        tot_loss.append(loss.item())

        

        guess, guess_i = category_from_output(output)
        tot_acc.append(guess == true_label)
    if grad_clip is not None:
        torch.nn.utils.clip_grad_norm_(rnn.parameters(), grad_clip)
    opt.step()
    return sum(tot_acc)/len(tot_acc), sum(tot_loss)/len(tot_loss)
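
The mechanism `train_full` relies on is that repeated `backward()` calls without `zero_grad()` add their gradients together, so the result equals the gradient of the summed loss. The sketch below checks this on a hypothetical tiny model with toy data, then applies `clip_grad_norm_` to show that the total gradient norm ends up at most `max_norm`. Because the sum grows with the number of samples, clipping is likely what keeps the step size manageable here; that reading is an interpretation, not something the notebook states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)            # hypothetical tiny model
xs = torch.randn(8, 4)             # 8 toy samples
ys = torch.randint(0, 2, (8,))     # toy class labels
criterion = nn.CrossEntropyLoss()

# Accumulate: one backward() per sample, no zero_grad() in between.
model.zero_grad()
for x, y in zip(xs, ys):
    criterion(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
accumulated = model.weight.grad.clone()

# Reference: gradient of the loss summed over the whole batch.
model.zero_grad()
nn.CrossEntropyLoss(reduction="sum")(model(xs), ys).backward()
same = torch.allclose(accumulated, model.weight.grad, atol=1e-5)
print(same)  # True

# Clipping rescales all gradients so their total 2-norm is at most max_norm.
max_norm = 0.5
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
total = torch.cat([p.grad.flatten() for p in model.parameters()]).norm(2)
print(total.item() <= max_norm + 1e-4)  # True
```

Dividing each per-sample loss by the dataset size before `backward()` would give the mean gradient instead of the sum.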

In [None]:
n_epochs = 100
# Keep track of losses for plotting
current_loss = 0
all_losses = []

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_categories)

criterion = nn.NLLLoss()
learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)


start = time.time()
random.seed(74)
wandb.init(project="FullGD-HSE-RS", name = 'FullGD_name_clf')
for epoch in tqdm(range(1, n_epochs + 1)):
    # One pass over the training set, with a single clipped update at the end
    train_acc, train_loss = train_full(rnn, train_dset, optimizer, to_shuffle=True, grad_clip=200)
    #print("Train. Epoch:{}".format(epoch), train_acc, train_loss)
    gr_norm = get_grad_norm(rnn)
    val_acc, val_loss = validate(rnn, val_dset)
    #print("Val. Epoch:{}".format(epoch), val_acc, val_loss)
    wandb.log({"grad_norm": gr_norm, "train_loss": train_loss, "train_acc": train_acc,
               "val_loss": val_loss, "val_acc": val_acc})
wandb.finish()

100%|██████████| 100/100 [20:32<00:00, 12.32s/it]



Run history:

| metric | history (sparkline) |
|---|---|
| grad_norm | ▄▆▂▆▃▂▃▄▅▅▄▄▅▅▄▂▂▅▆▃▄▁▃▅▅▂▃▅▄▃▄▃▅▄▆▁▇▅▂█ |
| train_acc | ▁▆▆▅▆▆▇▇▇▇▇▇▇▇▇▇███▇███▇████████▇███████ |
| train_loss | ▅█▃▃▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| val_acc | ▃▃▃▃▅▆▄▄▅▄▇▆▁█▆▆▇▃█▇▇▇▇▇▇▇█▇▇▇█▆█▇█▇▇█▇█ |
| val_loss | █▇▅▅▄▄▄▃▄▃▂▂▇▂▃▂▂▆▂▃▂▂▂▁▁▁▁▂▁▁▁▃▁▁▁▂▁▁▂▁ |

Run summary:

| metric | final value |
|---|---|
| grad_norm | 200.00041 |
| train_acc | 0.65779 |
| train_loss | 1.10358 |
| val_acc | 0.65588 |
| val_loss | 1.15978 |
