
# Improving training

In this lab, we will learn a few techniques to improve training neural nets. Most of the methods are not specific to speech and language, but can be used for any task where DNNs are used.

The methods that will be covered are:

*   Residual connections in DNNs
*   Label smoothing
*   Learning rate scheduling

We will use the accent identification task as an example: we will use PyTorch to classify English speech segments according to the speaker's native accent.

Let's first download the (toy) data.




In [0]:
!rm -f accent_subset_wav.zip
!wget https://phon.ioc.ee/~tanela/accent_subset_wav.zip

--2019-04-30 10:06:32--  https://phon.ioc.ee/~tanela/accent_subset_wav.zip
Resolving phon.ioc.ee (phon.ioc.ee)... 193.40.251.126
Connecting to phon.ioc.ee (phon.ioc.ee)|193.40.251.126|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 503722423 (480M) [application/zip]
Saving to: ‘accent_subset_wav.zip’


2019-04-30 10:07:14 (11.7 MB/s) - ‘accent_subset_wav.zip’ saved [503722423/503722423]



The data consists of wav files, and the filename indicates the speaker's accent. The data originates from the Speech Accent Archive of George Mason University (http://accent.gmu.edu/).
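
Since the accent is encoded in the filename, it can be recovered by stripping the extension and the trailing speaker number. The following is a minimal sketch of that idea; the helper name `accent_of` is hypothetical and is not used elsewhere in this lab.

```python
import re
from collections import Counter


def accent_of(filename):
    """Return the accent prefix of a name such as 'arabic100.wav'."""
    # drop the extension, if there is one
    stem = filename.rsplit(".", 1)[0]
    # drop the trailing speaker number
    return re.sub(r"\d+$", "", stem)


# a few example names in the style used by this dataset
names = ["arabic100.wav", "english229.wav", "mandarin11.wav", "english42.wav"]
counts = Counter(accent_of(name) for name in names)
print(counts)
```

The same prefix test is what the dataset class later in this lab relies on when it maps file IDs to class indices.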

All speakers read the same passage: "Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."

There are even some Estonian speakers in the database; you can listen to their speech here: http://accent.gmu.edu/browse_language.php?function=find&language=estonian. In our experiments, however, we use only three accent groups: English (i.e., native speakers), Arabic and Mandarin.

In [0]:
! rm -rf data
! mkdir -p data
! unzip -q accent_subset_wav.zip -d data

In [0]:
! ls data/ | head

arabic100.wav
arabic101.wav
arabic102.wav
arabic10.wav
arabic11.wav
arabic12.wav
arabic13.wav
arabic14.wav
arabic15.wav
arabic16.wav


The dataset is highly skewed: there are many more native English speakers than Arabic and Mandarin speakers:

In [0]:
! ls data/english*.wav | wc -l
! ls data/arabic*.wav | wc -l
! ls data/mandarin*.wav | wc -l

579
102
65


First, we will learn how to extract features from the audio. We will use the python_speech_features package for that.

In [0]:
! pip install python_speech_features

Collecting python_speech_features
  Downloading https://files.pythonhosted.org/packages/ff/d1/94c59e20a2631985fbd2124c45177abaa9e0a4eee8ba8a305aa26fc02a8e/python_speech_features-0.6.tar.gz
Building wheels for collected packages: python-speech-features
  Building wheel for python-speech-features (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/3c/42/7c/f60e9d1b40015cd69b213ad90f7c18a9264cd745b9888134be
Successfully built python-speech-features
Installing collected packages: python-speech-features
Successfully installed python-speech-features-0.6


In [0]:
from python_speech_features import logfbank, mfcc
import scipy.io.wavfile as wav


I randomly split the dataset into a train set and a dev set. Let's download the split definitions:

In [0]:
! rm -f accent_trainset.txt accent_devset.txt
! wget -q  https://phon.ioc.ee/~tanela/tmp/accent_trainset.txt https://phon.ioc.ee/~tanela/tmp/accent_devset.txt

The files just list all the IDs in the train and dev set:

In [0]:
! head accent_devset.txt

english101
mandarin11
english229
arabic7
english466
english516
mandarin55
english42
english349
english367


In [0]:
import torch
from torch.utils.data import Dataset, DataLoader

In [0]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


Let's define a class that can read and represent a dataset.

In this lab, we will only use the first 10 seconds of every file, because handling audio files of different lengths is a bit complicated.
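
To see why 1000 feature vectors correspond to roughly 10 seconds, the arithmetic can be sketched as below. It assumes the default framing of `python_speech_features` (a 25 ms window advanced in 10 ms steps); the function name `frames_to_seconds` is made up for this illustration.

```python
def frames_to_seconds(num_frames, winlen=0.025, winstep=0.01):
    """Audio duration covered by `num_frames` overlapping analysis windows.

    winlen and winstep are assumed to be the library defaults (in seconds).
    """
    # the first frame covers winlen seconds, every further frame adds winstep
    return (num_frames - 1) * winstep + winlen


duration = frames_to_seconds(1000)
print("1000 frames cover about %.3f seconds" % duration)
```

With a 10 ms step there are about 100 feature vectors per second, so keeping the first 1000 vectors keeps about the first 10 seconds.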

In [0]:
class AccentDataset(Dataset):
    def __init__(self, ids_file, root_dir):
               
        self.data = []
        with open(ids_file) as f:
          ids = [l.strip() for l in f]
        for i, id in enumerate(ids):
          print("\rReading %d-th file" % i, end="")
          # class indices: 0 = arabic (default), 1 = english, 2 = mandarin
          language_id = 0
          if id.startswith("english"):
            language_id = 1
          elif id.startswith("mandarin"):
            language_id = 2
          # read the audio from the directory given by root_dir
          (rate, audio) = wav.read("%s/%s.wav" % (root_dir, id))
          assert rate == 16000
          # we use only the first 1000 feature vectors, i.e., roughly the first 10 seconds of the audio
          features = logfbank(audio, 16000)[0:1000]
          feat_tensor = torch.FloatTensor(features)
          # each member in the dataset is a tuple of (features, language_id)
          self.data.append((feat_tensor, language_id))
      
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

In [0]:
train_dataset = AccentDataset("accent_trainset.txt", root_dir="data")

Reading 675-th file

In [0]:
dev_dataset = AccentDataset("accent_devset.txt", root_dir="data")

Reading 69-th file

In [0]:
train_iter = DataLoader(train_dataset, batch_size=32,  shuffle=True)
dev_iter = DataLoader(dev_dataset, batch_size=32,  shuffle=False)


Now we can define our model.

Note how the CNN model is very similar to the model we used for text classification. Only here we don't have to use word embeddings, because the speech features already look a bit like embeddings.
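
It can help to trace how the time axis shrinks through the convolution and pooling layers of the CNN defined below. The sketch uses the standard output-length formulas for unpadded convolution and for max pooling whose stride equals its kernel size; the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def max_pool1d_out_len(length, kernel_size):
    """Output length of max pooling whose stride equals the kernel size."""
    return (length - kernel_size) // kernel_size + 1


length = 1000  # number of feature vectors per utterance
trace = [length]
# conv1..conv3 are each followed by max pooling with kernel size 2
for kernel in (5, 5, 3):
    length = conv1d_out_len(length, kernel)
    trace.append(length)
    length = max_pool1d_out_len(length, 2)
    trace.append(length)
# conv4 is followed by a global max pool, which reduces the time axis to 1
length = conv1d_out_len(length, 3)
trace.append(length)
print(trace)
```

The final global max pool collapses whatever length remains into a single 64-dimensional vector, which is why the linear layer takes 64 inputs regardless of the input length.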

In [0]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import sys

In [0]:
class AccentCnn(nn.Module):
  
  def __init__(self, num_classes, feature_dim, dropout_prob=0.2):
    super(AccentCnn, self).__init__()
    self.input_bn = nn.BatchNorm1d(feature_dim)
    self.conv1 = nn.Conv1d(feature_dim, 32, kernel_size=5, stride=1)
    self.conv2 = nn.Conv1d(32, 64, kernel_size=5, stride=1)
    self.conv3 = nn.Conv1d(64, 64, kernel_size=3, stride=1)
    self.conv4 = nn.Conv1d(64, 64, kernel_size=3, stride=1)
    
    
    self.dropout = nn.Dropout(dropout_prob)
    self.fc = nn.Linear(64, num_classes)
    
  def forward(self, x):
    # Conv1d takes in (batch, channels, seq_len), but raw signal is (batch, seq_len, channels)
    x = x.permute(0, 2, 1).contiguous()
    x = self.input_bn(x)
    #print(x.shape)
    x = F.relu(self.conv1(x))
    x = F.max_pool1d(x, 2)
    #print(x.shape)
    x = F.relu(self.conv2(x))
    x = F.max_pool1d(x, 2)
    #print(x.shape)
    x = F.relu(self.conv3(x))
    x = F.max_pool1d(x, 2)
    #print(x.shape)
    
    x = F.relu(self.conv4(x))
    x = F.max_pool1d(x, x.size(2))
    #print(x.shape)
    x = x.view(-1, 64)
    #print(x.shape)
    x = self.dropout(x) 
    logit = self.fc(x)
    return logit

In [0]:
def train(model, num_epochs, train_iter, dev_iter, device, log_interval=10):

  optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

  steps = 0
  best_acc = 0
  last_step = 0
  
  for epoch in range(1, num_epochs+1):
    print("Starting epoch %d" % epoch)
    for batch in train_iter:
      # set training mode
      model.train()
      fbank, target = batch
      fbank, target = fbank.to(device), target.to(device)
      

      optimizer.zero_grad()
      logit = model(fbank)

      loss = F.cross_entropy(logit, target)
      loss.backward()
      optimizer.step()

    train_acc = evaluate("train", train_iter, model)                
    dev_acc = evaluate("dev", dev_iter, model)                

def evaluate(name, data_iter, model):
  # set evaluation mode (turns off dropout, makes batch norm use its running statistics)
  model.eval()
  corrects, avg_loss = 0, 0
  # gradients are not needed for evaluation
  with torch.no_grad():
    for batch in data_iter:
      fbank, target = batch
      fbank, target = fbank.to(device), target.to(device)

      logit = model(fbank)
      loss = F.cross_entropy(logit, target, reduction='sum')

      avg_loss += loss.item()
      # count the predictions whose highest-scoring class equals the target
      corrects += (logit.argmax(dim=1) == target).sum().item()

  size = len(data_iter.dataset)
  avg_loss /= size
  accuracy = 100.0 * float(corrects)/size
  print('  Evaluation on {} - loss: {:.6f}  acc: {:.4f}%({}/{})'.format(name,
                                                                        avg_loss, 
                                                                        accuracy, 
                                                                        corrects, 
                                                                        size))
  return accuracy, avg_loss

In [0]:
model = AccentCnn(3, train_dataset[0][0].shape[1]).to(device)
train(model, 20, train_iter, dev_iter, device=device)

Starting epoch 1
Evaluation on train - loss: 0.723732  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.587152  acc: 81.4286%(57/70)
Starting epoch 2
Evaluation on train - loss: 0.718906  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.595017  acc: 81.4286%(57/70)
Starting epoch 3
Evaluation on train - loss: 0.668370  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.567475  acc: 81.4286%(57/70)
Starting epoch 4
Evaluation on train - loss: 0.654811  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.568043  acc: 81.4286%(57/70)
Starting epoch 5
Evaluation on train - loss: 0.655934  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.579484  acc: 81.4286%(57/70)
Starting epoch 6
Evaluation on train - loss: 0.638122  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.563410  acc: 81.4286%(57/70)
Starting epoch 7
Evaluation on train - loss: 0.626791  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.559641  acc: 81.4286%(57/70)
Starting epoch 8
Evaluation on train - loss: 0.614721  

Note how the model reaches an accuracy of 81% already after the first epoch. But let's not fool ourselves. The dataset is highly skewed: 57 of the 70 speakers in the dev data are native English speakers. The model quickly learned that it is a good idea to classify everybody as English, and this alone gives 81.4% (57/70). Of course, we don't need a complex neural network model to do that. We can see that the final model classifies only a few other items correctly besides the 57 English ones. This is not really surprising, because the dataset is rather small and classifying speech is more difficult than classifying text, where certain words can act as very good hints for classification.
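
The majority-class baseline mentioned above is easy to compute explicitly. The sketch below uses only the counts visible in the training log (57 English speakers out of 70 dev items); the remaining 13 items are lumped together under a placeholder label, and the function name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy (in percent) of always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return 100.0 * majority_count / len(labels)


# 57 of the 70 dev speakers are English; "other" stands in for the rest
dev_labels = ["english"] * 57 + ["other"] * 13
baseline = majority_baseline_accuracy(dev_labels)
print("%.4f%%" % baseline)
```

Any model on this data should be compared against this number rather than against chance level.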


## Residual connections

(Based on https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035)

According to the universal approximation theorem, a feedforward network with a single hidden layer is, given enough capacity, sufficient to approximate any continuous function. However, that layer might have to be massive, and the network is prone to overfitting the data. Therefore, the common trend in the research community has been to make network architectures deeper.

However, increasing network depth does not work by simply stacking layers together. Deep networks are hard to train because of the notorious vanishing gradient problem: as the gradient is back-propagated to earlier layers, repeated multiplication may make the gradient vanishingly small. As a result, as the network goes deeper, its performance saturates or even starts degrading rapidly.

The core idea of ResNet is introducing a so-called “identity shortcut connection” that skips one or more layers, as shown in the following figure:

![Resnet](https://cdn-images-1.medium.com/max/1200/1*ByrVJspW-TefwlH7OLxNkg.png)

The authors argue that stacking layers shouldn't degrade the network performance, because we could simply stack identity mappings (layers that don't do anything) upon the current network, and the resulting architecture would perform the same. This indicates that the deeper model should not produce a training error higher than its shallower counterparts. They hypothesize that letting the stacked layers fit a residual mapping is easier than letting them directly fit the desired underlying mapping. The residual block above explicitly allows the layers to do precisely that.
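
The effect of the shortcut on gradients can be illustrated with a deliberately simplified toy: each "layer" is a scalar multiplication by a weight `w`, so a plain layer computes `w * x` and a residual layer computes `x + w * x`. This is only a sketch of the intuition under that assumption, not a model of a real ResNet, and the function names are made up.

```python
def plain_stack_gradient(weights):
    """d(output)/d(input) for a stack of layers y = w * x."""
    grad = 1.0
    for w in weights:
        grad *= w
    return grad


def residual_stack_gradient(weights):
    """d(output)/d(input) for a stack of layers y = x + w * x."""
    grad = 1.0
    for w in weights:
        # the identity shortcut contributes the constant 1 to every factor
        grad *= 1.0 + w
    return grad


weights = [0.01] * 20  # 20 layers with small weights
plain = plain_stack_gradient(weights)
residual = residual_stack_gradient(weights)
print("plain: %.3e   residual: %.3f" % (plain, residual))
```

With small weights the plain stack's gradient collapses towards zero, while the residual stack stays close to the identity, whose gradient is exactly 1 when all weights are zero.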

Nowadays, there are many variants of ResNets (see below); we will only implement the original one.

![Resnet variants](https://cdn-images-1.medium.com/max/1600/1*M5NIelQC33eN6KjwZRccoQ.png)

In [0]:
class ResidualBlock(nn.Module):

    def __init__(self, dim):
        super(ResidualBlock, self).__init__()
        
        self.conv1 = nn.Conv1d(dim, dim, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm1d(dim)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm1d(dim)
        

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        # add the original input and the transformed input together
        out += residual
        out = self.relu(out)

        return out

In [0]:
class AccentResNet(nn.Module):
  
  def __init__(self, num_classes, feature_dim, dropout_prob=0.2):
    super(AccentResNet, self).__init__()
    self.layers = nn.Sequential(
        # First, apply a simple Convnet to raw input
        nn.BatchNorm1d(feature_dim),
        nn.Conv1d(feature_dim, 32, kernel_size=5, stride=1),
        nn.ReLU(inplace=True),
        
        # Now, apply the first residual block
        ResidualBlock(32),
        
        # Just increase the number of filters using a convolution with kernel size 1
        nn.Conv1d(32, 48, kernel_size=1, stride=1),
        # And compress the signal in time
        nn.MaxPool1d(kernel_size=2),
        
        # Now, apply the second residual block
        ResidualBlock(48),

        nn.Conv1d(48, 64, kernel_size=1, stride=1),
        nn.MaxPool1d(kernel_size=2),
        
        # Now, apply the third residual block
        ResidualBlock(64),        
    )
    
    self.dropout = nn.Dropout(dropout_prob)
    self.fc = nn.Linear(64, num_classes)
    
  def forward(self, x):
    # Conv1d takes in (batch, channels, seq_len), but raw signal is (batch, seq_len, channels)
    x = x.permute(0, 2, 1).contiguous()
    x = self.layers(x)
  
    
    # global average pooling over the time axis
    x = F.avg_pool1d(x, x.size(2))
    #print(x.shape)
    x = x.view(-1, 64)
    #print(x.shape)
    x = self.dropout(x) 
    logit = self.fc(x)
    return logit

In [0]:
model_resnet = AccentResNet(3, train_dataset[0][0].shape[1]).to(device)

In [0]:
model_resnet.forward(next(iter(dev_iter))[0].to(device))

tensor([[ 0.3262,  0.2829, -0.1772],
        [ 0.0474,  0.1098, -0.1222],
        [ 0.0927,  0.1707, -0.0980],
        [ 0.2885,  0.1500, -0.0885],
        [ 0.0090,  0.2199, -0.0631],
        [ 0.1040,  0.3576, -0.0077],
        [ 0.1578,  0.2775, -0.0416],
        [ 0.1955,  0.1735, -0.1354],
        [ 0.0630,  0.1884, -0.0043],
        [ 0.1822,  0.3294, -0.1233],
        [ 0.1563,  0.2607, -0.1367],
        [ 0.0560,  0.3536, -0.1032],
        [ 0.1126,  0.1777,  0.0370],
        [ 0.0725,  0.0604,  0.0303],
        [ 0.0626,  0.3047, -0.1097],
        [ 0.1962,  0.1891,  0.0445],
        [ 0.2630,  0.2101, -0.1371],
        [ 0.1929,  0.3090, -0.2444],
        [ 0.1634,  0.0663,  0.0884],
        [ 0.2306,  0.1914, -0.0806],
        [ 0.2364,  0.3678, -0.1532],
        [ 0.0883,  0.4016, -0.1731],
        [ 0.1225, -0.0120,  0.0658],
        [ 0.0733,  0.1131,  0.0288],
        [ 0.2509,  0.2361,  0.0050],
        [ 0.1716,  0.2496, -0.2303],
        [ 0.1247,  0.2450, -0.2005],
 

In [0]:
model_resnet = AccentResNet(3, train_dataset[0][0].shape[1]).to(device)
train(model_resnet, 20, train_iter, dev_iter, device=device)

Starting epoch 1
Evaluation on train - loss: 0.722397  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.658993  acc: 81.4286%(57/70)
Starting epoch 2
Evaluation on train - loss: 0.647977  acc: 77.8107%(526/676)
Evaluation on dev - loss: 0.463318  acc: 80.0000%(56/70)
Starting epoch 3
Evaluation on train - loss: 0.607394  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.486558  acc: 81.4286%(57/70)
Starting epoch 4
Evaluation on train - loss: 0.558641  acc: 78.5503%(531/676)
Evaluation on dev - loss: 0.428126  acc: 81.4286%(57/70)
Starting epoch 5
Evaluation on train - loss: 0.574249  acc: 78.2544%(529/676)
Evaluation on dev - loss: 0.449939  acc: 81.4286%(57/70)
Starting epoch 6
Evaluation on train - loss: 0.534179  acc: 79.7337%(539/676)
Evaluation on dev - loss: 0.442328  acc: 85.7143%(60/70)
Starting epoch 7
Evaluation on train - loss: 0.508158  acc: 82.1006%(555/676)
Evaluation on dev - loss: 0.332880  acc: 88.5714%(62/70)
Starting epoch 8
Evaluation on train - loss: 0.508253  

## Label smoothing

When we apply the cross-entropy loss to a classification task, we expect the true label to have probability 1 and all others 0. In other words, we have no doubt that the given labels are correct. Is that always true? Maybe not. Many manual annotations are the result of work by multiple annotators. They might use different criteria. They might make mistakes. They are human, after all. As a result, the ground truth labels that we fully trusted are possibly wrong.

One possible solution is to relax our confidence in the labels. For instance, we can slightly lower the target value of the true class from 1 to, say, 0.9, and correspondingly raise the target value of the other classes slightly above 0, so that the targets still sum to 1. This idea is called label smoothing.
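
As a small worked example, the smoothed target distribution and the corresponding loss can be written in plain Python. This is a sketch under the convention used in this lab (the smoothing mass is split evenly among the non-target classes); the function names are hypothetical.

```python
import math


def smoothed_targets(label, num_classes=3, label_smoothing=0.1):
    """Target distribution with 1 - label_smoothing on the true class."""
    # the remaining probability mass is shared by the other classes
    non_target_prob = label_smoothing / (num_classes - 1)
    targets = [non_target_prob] * num_classes
    targets[label] = 1.0 - label_smoothing
    return targets


def soft_cross_entropy(logits, targets):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    # numerically stable log of the softmax normalizer
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(targets, logits))


targets = smoothed_targets(1)
print(targets)
print(soft_cross_entropy([0.0, 0.0, 0.0], targets))
```

The KL divergence used in the training code below differs from this cross-entropy only by the entropy of the target distribution, which does not depend on the model, so both give the same gradients. Recent PyTorch versions also accept a `label_smoothing` argument in `F.cross_entropy`; as far as I know it spreads the smoothing mass over all classes including the target, so its numbers differ slightly from the convention here.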

In [0]:
def train_with_label_smoothing(model, num_epochs, train_iter, dev_iter, device,  label_smoothing=0.1, log_interval=10):

  assert (label_smoothing >= 0.0 and label_smoothing <= 1.0)
  
  # Each non-target class gets a target probability of label_smoothing / 2.0
  non_target_prob = label_smoothing / 2.0
    
  optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

  steps = 0
  best_acc = 0
  last_step = 0
  
  criterion = nn.KLDivLoss(reduction='sum')
  
  for epoch in range(1, num_epochs+1):
    print("Starting epoch %d" % epoch)
    for batch in train_iter:
      # set training mode
      model.train()
      fbank, target = batch
      fbank, target = fbank.to(device), target.to(device)
      batch_size = target.shape[0]
      
      # Create a one-hot encoded target matrix of size (batch_size, 3)
      # where the values corresponding to correct labels are 1.0-label_smoothing
      # and everywhere else there is non_target_prob (equal to 0.05, if label_smoothing is 0.1)
      target_one_hot = torch.ones(batch_size, 3).to(device)      
      target_one_hot *= non_target_prob
      target_one_hot[torch.arange(batch_size, dtype=torch.int64).to(device), target] = 1.0 - label_smoothing
      

      optimizer.zero_grad()
      logit = model(fbank)
      # normalize over the class dimension
      log_probabilities = F.log_softmax(logit, dim=1)
      # reduction='sum' already returns a scalar (summed over the batch)
      loss = criterion(log_probabilities, target_one_hot)
      loss.backward()
      optimizer.step()


    train_acc = evaluate("train", train_iter, model)                
    dev_acc = evaluate("dev", dev_iter, model)                


In [0]:
model_resnet = AccentResNet(3, train_dataset[0][0].shape[1]).to(device)
train_with_label_smoothing(model_resnet, 30, train_iter, dev_iter, device=device, label_smoothing=0.2)

Starting epoch 1




Evaluation on train - loss: 0.893660  acc: 77.2189%(522/676)
Evaluation on dev - loss: 0.865731  acc: 81.4286%(57/70)
Starting epoch 2
Evaluation on train - loss: 0.770791  acc: 77.9586%(527/676)
Evaluation on dev - loss: 0.710623  acc: 81.4286%(57/70)
Starting epoch 3
Evaluation on train - loss: 0.635451  acc: 79.4379%(537/676)
Evaluation on dev - loss: 0.573594  acc: 82.8571%(58/70)
Starting epoch 4
Evaluation on train - loss: 0.610297  acc: 80.6213%(545/676)
Evaluation on dev - loss: 0.581662  acc: 82.8571%(58/70)
Starting epoch 5
Evaluation on train - loss: 0.560857  acc: 81.8047%(553/676)
Evaluation on dev - loss: 0.505400  acc: 85.7143%(60/70)
Starting epoch 6
Evaluation on train - loss: 0.566048  acc: 82.9882%(561/676)
Evaluation on dev - loss: 0.502643  acc: 88.5714%(62/70)
Starting epoch 7
Evaluation on train - loss: 0.507788  acc: 80.6213%(545/676)
Evaluation on dev - loss: 0.453049  acc: 84.2857%(59/70)
Starting epoch 8
Evaluation on train - loss: 0.590833  acc: 84.9112%(574

## Learning rate scheduling

(Based on https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1)

When training deep neural networks, it is often useful to reduce learning rate as the training progresses. This can be done by using pre-defined learning rate schedules or adaptive learning rate methods.

### Learning Rate Schedules
Learning rate schedules seek to adjust the learning rate during training by reducing the learning rate according to a pre-defined schedule. Common learning rate schedules include time-based decay, step decay and exponential decay.

A step decay schedule drops the learning rate by a constant factor every epoch, or every few epochs. It's easy to implement using PyTorch's built-in `StepLR` class, as seen below.
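
The three schedules named above can be written as simple functions of the epoch number. The formulas below are the commonly used textbook forms and the parameter names are illustrative; they are a sketch, not the exact implementation of any library.

```python
import math


def time_based_decay(lr0, decay, epoch):
    """lr = lr0 / (1 + decay * epoch)"""
    return lr0 / (1.0 + decay * epoch)


def step_decay(lr0, gamma, step_size, epoch):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)


def exponential_decay(lr0, decay, epoch):
    """lr = lr0 * exp(-decay * epoch)"""
    return lr0 * math.exp(-decay * epoch)


# step decay with an initial rate of 0.01 and a factor of 0.9 per epoch
schedule = [step_decay(0.01, 0.9, 1, epoch) for epoch in range(3)]
print(["%.6f" % lr for lr in schedule])
```

With an initial rate of 0.01 and a factor of 0.9 applied every epoch, the first three values are 0.01, 0.009 and 0.0081, which is the sequence printed by the training run further down.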


In [0]:
def train_with_label_smoothing_lr_schedule(model, num_epochs, train_iter, dev_iter, device,  label_smoothing=0.1, lr_decay=0.5, log_interval=10):

  assert (label_smoothing >= 0.0 and label_smoothing <= 1.0)
  
  # Each non-target class gets a target probability of label_smoothing / 2.0
  non_target_prob = label_smoothing / 2.0
  
  # SGD is commonly used instead of Adam when an explicit learning rate schedule is applied
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
  
  # after every epoch (step_size=1), the learning rate is multiplied by lr_decay
  lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=lr_decay)

  steps = 0
  best_acc = 0
  last_step = 0
  
  criterion = nn.KLDivLoss(reduction='sum')
  
  for epoch in range(1, num_epochs+1):
    # read the current learning rate directly from the optimizer
    print("Starting epoch %d, learning rate is %f" % (epoch, optimizer.param_groups[0]["lr"]))
    for batch in train_iter:
      # set training mode
      model.train()
      fbank, target = batch
      fbank, target = fbank.to(device), target.to(device)
      batch_size = target.shape[0]
      
      # Create a one-hot encoded target matrix of size (batch_size, 3)
      # where the values corresponding to correct labels are 1.0-label_smoothing
      # and everywhere else there is non_target_prob (equal to 0.05, if label_smoothing is 0.1)
      target_one_hot = torch.ones(batch_size, 3).to(device)      
      target_one_hot *= non_target_prob
      target_one_hot[torch.arange(batch_size, dtype=torch.int64).to(device), target] = 1.0 - label_smoothing
      

      optimizer.zero_grad()
      logit = model(fbank)
      # normalize over the class dimension
      log_probabilities = F.log_softmax(logit, dim=1)
      # reduction='sum' already returns a scalar (summed over the batch)
      loss = criterion(log_probabilities, target_one_hot)
      loss.backward()
      optimizer.step()

    train_acc = evaluate("train", train_iter, model)                
    dev_acc = evaluate("dev", dev_iter, model)
    # since PyTorch 1.1, the scheduler is stepped after the optimizer updates of the epoch
    lr_scheduler.step()

In [0]:
model_resnet = AccentResNet(3, train_dataset[0][0].shape[1]).to(device)
train_with_label_smoothing_lr_schedule(model_resnet, 30, train_iter, dev_iter, device=device, label_smoothing=0.2, lr_decay=0.9)

Starting epoch 1, learning rate is 0.010000




  Evaluation on train - loss: 0.790495  acc: 77.2189%(522/676)
  Evaluation on dev - loss: 0.734411  acc: 81.4286%(57/70)
Starting epoch 2, learning rate is 0.009000
  Evaluation on train - loss: 0.762367  acc: 78.4024%(530/676)
  Evaluation on dev - loss: 0.723188  acc: 82.8571%(58/70)
Starting epoch 3, learning rate is 0.008100
  Evaluation on train - loss: 1.033133  acc: 47.9290%(324/676)
  Evaluation on dev - loss: 0.905768  acc: 60.0000%(42/70)
Starting epoch 4, learning rate is 0.007290
  Evaluation on train - loss: 0.673452  acc: 78.9941%(534/676)
  Evaluation on dev - loss: 0.623219  acc: 78.5714%(55/70)
Starting epoch 5, learning rate is 0.006561
  Evaluation on train - loss: 0.591710  acc: 78.8462%(533/676)
  Evaluation on dev - loss: 0.505518  acc: 80.0000%(56/70)
Starting epoch 6, learning rate is 0.005905
  Evaluation on train - loss: 0.745496  acc: 78.2544%(529/676)
  Evaluation on dev - loss: 0.674897  acc: 84.2857%(59/70)
Starting epoch 7, learning rate is 0.005314
  Ev

### Dynamic learning rate scheduling

Another option is to use dynamic learning rate scheduling: decrease the learning rate every time the model's performance on the dev data stops improving (i.e., when a "plateau" is reached).
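
The plateau rule can be sketched in a few lines of plain Python. This is a simplified illustration, not PyTorch's `ReduceLROnPlateau` (which additionally supports an improvement threshold, a cooldown period and a minimum learning rate); the class name `PlateauScheduler` is hypothetical.

```python
class PlateauScheduler:
    """Reduce the learning rate when a monitored loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            # improvement: remember it and reset the counter
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # more than `patience` epochs without improvement
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=0.01, factor=0.5, patience=1)
dev_losses = [0.838321, 0.632447, 0.472569, 0.468457, 0.472371, 0.592279]
history = [scheduler.step(loss) for loss in dev_losses]
print(history)
```

The dev losses used here are the first six values from the training log below: the loss stops improving after the fourth epoch, and after two epochs without improvement the learning rate is halved.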


In [0]:
def train_with_label_smoothing_adaptive_lr_schedule(model, num_epochs, train_iter, dev_iter, device,  label_smoothing=0.1, lr_decay=0.5, log_interval=10):

  assert (label_smoothing >= 0.0 and label_smoothing <= 1.0)
  
  # Each non-target class gets a target probability of label_smoothing / 2.0
  non_target_prob = label_smoothing / 2.0
  
  # SGD is commonly used instead of Adam when an explicit learning rate schedule is applied
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
  
  # multiply the learning rate by lr_decay when the dev loss has not improved for more than `patience` epochs
  lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=lr_decay, patience=1, verbose=True)

  steps = 0
  best_acc = 0
  last_step = 0
  
  criterion = nn.KLDivLoss(reduction='sum')
  
  for epoch in range(1, num_epochs+1):
    print("Starting epoch %d" % epoch)
    for batch in train_iter:
      # set training mode
      model.train()
      fbank, target = batch
      fbank, target = fbank.to(device), target.to(device)
      batch_size = target.shape[0]
      
      # Create a one-hot encoded target matrix of size (batch_size, 3)
      # where the values corresponding to correct labels are 1.0-label_smoothing
      # and everywhere else there is non_target_prob (equal to 0.05, if label_smoothing is 0.1)
      target_one_hot = torch.ones(batch_size, 3).to(device)      
      target_one_hot *= non_target_prob
      target_one_hot[torch.arange(batch_size, dtype=torch.int64).to(device), target] = 1.0 - label_smoothing
      

      optimizer.zero_grad()
      logit = model(fbank)
      # normalize over the class dimension
      log_probabilities = F.log_softmax(logit, dim=1)
      # reduction='sum' already returns a scalar (summed over the batch)
      loss = criterion(log_probabilities, target_one_hot)
      
      loss.backward()
      optimizer.step()
      
    train_acc, train_loss = evaluate("train", train_iter, model)                
    dev_acc, dev_loss = evaluate("dev", dev_iter, model)       
    lr_scheduler.step(dev_loss)
    

In [0]:
model_resnet = AccentResNet(3, train_dataset[0][0].shape[1]).to(device)
train_with_label_smoothing_adaptive_lr_schedule(model_resnet, 30, train_iter, dev_iter, device=device, label_smoothing=0.2, lr_decay=0.5)

Starting epoch 1




  Evaluation on train - loss: 0.874490  acc: 77.2189%(522/676)
  Evaluation on dev - loss: 0.838321  acc: 81.4286%(57/70)
Starting epoch 2
  Evaluation on train - loss: 0.681129  acc: 77.3669%(523/676)
  Evaluation on dev - loss: 0.632447  acc: 81.4286%(57/70)
Starting epoch 3
  Evaluation on train - loss: 0.600213  acc: 77.5148%(524/676)
  Evaluation on dev - loss: 0.472569  acc: 80.0000%(56/70)
Starting epoch 4
  Evaluation on train - loss: 0.555386  acc: 77.5148%(524/676)
  Evaluation on dev - loss: 0.468457  acc: 81.4286%(57/70)
Starting epoch 5
  Evaluation on train - loss: 0.555806  acc: 77.3669%(523/676)
  Evaluation on dev - loss: 0.472371  acc: 80.0000%(56/70)
Starting epoch 6
  Evaluation on train - loss: 0.640534  acc: 77.3669%(523/676)
  Evaluation on dev - loss: 0.592279  acc: 84.2857%(59/70)
Epoch     5: reducing learning rate of group 0 to 5.0000e-03.
Starting epoch 7
  Evaluation on train - loss: 0.565086  acc: 80.0296%(541/676)
  Evaluation on dev - loss: 0.500273  acc