In [3]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [1]:
import torch
import torchaudio

# Part 2 : Fine tuning

## Exercice 1 : Phone separability with aligned phonemes.

One option to evaluate the quality of the features trained with CPC can be to check if they can be used to recognize phonemes. 
To do so, we can fine-tune a pre-trained model using a limited amount of labelled speech data.
We are going to start with a simple evaluation setting where we have the phone labels for each timestep corresponding to a CPC feature.

We will work with a model already pre-trained on English data. As far as the fine-tuning dataset is concerned, we will use a 1h subset of [librispeech-100](http://www.openslr.org/12/). 

In [19]:
!mkdir checkpoint_data
!wget https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_30.pt -P checkpoint_data
!wget https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_logs.json -P checkpoint_data
!wget https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_args.json -P checkpoint_data
!ls checkpoint_data

--2020-06-29 10:11:59--  https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_30.pt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113599715 (108M) [application/octet-stream]
Saving to: ‘checkpoint_data/checkpoint_30.pt’


2020-06-29 10:12:10 (10.5 MB/s) - ‘checkpoint_data/checkpoint_30.pt’ saved [113599715/113599715]

--2020-06-29 10:12:12--  https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_logs.json
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20786 (20K) [text/plain]
Saving to: ‘c

In [15]:
%cd /content/CPC_audio
from cpc.dataset import parseSeqLabels
from cpc.feature_loader import loadModel

checkpoint_path = 'checkpoint_data/checkpoint_30.pt'
cpc_model, HIDDEN_CONTEXT_MODEL, HIDDEN_ENCODER_MODEL = loadModel([checkpoint_path])
cpc_model = cpc_model.cuda()
label_dict, N_PHONES = parseSeqLabels('/content/converted_aligned_phones.txt')
dataset_train = load_dataset('/content/train_data', file_extension='.flac', phone_label_dict=label_dict)
dataset_val = load_dataset('/content/val_data', file_extension='.flac', phone_label_dict=label_dict)
data_loader_train = dataset_train.getDataLoader(BATCH_SIZE, "speaker", True)
data_loader_val = dataset_val.getDataLoader(BATCH_SIZE, "sequence", False)

/content/CPC_audio
Loading checkpoint checkpoint_data/checkpoint_30.pt
Loading the state dict at checkpoint_data/checkpoint_30.pt


162it [00:00, 2417.18it/s]

Saved cache file at /content/train_data/_seqs_cache.txt





Checking length...


1430it [00:00, 1117126.97it/s]


Done, elapsed: 0.274 seconds
Scanned 1430 sequences in 0.27 seconds
1 chunks computed
Joining pool
Joined process, elapsed=7.480 secs


169it [00:00, 662.54it/s]


Saved cache file at /content/val_data/_seqs_cache.txt
Checking length...


316it [00:00, 60357.94it/s]


Done, elapsed: 0.875 seconds
Scanned 316 sequences in 0.88 seconds
1 chunks computed
Joining pool
Joined process, elapsed=2.434 secs


In [16]:
??cpc_model

Then we will use a simple linear classifier to recognize the phonemes from the features produced by ```cpc_model```. 

### a) Build the phone classifier 

Design a class of linear classifiers, ```PhoneClassifier``` that will take as input a batch of sequences of CPC features and output a score vector for each phoneme

In [20]:
class PhoneClassifier(torch.nn.Module):

  def __init__(self,
               input_dim : int,
               n_phones : int):
    super(PhoneClassifier, self).__init__()
    self.linear = torch.nn.Linear(input_dim, n_phones)
    

  def forward(self, x):
    return self.linear(x)

Our phone classifier will then be:

In [None]:
phone_classifier = PhoneClassifier(HIDDEN_CONTEXT_MODEL, N_PHONES).to(device)

### b - What would be the correct loss criterion for this task ?



In [19]:
loss_criterion = torch.nn.CrossEntropyLoss()

To perform the fine-tuning, we will also need an optimization function.

We will use an [Adam optimizer ](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam).

In [20]:
parameters = list(phone_classifier.parameters()) + list(cpc_model.parameters())
LEARNING_RATE = 2e-4
optimizer = torch.optim.Adam(parameters, lr=LEARNING_RATE)

You might also want to perform this training while freezing the weights of the ```cpc_model```. Indeed, if the pre-training was good enough, then ```cpc_model``` phonemes representation should be linearly separable. In this case the optimizer should be defined like this:

In [24]:
optimizer_frozen = torch.optim.Adam(list(phone_classifier.parameters()), lr=LEARNING_RATE)

### c- Now let's build a training loop. 
Complete the function ```train_one_epoch``` below.



In [58]:
def train_one_epoch(cpc_model, 
                    phone_classifier, 
                    loss_criterion, 
                    data_loader, 
                    optimizer):

  cpc_model.train()
  loss_criterion.train()

  avg_loss = 0
  avg_accuracy = 0
  n_items = 0
  for step, full_data in enumerate(data_loader):
    # Each batch is represented by a Tuple of vectors:
    # sequence of size : N x 1 x T
    # label of size : N x T
    # 
    # With :
    # - N number of sequence in the batch
    # - T size of each sequence
    sequence, label = full_data
    
    

    bs = len(sequence)
    seq_len = label.size(1)
    optimizer.zero_grad()
    context_out, enc_out, _ = cpc_model(sequence.to(device),label.to(device))

    scores = phone_classifier(context_out)

    scores = scores.permute(0,2,1)
    loss = loss_criterion(scores,label.to(device))
    loss.backward()
    optimizer.step()
    avg_loss+=loss.item()*bs
    n_items+=bs
    correct_labels = scores.argmax(1)
    avg_accuracy += ((label==correct_labels.cpu()).float()).mean(1).sum().item()
  avg_loss/=n_items
  avg_accuracy/=n_items
  return avg_loss, avg_accuracy
    

Don't forget to test it !

In [59]:
avg_loss, avg_accuracy = train_one_epoch(cpc_model, phone_classifier, loss_criterion, data_loader_train, optimizer_frozen)

In [60]:
avg_loss, avg_accuracy

(1.3621995804512956, 0.6497440501715266)

### d- Build the validation loop

In [61]:
def validation_step(cpc_model, 
                    phone_classifier, 
                    loss_criterion, 
                    data_loader):
  
  cpc_model.eval()
  phone_classifier.eval()

  avg_loss = 0
  avg_accuracy = 0
  n_items = 0
  with torch.no_grad():
    for step, full_data in enumerate(data_loader):
      # Each batch is represented by a Tuple of vectors:
      # sequence of size : N x 1 x T
      # label of size : N x T
      # 
      # With :
      # - N number of sequence in the batch
      # - T size of each sequence
      sequence, label = full_data
      bs = len(sequence)
      seq_len = label.size(1)
      context_out, enc_out, _ = cpc_model(sequence.to(device),label.to(device))
      scores = phone_classifier(context_out)
      scores = scores.permute(0,2,1)
      loss = loss_criterion(scores,label.to(device))
      avg_loss+=loss.item()*bs
      n_items+=bs
      correct_labels = scores.argmax(1)
      avg_accuracy += ((label==correct_labels.cpu()).float()).mean(1).sum().item()
  avg_loss/=n_items
  avg_accuracy/=n_items
  return avg_loss, avg_accuracy

### e- Run everything

Test this functiion with both ```optimizer``` and ```optimizer_frozen```.

In [65]:
def run(cpc_model, 
        phone_classifier, 
        loss_criterion, 
        data_loader_train, 
        data_loader_val, 
        optimizer,
        n_epoch):

  for epoch in range(n_epoch):

    print(f"Running epoch {epoch + 1} / {n_epoch}")
    loss_train, acc_train = train_one_epoch(cpc_model, phone_classifier, loss_criterion, data_loader_train, optimizer)
    print("-------------------")
    print(f"Training dataset :")
    print(f"Average loss : {loss_train}. Average accuracy {acc_train}")

    print("-------------------")
    print("Validation dataset")
    loss_val, acc_val = validation_step(cpc_model, phone_classifier, loss_criterion, data_loader_val)
    print(f"Average loss : {loss_val}. Average accuracy {acc_val}")
    print("-------------------")
    print()

In [67]:
run(cpc_model,phone_classifier,loss_criterion,data_loader_train,data_loader_val,optimizer_frozen,n_epoch=10)

Running epoch 1 / 10
-------------------
Training dataset :
Average loss : 1.0219527069604895. Average accuracy 0.7057498257933105
-------------------
Validation dataset
Average loss : 1.0688188383544701. Average accuracy 0.6966259057971015
-------------------

Running epoch 2 / 10
-------------------
Training dataset :
Average loss : 0.9813667901793229. Average accuracy 0.7150101397226987
-------------------
Validation dataset
Average loss : 1.0425163616304811. Average accuracy 0.7018257472826087
-------------------

Running epoch 3 / 10
-------------------
Training dataset :
Average loss : 0.9586697046657233. Average accuracy 0.7197957984919954
-------------------
Validation dataset
Average loss : 1.029444672577623. Average accuracy 0.7042572463768116
-------------------

Running epoch 4 / 10
-------------------
Training dataset :
Average loss : 0.9506696053679975. Average accuracy 0.7210777810534591
-------------------
Validation dataset
Average loss : 0.9972989958265553. Average ac

## Exercise 2 : Phone separability without alignment (PER)

Aligned data are very practical, but un real life they are rarely available. That's why in this excercise we will consider a fine-tuning with non-aligned phonemes.

The model, the optimizer and the phone classifier will stay the same. However, we will replace our phone criterion with a [CTC loss](https://pytorch.org/docs/master/generated/torch.nn.CTCLoss.html). 

In [16]:
loss_ctc = torch.nn.CTCLoss()

Besides, we will use a siglthy different dataset class.

In [17]:
%cd /content/CPC_audio
from cpc.eval.common_voices_eval import SingleSequenceDataset, parseSeqLabels, findAllSeqs
path_train_data_per = '/content/per_data/pack_master/1h'
path_val_data_per = '/content/per_data/pack_master/10min'
path_phone_data_per = '/content/per_data/pack_master/10h_phones.txt'
BATCH_SIZE=8

phone_labels, N_PHONES = parseSeqLabels(path_phone_data_per)
data_train_per, _ = findAllSeqs(path_train_data_per, extension='.flac')
dataset_train_non_aligned = SingleSequenceDataset(path_train_data_per, data_train_per, phone_labels)
data_loader_train = torch.utils.data.DataLoader(dataset_train_non_aligned, batch_size=BATCH_SIZE,
                                                shuffle=True)

data_val_per, _ = findAllSeqs(path_val_data_per, extension='.flac')
dataset_val_non_aligned = SingleSequenceDataset(path_val_data_per, data_val_per, phone_labels)
data_loader_val = torch.utils.data.DataLoader(dataset_val_non_aligned, batch_size=BATCH_SIZE,
                                              shuffle=True)

67it [00:00, 16115.29it/s]

/content/CPC_audio
Saved cache file at /content/per_data/pack_master/1h/_seqs_cache.txt



7it [00:00, 1867.22it/s]

Loaded 287 sequences in 4.05 seconds
maxSizeSeq : 273359
maxSizePhone : 207
minSizePhone : 17
Total size dataset 1.0406152430555555 hours
Saved cache file at /content/per_data/pack_master/10min/_seqs_cache.txt





Loaded 212 sequences in 2.49 seconds
maxSizeSeq : 273760
maxSizePhone : 188
minSizePhone : 17
Total size dataset 0.73 hours


### a- Training

Since the phonemes are not aligned, there is no simple direct way to get the classification acuracy of a model. Write and test the three functions ```train_one_epoch_ctc```, ```validation_step_ctc``` and ```run_ctc``` as before but without considering the average acuracy of the model. 

In [21]:
from cpc.feature_loader import loadModel

checkpoint_path = 'checkpoint_data/checkpoint_30.pt'
cpc_model, HIDDEN_CONTEXT_MODEL, HIDDEN_ENCODER_MODEL = loadModel([checkpoint_path])
cpc_model = cpc_model.cuda()
phone_classifier = PhoneClassifier(HIDDEN_CONTEXT_MODEL, N_PHONES).to(device)

Loading checkpoint checkpoint_data/checkpoint_30.pt
Loading the state dict at checkpoint_data/checkpoint_30.pt


In [19]:
parameters = list(phone_classifier.parameters()) + list(cpc_model.parameters())
LEARNING_RATE = 2e-4
optimizer = torch.optim.Adam(parameters, lr=LEARNING_RATE)

optimizer_frozen = torch.optim.Adam(list(phone_classifier.parameters()), lr=LEARNING_RATE)

In [34]:
import torch.nn.functional as F

def train_one_epoch_ctc(cpc_model, 
                        phone_classifier, 
                        loss_criterion, 
                        data_loader, 
                        optimizer):
  
  cpc_model.train()
  loss_criterion.train()

  avg_loss = 0
  avg_accuracy = 0
  n_items = 0
  for step, full_data in enumerate(data_loader):

    x, x_len, y, y_len = full_data

    x_batch_len = x.shape[-1]
    x, y = x.to(device), y.to(device)

    bs=x.size(0)
    optimizer.zero_grad()
    context_out, enc_out, _ = cpc_model(x.to(device),y.to(device))
  
    scores = phone_classifier(context_out)
    scores = scores.permute(1,0,2)
    scores = F.log_softmax(scores,2)
    yhat_len = torch.tensor([int(scores.shape[0]*x_len[i]/x_batch_len) for i in range(scores.shape[1])]) # this is an approximation, should be good enough

    loss = loss_criterion(scores,y.to(device),yhat_len,y_len)
    loss.backward()
    optimizer.step()
    avg_loss+=loss.item()*bs
    n_items+=bs
  avg_loss/=n_items
  return avg_loss

def validation_step(cpc_model, 
                    phone_classifier, 
                    loss_criterion, 
                    data_loader):

  cpc_model.eval()
  phone_classifier.eval()
  avg_loss = 0
  avg_accuracy = 0
  n_items = 0
  with torch.no_grad():
    for step, full_data in enumerate(data_loader):

      x, x_len, y, y_len = full_data

      x_batch_len = x.shape[-1]
      x, y = x.to(device), y.to(device)

      bs=x.size(0)
      context_out, enc_out, _ = cpc_model(x.to(device),y.to(device))
    
      scores = phone_classifier(context_out)
      scores = scores.permute(1,0,2)
      scores = F.log_softmax(scores,2)
      yhat_len = torch.tensor([int(scores.shape[0]*x_len[i]/x_batch_len) for i in range(scores.shape[1])]) # this is an approximation, should be good enough

      loss = loss_criterion(scores,y.to(device),yhat_len,y_len)
      avg_loss+=loss.item()*bs
      n_items+=bs
  avg_loss/=n_items

  return avg_loss

def run_ctc(cpc_model, 
            phone_classifier, 
            loss_criterion, 
            data_loader_train, 
            data_loader_val, 
            optimizer,
            n_epoch):
  for epoch in range(n_epoch):

    print(f"Running epoch {epoch + 1} / {n_epoch}")
    loss_train = train_one_epoch_ctc(cpc_model, phone_classifier, loss_criterion, data_loader_train, optimizer)
    print("-------------------")
    print(f"Training dataset :")
    print(f"Average loss : {loss_train}.")

    print("-------------------")
    print("Validation dataset")
    loss_val = validation_step(cpc_model, phone_classifier, loss_criterion, data_loader_val)
    print(f"Average loss : {loss_val}")
    print("-------------------")
    print()

In [40]:
run_ctc(cpc_model,phone_classifier,loss_ctc,data_loader_train,data_loader_val,optimizer_frozen,n_epoch=10)

Running epoch 1 / 10
-------------------
Training dataset :
Average loss : 32.44543953208657.
-------------------
Validation dataset
Average loss : 32.01081585093132
-------------------

Running epoch 2 / 10
-------------------
Training dataset :
Average loss : 30.99022026328774.
-------------------
Validation dataset
Average loss : 30.300324444522225
-------------------

Running epoch 3 / 10
-------------------
Training dataset :
Average loss : 29.319565432888645.
-------------------
Validation dataset
Average loss : 28.464903420181635
-------------------

Running epoch 4 / 10
-------------------
Training dataset :
Average loss : 27.567655403297262.
-------------------
Validation dataset
Average loss : 26.642856191119876
-------------------

Running epoch 5 / 10
-------------------
Training dataset :
Average loss : 25.832834390493538.
-------------------
Validation dataset
Average loss : 24.82515669546986
-------------------

Running epoch 6 / 10
-------------------
Training dataset :

### b- Evaluation: the Phone Error Rate (PER)

In order to compute the similarity between two sequences, we can use the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). This distance estimates the minimum number of insertion, deletion and addition to move from one sequence to another. If we normalize this distance by the number of characters in the reference sequence we get the Phone Error Rate (PER).

This value can be interpreted as :
\\[  PER = \frac{S + D + I}{N} \\]

Where:


*   N is the number of characters in the reference
*   S is the number of substitutiion
*   I in the number of insertion
*   D in the number of deletion

For the best possible alignment of the two sequences.



In [15]:
import numpy as np

def get_PER_sequence(ref_seq, target_seq):

  # re = g.split()
  # h = h.split()
  n = len(ref_seq)
  m = len(target_seq)

  D = np.zeros((n+1,m+1))
  for i in range(1,n+1):
    D[i,0] = D[i-1,0]+1
  for j in range(1,m+1):
    D[0,j] = D[0,j-1]+1
  
  ### TODO compute the alignment

  for i in range(1,n+1):
    for j in range(1,m+1):
      D[i,j] = min(
          D[i-1,j]+1,
          D[i-1,j-1]+1,
          D[i,j-1]+1,
          D[i-1,j-1]+ 0 if ref_seq[i-1]==target_seq[j-1] else float("inf")
      )
  return D[n,m]/len(ref_seq)
  

  #return PER

You can test your function below:

In [45]:
ref_seq = [0, 1, 1, 2, 0, 2, 2]
pred_seq = [1, 1, 2, 2, 0, 0]

expected_PER = 4. / 7.
print(get_PER_sequence(ref_seq, pred_seq) == expected_PER)

True


## c- Evaluating the PER of your model on the test dataset

Evaluate the PER on the validation dataset. Please notice that you should usually use a separate dataset, called the dev dataset, to perform this operation. However for the sake of simplicity we will work with validation data in this exercise.

In [16]:
import progressbar
from multiprocessing import Pool

def cut_data(seq, sizeSeq):
    maxSeq = sizeSeq.max()
    return seq[:, :maxSeq]


def prepare_data(data):
    seq, sizeSeq, phone, sizePhone = data
    seq = seq.cuda()
    phone = phone.cuda()
    sizeSeq = sizeSeq.cuda().view(-1)
    sizePhone = sizePhone.cuda().view(-1)

    seq = cut_data(seq.permute(0, 2, 1), sizeSeq).permute(0, 2, 1)
    return seq, sizeSeq, phone, sizePhone


def get_per(test_dataloader,
            cpc_model,
            phone_classifier):

  downsampling_factor = 160
  cpc_model.eval()
  phone_classifier.eval()

  avgPER = 0
  nItems = 0 

  print("Starting the PER computation through beam search")
  bar = progressbar.ProgressBar(maxval=len(test_dataloader))
  bar.start()

  for index, data in enumerate(test_dataloader):

    bar.update(index)

    with torch.no_grad():
      
        seq, sizeSeq, phone, sizePhone = prepare_data(data)
        c_feature, _, _ = cpc_model(seq.to(device),phone.to(device))
        sizeSeq = sizeSeq / downsampling_factor
        predictions = torch.nn.functional.softmax(
        phone_classifier(c_feature), dim=2).cpu()
        phone = phone.cpu()
        sizeSeq = sizeSeq.cpu()
        sizePhone = sizePhone.cpu()

        bs = c_feature.size(0)
        data_per = [(predictions[b].argmax(1),  phone[b]) for b in range(bs)]
        # data_per = [(predictions[b], sizeSeq[b], phone[b], sizePhone[b],
        #               "criterion.module.BLANK_LABEL") for b in range(bs)]

        with Pool(bs) as p:
            poolData = p.starmap(get_PER_sequence, data_per)
        avgPER += sum([x for x in poolData])
        nItems += len(poolData)

  bar.finish()

  avgPER /= nItems

  print(f"Average PER {avgPER}")
  return avgPER


In [91]:
get_per(data_loader_val,cpc_model,phone_classifier)

N/A% (0 of 27) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

Starting the PER computation through beam search


100% (27 of 27) |########################| Elapsed Time: 0:10:44 Time:  0:10:44


Average PER 0.9509821691500522


0.9509821691500522

## Exercice 3 : Character error rate (CER) 

The Character Error Rate (CER) is an evaluation metric similar to the PER but with characters insterad of phonemes. Using the following data, run the functions you defined previously to estimate the CER of your model after fine-tuning.

In [23]:
# Load a dataset labelled with the letters of each sequence.
%cd /content/CPC_audio
from cpc.eval.common_voices_eval import SingleSequenceDataset, parseSeqLabels, findAllSeqs
path_train_data_cer = '/content/per_data/pack_master/1h'
path_val_data_cer = '/content/per_data/pack_master/10min'
path_letter_data_cer = '/content/per_data/pack_master/chars.txt'
BATCH_SIZE=8

letters_labels, N_LETTERS = parseSeqLabels(path_letter_data_cer)
data_train_cer, _ = findAllSeqs(path_train_data_cer, extension='.flac')
dataset_train_non_aligned = SingleSequenceDataset(path_train_data_cer, data_train_cer, letters_labels)


data_val_cer, _ = findAllSeqs(path_val_data_cer, extension='.flac')
dataset_val_non_aligned = SingleSequenceDataset(path_val_data_cer, data_val_cer, letters_labels)


# The data loader will generate a tuple of tensors data, labels for each batch
# data : size N x T1 x 1 : the audio sequence
# label : size N x T2 the sequence of letters corresponding to the audio data
# IMPORTANT NOTE: just like the PER the CER is computed with non-aligned phone data.
data_loader_train_letters = torch.utils.data.DataLoader(dataset_train_non_aligned, batch_size=BATCH_SIZE,
                                                shuffle=True)
data_loader_val_letters = torch.utils.data.DataLoader(dataset_val_non_aligned, batch_size=BATCH_SIZE,
                                              shuffle=True)

67it [00:00, 10784.76it/s]

/content/CPC_audio
Saved cache file at /content/per_data/pack_master/1h/_seqs_cache.txt



7it [00:00, 1741.82it/s]

Loaded 287 sequences in 2.68 seconds
maxSizeSeq : 273359
maxSizePhone : 300
minSizePhone : 18
Total size dataset 1.0406152430555555 hours
Saved cache file at /content/per_data/pack_master/10min/_seqs_cache.txt





Loaded 212 sequences in 2.29 seconds
maxSizeSeq : 273760
maxSizePhone : 273
minSizePhone : 29
Total size dataset 0.73 hours


In [24]:
from cpc.feature_loader import loadModel

checkpoint_path = 'checkpoint_data/checkpoint_30.pt'
cpc_model, HIDDEN_CONTEXT_MODEL, HIDDEN_ENCODER_MODEL = loadModel([checkpoint_path])
cpc_model = cpc_model.cuda()
character_classifier = PhoneClassifier(HIDDEN_CONTEXT_MODEL, N_LETTERS).to(device)

Loading checkpoint checkpoint_data/checkpoint_30.pt
Loading the state dict at checkpoint_data/checkpoint_30.pt


In [27]:
parameters = list(character_classifier.parameters()) + list(cpc_model.parameters())
LEARNING_RATE = 2e-4
optimizer = torch.optim.Adam(parameters, lr=LEARNING_RATE)

optimizer_frozen = torch.optim.Adam(list(character_classifier.parameters()), lr=LEARNING_RATE)

In [25]:
loss_ctc = torch.nn.CTCLoss()

In [35]:
run_ctc(cpc_model,character_classifier,loss_ctc,data_loader_train_letters,data_loader_val_letters,optimizer_frozen,n_epoch=10)

Running epoch 1 / 10
-------------------
Training dataset :
Average loss : 17.15224294729166.
-------------------
Validation dataset
Average loss : 16.621312869103598
-------------------

Running epoch 2 / 10
-------------------
Training dataset :
Average loss : 15.531890602378578.
-------------------
Validation dataset
Average loss : 14.840831449246519
-------------------

Running epoch 3 / 10
-------------------
Training dataset :
Average loss : 13.80899079863008.
-------------------
Validation dataset
Average loss : 12.998893340052021
-------------------

Running epoch 4 / 10
-------------------
Training dataset :
Average loss : 12.093861906678526.
-------------------
Validation dataset
Average loss : 11.24841435599666
-------------------

Running epoch 5 / 10
-------------------
Training dataset :
Average loss : 10.522326436076131.
-------------------
Validation dataset
Average loss : 9.72009142654202
-------------------

Running epoch 6 / 10
-------------------
Training dataset :


In [36]:
get_per(data_loader_val_letters,cpc_model,character_classifier)

N/A% (0 of 27) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

Starting the PER computation through beam search


100% (27 of 27) |########################| Elapsed Time: 0:17:48 Time:  0:17:48


Average PER 0.9113886992183796


0.9113886992183796

# Part 1 : contrastive predictive coding

Contrastive Predictive Coding (CPC) is a method of unsupervised training for speech models. The idea behind it is pretty simple:


1.   The raw audio wave is passed through a convolutional network: the ```encoder```
2.   Then, the encoder's output is given to a recurrent network the ```context```
3. A third party network, the ```prediction_network``` will try to predict the  future embeddings of the encoder using the output of the context network.

In order to avoid a collapse to trivial solutions, the prediction_network doesn't try to reconstruct the future features. Instead, using the context output $c_t$ at time $t$ it is trained to discriminate the real  encoder representatioin $g_{t+k}$ at time $t+k$ from several other features $(g_n)_n$ taken elsewhere in the batch. Thus the loss becomes:

\\[ \mathcal{L}_c = - \frac{1}{K} \sum_{k=1}^K \text{Cross_entropy}(\phi_k(c_t), g_{t+k}) \\]

Or:

\\[ \mathcal{L}_c = - \frac{1}{K} \sum_{k=1}^K \log \frac{ \exp\left(\phi_k(c_t)^\top g_{t+k}\right) }{  \sum_{\mathbf{n}\in\mathcal{N}_t} \exp\left(\phi_k(c_t)^\top g_n\right)} \\]

Where:


*   $\phi_k$ is the prediction network for the kth timestep
*   $\mathcal{N}_t$ is the set of all negative examples sampled for timestep $t$




## Exercice 1 : Building the model

In this exercise, we will build and train a small CPC model using the repository CPC_audio.

The code below loads a context and an encoder newtorks.

In [2]:
%cd /content/CPC_audio
from cpc.model import CPCEncoder, CPCAR
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

DIM_ENCODER=256
DIM_CONTEXT=256
KEEP_HIDDEN_VECTOR=False
N_LEVELS_CONTEXT=1
CONTEXT_RNN="LSTM"
N_PREDICTIONS=12
LEARNING_RATE=2e-4
N_NEGATIVE_SAMPLE =128

/content/CPC_audio


In [3]:
encoder = CPCEncoder(DIM_ENCODER).to(device)
context = CPCAR(DIM_ENCODER, DIM_CONTEXT, KEEP_HIDDEN_VECTOR, 1, mode=CONTEXT_RNN).to(device)

In [4]:
# Several functions that will be necessary to load the data later
from cpc.dataset import findAllSeqs, AudioBatchData, parseSeqLabels
SIZE_WINDOW = 20480
BATCH_SIZE=8
def load_dataset(path_dataset, file_extension='.flac', phone_label_dict=None):
  data_list, speakers = findAllSeqs(path_dataset, extension=file_extension)
  dataset = AudioBatchData(path_dataset, SIZE_WINDOW, data_list, phone_label_dict, len(speakers))
  return dataset

Now build a new class, ```CPCModel``` which will

In [5]:
class CPCModel(torch.nn.Module):

    def __init__(self,
                 encoder,
                 AR):

        super(CPCModel, self).__init__()
        self.gEncoder = encoder
        self.gAR = AR

    def forward(self, batch_data):
        

        encoder_output = self.gEncoder(batch_data)
        #print(encoder_output.shape)
        # The output of the encoder data does not have the good format 
        # indeed it is Batch_size x Hidden_size x temp size
        # while the context requires Batch_size  x temp size x Hidden_size
        # thus you need to permute
        context_input = encoder_output.permute(0, 2, 1)

        context_output = self.gAR(context_input)
        #print(context_output.shape)
        return context_output, encoder_output

Let's test your code !


In [6]:
audio = torchaudio.load("/content/train_data/831/130739/831-130739-0048.flac")[0]
audio = audio.view(1, 1, -1)
cpc_model = CPCModel(encoder, context).to(device)
context_output, encoder_output = cpc_model(audio.to(device))

## Exercise 2 : CPC loss

We will define a class ```CPCCriterion``` which will hold the prediction networks $\phi_k$ defined above and perform the classification loss $\mathcal{L}_c$.

a) In this exercise, the $\phi_k$ will be a linear transform, ie:

\\[ \phi_k(c_t) = \mathbf{A}_k c_t\\]

Using the class [torch.nn.Linear](https://pytorch.org/docs/stable/nn.html#torch.nn.Linear), define the transformations $\phi_k$ in the code below and complete the function ```get_prediction_k``` which computes $\phi_k(c_t)$ for a given batch of vectors $c_t$.

b) Using both ```get_prediction_k```  and ```sample_negatives``` defined below, write the forward function which will take as input two batches of features $c_t$ and $g_t$ and outputs the classification loss $\mathcal{L}_c$ and the average acuracy for all predictions. 

In [7]:
# Exercice 2: write the CPC loss
# a) Write the negative sampling (with some help)
# ERRATUM: it's really hard, the sampling will be provided

class CPCCriterion(torch.nn.Module):

  def __init__(self,
               K,
               dim_context,
               dim_encoder,
               n_negative):
    super(CPCCriterion, self).__init__()
    self.K_ = K
    self.dim_context = dim_context
    self.dim_encoder = dim_encoder
    self.n_negative = n_negative

    self.predictors = torch.nn.ModuleList() 
    for k in range(self.K_):
      # TO COMPLETE !
      
      # A affine transformation in pytorch is equivalent to a nn.Linear layer
      # To get a linear transformation you must set bias=False
      # input dimension of the layer = dimension of the encoder
      # output dimension of the layer = dimension of the context
      self.predictors.append(torch.nn.Linear(dim_context, dim_encoder, bias=False))

  def get_prediction_k(self, context_data):

    #TO COMPLETE !
    output = [] 
    # For each time step k
    for k in range(self.K_):

      # We need to compute phi_k = A_k * c_t
      phi_k = self.predictors[k](context_data)
      output.append(phi_k)

    return output


  def sample_negatives(self, encoded_data):
    r"""
    Sample some negative examples in the given encoded data.
    Input:
    - encoded_data size: B x T x H
    Returns
    - outputs of size B x (n_negative + 1) x (T - K_) x H
      outputs[:, 0, :, :] contains the positive example
      outputs[:, 1:, :, :] contains negative example sampled in the batch
    - labels, long tensor of size B x (T - K_)
      Since the positive example is always at coordinates 0 for all sequences 
      in the batch and all timestep in the sequence, labels is just a tensor
      full of zeros !
    """
    batch_size, time_size, dim_encoded = encoded_data.size()
    window_size = time_size - self.K_
    outputs = []

    neg_ext = encoded_data.contiguous().view(-1, dim_encoded)
    n_elem_sampled = self.n_negative * window_size * batch_size
    # Draw nNegativeExt * batchSize negative samples anywhere in the batch
    batch_idx = torch.randint(low=0, high=batch_size,
                              size=(n_elem_sampled, ),
                              device=encoded_data.device)

    seq_idx = torch.randint(low=1, high=time_size,
                            size=(n_elem_sampled, ),
                            device=encoded_data.device)

    base_idx = torch.arange(0, window_size, device=encoded_data.device)
    base_idx = base_idx.view(1, 1, window_size)
    base_idx = base_idx.expand(1, self.n_negative, window_size)
    base_idx = base_idx.expand(batch_size, self.n_negative, window_size)
    seq_idx += base_idx.contiguous().view(-1)
    seq_idx = torch.remainder(seq_idx, time_size)

    ext_idx = seq_idx + batch_idx * time_size
    neg_ext = neg_ext[ext_idx].view(batch_size, self.n_negative,
                                    window_size, dim_encoded)
    label_loss = torch.zeros((batch_size, window_size),
                              dtype=torch.long,
                              device=encoded_data.device)

    for k in range(1, self.K_ + 1):

      # Positive samples
      if k < self.K_:
          pos_seq = encoded_data[:, k:-(self.K_-k)]
      else:
          pos_seq = encoded_data[:, k:]

      pos_seq = pos_seq.view(batch_size, 1, pos_seq.size(1), dim_encoded)
      full_seq = torch.cat((pos_seq, neg_ext), dim=1)
      outputs.append(full_seq)

    return outputs, label_loss

  def forward(self, encoded_data, context_data):

    # TO COMPLETE:
    # Perform the full cpc criterion
    # Returns 2 values:
    # - the average classification loss avg_loss
    # - the average classification acuracy avg_acc

    # Reminder : The permuation !
    encoded_data = encoded_data.permute(0, 2, 1)

    # First we need to sample the negative examples
    negative_samples, labels = self.sample_negatives(encoded_data)

    # Then we must compute phi_k
    phi_k = self.get_prediction_k(context_data)

    # Finally we must get the dot product between phi_k and negative_samples 
    # for each k

    #The total loss is the average of all losses
    avg_loss = 0

    # Average acuracy
    avg_acc = 0

    for k in range(self.K_):
      B, N_sampled, S_small, H = negative_samples[k].size() 
      B, S, H = phi_k[k].size()

      # As told before S = S_small + K. For segments too far in the sequence
      # there are no positive exmples anyway, so we must shorten phi_k
      phi = phi_k[k][:, :S_small]

      # Now the dot product
      # You have several ways to do that, let's do the simple but non optimal 
      # one
      # pytorch has a matrix product function https://pytorch.org/docs/stable/torch.html#torch.bmm
      # But it takes only 3D tensors of the same batch size !
      # To begin negative_samples is a 4D tensor ! 
      # We want to compute the dot product for each features, of each sequence
      # of the batch. Thus we are trying to compute a dot product for all
      # B* N_sampled * S_small 1D vector of negative_samples[k]
      # Or, a 1D tensor of size H is also a matrix of size 1 x H
      # Then, we must view it as a 3D tensor of size (B* N_sampled * S_small, 1, H)
      negative_sample_k  =  negative_samples[k].view(B* N_sampled* S_small, 1, H)

      # But now phi and negative_sample_k no longer have the same batch size !
      # No worries, we can expand phi so that each sequence of the batch
      # is repeated N_sampled times
      phi = phi.view(B, 1,S_small, H).expand(B, N_sampled, S_small, H)

      # And now we can view it as a 3D tensor 
      phi  = phi.contiguous().view(B * N_sampled * S_small, H, 1)

      # We can finally get the dot product !
      scores = torch.bmm(negative_sample_k, phi)

      # Dot_product has a size (B * N_sampled * S_small , 1, 1)
      # Let's reorder it a bit
      scores = scores.reshape(B, N_sampled, S_small)

      # For each elements of the sequence, and each elements sampled, it gives 
      # a floating score stating the likelihood of this element being the 
      # true one.
      # Now the classification loss, we need to use the Cross Entropy loss
      # https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html

      # For each time-step of each sequence of the batch 
      # we have N_sampled possible predictions. 
      # Looking at the documentation of torch.nn.CrossEntropyLoss
      # we can see that this loss expect a tensor of size M x C where 
      # - M is the number of elements with a classification score
      # - C is the number of possible classes
      # There are N_sampled candidates for each predictions so
      # C = N_sampled 
      # Each timestep of each sequence of the batch has a prediction so
      # M = B * S_small
      # Thus we need an input vector of size B * S_small, N_sampled
      # To begin, we need to permute the axis
      scores = scores.permute(0, 2, 1) # Now it has size B , S_small, N_sampled

      # Then we can cast it into a 2D tensor
      scores = scores.reshape(B * S_small, N_sampled)

      # Same thing for the labels 
      labels = labels.reshape(B * S_small)

      # Finally we can get the classification loss
      loss_criterion = torch.nn.CrossEntropyLoss()
      loss_k = loss_criterion(scores, labels)
      avg_loss+= loss_k

      # And for the acuracy
      # The prediction for each elements is the sample with the highest score
      # Thus the tensors of all predictions is the tensors of the index of the 
      # maximal score for each time-step of each sequence of the batch
      predictions = torch.argmax(scores, 1)
      acc_k  = (labels == predictions).sum() / (B * S_small)
      avg_acc += acc_k

    # Normalization
    avg_loss = avg_loss / self.K_
    avg_acc = avg_acc / self.K_
      
    return avg_loss , avg_acc

Don't forget to test !

In [8]:
audio = torchaudio.load("/content/train_data/831/130739/831-130739-0048.flac")[0]
audio = audio.view(1, 1, -1)
cpc_criterion = CPCCriterion(N_PREDICTIONS, DIM_CONTEXT, 
                             DIM_ENCODER, N_NEGATIVE_SAMPLE).to(device)
context_output, encoder_output = cpc_model(audio.to(device))
loss, avg = cpc_criterion(encoder_output,context_output)



## Exercise 3: Full training loop !

You have the model, you have the criterion. All you need now are a data loader and an optimizer to run your training loop.

We will use an Adam optimizer:

In [9]:
parameters = list(cpc_criterion.parameters()) + list(cpc_model.parameters())
optimizer = torch.optim.Adam(parameters, lr=LEARNING_RATE)

And as far as the data loader is concerned, we will rely on the data loader provided by the CPC_audio library.

In [10]:
dataset_train = load_dataset('/content/train_data')
dataset_val = load_dataset('/content/val_data')
data_loader_train = dataset_train.getDataLoader(BATCH_SIZE, "speaker", True)
data_loader_val = dataset_train.getDataLoader(BATCH_SIZE, "sequence", False)

162it [00:00, 14774.14it/s]

Saved cache file at /content/train_data/_seqs_cache.txt



1430it [00:00, 700683.96it/s]

Checking length...
Done, elapsed: 0.142 seconds
Scanned 1430 sequences in 0.14 seconds
1 chunks computed
Joining pool





Joined process, elapsed=6.853 secs


169it [00:00, 1200.16it/s]


Saved cache file at /content/val_data/_seqs_cache.txt


316it [00:00, 479696.01it/s]

Checking length...
Done, elapsed: 0.105 seconds
Scanned 316 sequences in 0.10 seconds
1 chunks computed
Joining pool





Joined process, elapsed=2.870 secs


Now that everything is ready, complete and test the ```train_step``` function below which trains the model for one epoch.

In [17]:
def train_step(data_loader,
               cpc_model,
               cpc_criterion,
               optimizer):
  
  avg_loss = 0
  avg_acc = 0
  n_items = 0

  for step, data in enumerate(data_loader):
    x,y = data
    bs = len(x)
    optimizer.zero_grad()
    context_output, encoder_output = cpc_model(x.to(device))
    loss , acc = cpc_criterion(encoder_output, context_output)
    loss.backward()
    n_items+=bs
    avg_loss+=loss.item()*bs
    avg_acc +=acc.item()*bs
  
  avg_loss/=n_items
  avg_acc/=n_items
  return avg_loss, avg_acc

## Exercise 4 : Validation loop

Now complete the validation loop.

In [18]:
def validation_step(data_loader,
                    cpc_model,
                    cpc_criterion):
  
  avg_loss = 0
  avg_acc = 0
  n_items = 0

  for step, data in enumerate(data_loader):
    x,y = data
    bs = len(x)
    context_output, encoder_output = cpc_model(x.to(device))
    loss , acc = cpc_criterion(encoder_output, context_output)
    n_items+=bs
    avg_loss+=loss.item()*bs
    avg_acc+=acc.item()*bs
  
  avg_loss/=n_items
  avg_acc/=n_items
  return avg_loss, avg_acc

## Exercise 5: Run everything

In [19]:
def run(train_loader,
        val_loader,
        cpc_model,
        cpc_criterion,
        optimizer,
        n_epochs):
  
  for epoch in range(n_epochs):

    
    print(f"Running epoch {epoch+1} / {n_epochs}")
    avg_loss_train, avg_acc_train = train_step(train_loader, cpc_model, cpc_criterion, optimizer)
    print("----------------------")
    print(f"Training dataset")
    print(f"- average loss : {avg_loss_train}")
    print(f"- average acuracy : {avg_acc_train}")
    print("----------------------")
    with torch.no_grad():
      cpc_model.eval()
      cpc_criterion.eval()
      avg_loss_val, avg_acc_val = validation_step(val_loader, cpc_model, cpc_criterion)
      print(f"Validation dataset")
      print(f"- average loss : {avg_loss_val}")
      print(f"- average acuracy : {avg_acc_val}")
      print("----------------------")
      print()
      cpc_model.train()
      cpc_criterion.train()

In [20]:
run(data_loader_train, data_loader_val, cpc_model,cpc_criterion,optimizer,1)

Running epoch 1 / 1
----------------------
Training dataset
- average loss : 4.878548990311724
- average acuracy : 0.0
----------------------
Validation dataset
- average loss : 4.878525506558247
- average acuracy : 0.0
----------------------



Once everything is donw, clear the memory.

In [37]:
del dataset_train
del dataset_val
del cpc_model
del context
del encoder

# Part 2 : Fine tuning

## Exercice 1 : Phone separability with aligned phonemes.

One option to evaluate the quality of the features trained with CPC can be to check if they can be used to recognize phonemes. 
To do so, we can fine-tune a pre-trained model using a limited amount of labelled speech data.
We are going to start with a simple evaluation setting where we have the phone labels for each timestep corresponding to a CPC feature.

We will work with a model already pre-trained on English data. As far as the fine-tuning dataset is concerned, we will use a 1h subset of [librispeech-100](http://www.openslr.org/12/). 

In [19]:
!mkdir checkpoint_data
!wget https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_30.pt -P checkpoint_data
!wget https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_logs.json -P checkpoint_data
!wget https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_args.json -P checkpoint_data
!ls checkpoint_data

--2020-06-29 10:11:59--  https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_30.pt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113599715 (108M) [application/octet-stream]
Saving to: ‘checkpoint_data/checkpoint_30.pt’


2020-06-29 10:12:10 (10.5 MB/s) - ‘checkpoint_data/checkpoint_30.pt’ saved [113599715/113599715]

--2020-06-29 10:12:12--  https://dl.fbaipublicfiles.com/librilight/CPC_checkpoints/not_hub/2levels_6k_top_ctc/checkpoint_logs.json
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20786 (20K) [text/plain]
Saving to: ‘c

In [15]:
%cd /content/CPC_audio
from cpc.dataset import parseSeqLabels
from cpc.feature_loader import loadModel

checkpoint_path = 'checkpoint_data/checkpoint_30.pt'
cpc_model, HIDDEN_CONTEXT_MODEL, HIDDEN_ENCODER_MODEL = loadModel([checkpoint_path])
cpc_model = cpc_model.cuda()
label_dict, N_PHONES = parseSeqLabels('/content/converted_aligned_phones.txt')
dataset_train = load_dataset('/content/train_data', file_extension='.flac', phone_label_dict=label_dict)
dataset_val = load_dataset('/content/val_data', file_extension='.flac', phone_label_dict=label_dict)
data_loader_train = dataset_train.getDataLoader(BATCH_SIZE, "speaker", True)
data_loader_val = dataset_val.getDataLoader(BATCH_SIZE, "sequence", False)

/content/CPC_audio
Loading checkpoint checkpoint_data/checkpoint_30.pt
Loading the state dict at checkpoint_data/checkpoint_30.pt


162it [00:00, 2417.18it/s]

Saved cache file at /content/train_data/_seqs_cache.txt





Checking length...


1430it [00:00, 1117126.97it/s]


Done, elapsed: 0.274 seconds
Scanned 1430 sequences in 0.27 seconds
1 chunks computed
Joining pool
Joined process, elapsed=7.480 secs


169it [00:00, 662.54it/s]


Saved cache file at /content/val_data/_seqs_cache.txt
Checking length...


316it [00:00, 60357.94it/s]


Done, elapsed: 0.875 seconds
Scanned 316 sequences in 0.88 seconds
1 chunks computed
Joining pool
Joined process, elapsed=2.434 secs


In [16]:
??cpc_model

Then we will use a simple linear classifier to recognize the phonemes from the features produced by ```cpc_model```. 

### a) Build the phone classifier 

Design a class of linear classifiers, ```PhoneClassifier``` that will take as input a batch of sequences of CPC features and output a score vector for each phoneme

In [20]:
class PhoneClassifier(torch.nn.Module):

  def __init__(self,
               input_dim : int,
               n_phones : int):
    super(PhoneClassifier, self).__init__()
    self.linear = torch.nn.Linear(input_dim, n_phones)
    

  def forward(self, x):
    return self.linear(x)

Our phone classifier will then be:

In [None]:
phone_classifier = PhoneClassifier(HIDDEN_CONTEXT_MODEL, N_PHONES).to(device)

### b - What would be the correct loss criterion for this task ?



In [19]:
loss_criterion = torch.nn.CrossEntropyLoss()

To perform the fine-tuning, we will also need an optimization function.

We will use an [Adam optimizer ](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam).

In [20]:
parameters = list(phone_classifier.parameters()) + list(cpc_model.parameters())
LEARNING_RATE = 2e-4
optimizer = torch.optim.Adam(parameters, lr=LEARNING_RATE)

You might also want to perform this training while freezing the weights of the ```cpc_model```. Indeed, if the pre-training was good enough, then ```cpc_model``` phonemes representation should be linearly separable. In this case the optimizer should be defined like this:

In [24]:
optimizer_frozen = torch.optim.Adam(list(phone_classifier.parameters()), lr=LEARNING_RATE)

### c- Now let's build a training loop. 
Complete the function ```train_one_epoch``` below.



In [58]:
def train_one_epoch(cpc_model, 
                    phone_classifier, 
                    loss_criterion, 
                    data_loader, 
                    optimizer):

  cpc_model.train()
  loss_criterion.train()

  avg_loss = 0
  avg_accuracy = 0
  n_items = 0
  for step, full_data in enumerate(data_loader):
    # Each batch is represented by a Tuple of vectors:
    # sequence of size : N x 1 x T
    # label of size : N x T
    # 
    # With :
    # - N number of sequence in the batch
    # - T size of each sequence
    sequence, label = full_data
    
    

    bs = len(sequence)
    seq_len = label.size(1)
    optimizer.zero_grad()
    context_out, enc_out, _ = cpc_model(sequence.to(device),label.to(device))

    scores = phone_classifier(context_out)

    scores = scores.permute(0,2,1)
    loss = loss_criterion(scores,label.to(device))
    loss.backward()
    optimizer.step()
    avg_loss+=loss.item()*bs
    n_items+=bs
    correct_labels = scores.argmax(1)
    avg_accuracy += ((label==correct_labels.cpu()).float()).mean(1).sum().item()
  avg_loss/=n_items
  avg_accuracy/=n_items
  return avg_loss, avg_accuracy
    

Don't forget to test it !

In [59]:
avg_loss, avg_accuracy = train_one_epoch(cpc_model, phone_classifier, loss_criterion, data_loader_train, optimizer_frozen)

In [60]:
avg_loss, avg_accuracy

(1.3621995804512956, 0.6497440501715266)

### d- Build the validation loop

In [61]:
def validation_step(cpc_model, 
                    phone_classifier, 
                    loss_criterion, 
                    data_loader):
  
  cpc_model.eval()
  phone_classifier.eval()

  avg_loss = 0
  avg_accuracy = 0
  n_items = 0
  with torch.no_grad():
    for step, full_data in enumerate(data_loader):
      # Each batch is represented by a Tuple of vectors:
      # sequence of size : N x 1 x T
      # label of size : N x T
      # 
      # With :
      # - N number of sequence in the batch
      # - T size of each sequence
      sequence, label = full_data
      bs = len(sequence)
      seq_len = label.size(1)
      context_out, enc_out, _ = cpc_model(sequence.to(device),label.to(device))
      scores = phone_classifier(context_out)
      scores = scores.permute(0,2,1)
      loss = loss_criterion(scores,label.to(device))
      avg_loss+=loss.item()*bs
      n_items+=bs
      correct_labels = scores.argmax(1)
      avg_accuracy += ((label==correct_labels.cpu()).float()).mean(1).sum().item()
  avg_loss/=n_items
  avg_accuracy/=n_items
  return avg_loss, avg_accuracy

### e- Run everything

Test this functiion with both ```optimizer``` and ```optimizer_frozen```.

In [65]:
def run(cpc_model, 
        phone_classifier, 
        loss_criterion, 
        data_loader_train, 
        data_loader_val, 
        optimizer,
        n_epoch):

  for epoch in range(n_epoch):

    print(f"Running epoch {epoch + 1} / {n_epoch}")
    loss_train, acc_train = train_one_epoch(cpc_model, phone_classifier, loss_criterion, data_loader_train, optimizer)
    print("-------------------")
    print(f"Training dataset :")
    print(f"Average loss : {loss_train}. Average accuracy {acc_train}")

    print("-------------------")
    print("Validation dataset")
    loss_val, acc_val = validation_step(cpc_model, phone_classifier, loss_criterion, data_loader_val)
    print(f"Average loss : {loss_val}. Average accuracy {acc_val}")
    print("-------------------")
    print()

In [67]:
run(cpc_model,phone_classifier,loss_criterion,data_loader_train,data_loader_val,optimizer_frozen,n_epoch=10)

Running epoch 1 / 10
-------------------
Training dataset :
Average loss : 1.0219527069604895. Average accuracy 0.7057498257933105
-------------------
Validation dataset
Average loss : 1.0688188383544701. Average accuracy 0.6966259057971015
-------------------

Running epoch 2 / 10
-------------------
Training dataset :
Average loss : 0.9813667901793229. Average accuracy 0.7150101397226987
-------------------
Validation dataset
Average loss : 1.0425163616304811. Average accuracy 0.7018257472826087
-------------------

Running epoch 3 / 10
-------------------
Training dataset :
Average loss : 0.9586697046657233. Average accuracy 0.7197957984919954
-------------------
Validation dataset
Average loss : 1.029444672577623. Average accuracy 0.7042572463768116
-------------------

Running epoch 4 / 10
-------------------
Training dataset :
Average loss : 0.9506696053679975. Average accuracy 0.7210777810534591
-------------------
Validation dataset
Average loss : 0.9972989958265553. Average ac

## Exercise 2 : Phone separability without alignment (PER)

Aligned data are very practical, but un real life they are rarely available. That's why in this excercise we will consider a fine-tuning with non-aligned phonemes.

The model, the optimizer and the phone classifier will stay the same. However, we will replace our phone criterion with a [CTC loss](https://pytorch.org/docs/master/generated/torch.nn.CTCLoss.html). 

In [16]:
loss_ctc = torch.nn.CTCLoss()

Besides, we will use a siglthy different dataset class.

In [17]:
%cd /content/CPC_audio
from cpc.eval.common_voices_eval import SingleSequenceDataset, parseSeqLabels, findAllSeqs
path_train_data_per = '/content/per_data/pack_master/1h'
path_val_data_per = '/content/per_data/pack_master/10min'
path_phone_data_per = '/content/per_data/pack_master/10h_phones.txt'
BATCH_SIZE=8

phone_labels, N_PHONES = parseSeqLabels(path_phone_data_per)
data_train_per, _ = findAllSeqs(path_train_data_per, extension='.flac')
dataset_train_non_aligned = SingleSequenceDataset(path_train_data_per, data_train_per, phone_labels)
data_loader_train = torch.utils.data.DataLoader(dataset_train_non_aligned, batch_size=BATCH_SIZE,
                                                shuffle=True)

data_val_per, _ = findAllSeqs(path_val_data_per, extension='.flac')
dataset_val_non_aligned = SingleSequenceDataset(path_val_data_per, data_val_per, phone_labels)
data_loader_val = torch.utils.data.DataLoader(dataset_val_non_aligned, batch_size=BATCH_SIZE,
                                              shuffle=True)

67it [00:00, 16115.29it/s]

/content/CPC_audio
Saved cache file at /content/per_data/pack_master/1h/_seqs_cache.txt



7it [00:00, 1867.22it/s]

Loaded 287 sequences in 4.05 seconds
maxSizeSeq : 273359
maxSizePhone : 207
minSizePhone : 17
Total size dataset 1.0406152430555555 hours
Saved cache file at /content/per_data/pack_master/10min/_seqs_cache.txt





Loaded 212 sequences in 2.49 seconds
maxSizeSeq : 273760
maxSizePhone : 188
minSizePhone : 17
Total size dataset 0.73 hours


### a- Training

Since the phonemes are not aligned, there is no simple direct way to get the classification acuracy of a model. Write and test the three functions ```train_one_epoch_ctc```, ```validation_step_ctc``` and ```run_ctc``` as before but without considering the average acuracy of the model. 

In [21]:
from cpc.feature_loader import loadModel

checkpoint_path = 'checkpoint_data/checkpoint_30.pt'
cpc_model, HIDDEN_CONTEXT_MODEL, HIDDEN_ENCODER_MODEL = loadModel([checkpoint_path])
cpc_model = cpc_model.cuda()
phone_classifier = PhoneClassifier(HIDDEN_CONTEXT_MODEL, N_PHONES).to(device)

Loading checkpoint checkpoint_data/checkpoint_30.pt
Loading the state dict at checkpoint_data/checkpoint_30.pt


In [19]:
parameters = list(phone_classifier.parameters()) + list(cpc_model.parameters())
LEARNING_RATE = 2e-4
optimizer = torch.optim.Adam(parameters, lr=LEARNING_RATE)

optimizer_frozen = torch.optim.Adam(list(phone_classifier.parameters()), lr=LEARNING_RATE)

In [34]:
import torch.nn.functional as F

def train_one_epoch_ctc(cpc_model, 
                        phone_classifier, 
                        loss_criterion, 
                        data_loader, 
                        optimizer):
  
  cpc_model.train()
  loss_criterion.train()

  avg_loss = 0
  avg_accuracy = 0
  n_items = 0
  for step, full_data in enumerate(data_loader):

    x, x_len, y, y_len = full_data

    x_batch_len = x.shape[-1]
    x, y = x.to(device), y.to(device)

    bs=x.size(0)
    optimizer.zero_grad()
    context_out, enc_out, _ = cpc_model(x.to(device),y.to(device))
  
    scores = phone_classifier(context_out)
    scores = scores.permute(1,0,2)
    scores = F.log_softmax(scores,2)
    yhat_len = torch.tensor([int(scores.shape[0]*x_len[i]/x_batch_len) for i in range(scores.shape[1])]) # this is an approximation, should be good enough

    loss = loss_criterion(scores,y.to(device),yhat_len,y_len)
    loss.backward()
    optimizer.step()
    avg_loss+=loss.item()*bs
    n_items+=bs
  avg_loss/=n_items
  return avg_loss

def validation_step(cpc_model, 
                    phone_classifier, 
                    loss_criterion, 
                    data_loader):

  cpc_model.eval()
  phone_classifier.eval()
  avg_loss = 0
  avg_accuracy = 0
  n_items = 0
  with torch.no_grad():
    for step, full_data in enumerate(data_loader):

      x, x_len, y, y_len = full_data

      x_batch_len = x.shape[-1]
      x, y = x.to(device), y.to(device)

      bs=x.size(0)
      context_out, enc_out, _ = cpc_model(x.to(device),y.to(device))
    
      scores = phone_classifier(context_out)
      scores = scores.permute(1,0,2)
      scores = F.log_softmax(scores,2)
      yhat_len = torch.tensor([int(scores.shape[0]*x_len[i]/x_batch_len) for i in range(scores.shape[1])]) # this is an approximation, should be good enough

      loss = loss_criterion(scores,y.to(device),yhat_len,y_len)
      avg_loss+=loss.item()*bs
      n_items+=bs
  avg_loss/=n_items

  return avg_loss

def run_ctc(cpc_model, 
            phone_classifier, 
            loss_criterion, 
            data_loader_train, 
            data_loader_val, 
            optimizer,
            n_epoch):
  for epoch in range(n_epoch):

    print(f"Running epoch {epoch + 1} / {n_epoch}")
    loss_train = train_one_epoch_ctc(cpc_model, phone_classifier, loss_criterion, data_loader_train, optimizer)
    print("-------------------")
    print(f"Training dataset :")
    print(f"Average loss : {loss_train}.")

    print("-------------------")
    print("Validation dataset")
    loss_val = validation_step(cpc_model, phone_classifier, loss_criterion, data_loader_val)
    print(f"Average loss : {loss_val}")
    print("-------------------")
    print()

In [40]:
run_ctc(cpc_model,phone_classifier,loss_ctc,data_loader_train,data_loader_val,optimizer_frozen,n_epoch=10)

Running epoch 1 / 10
-------------------
Training dataset :
Average loss : 32.44543953208657.
-------------------
Validation dataset
Average loss : 32.01081585093132
-------------------

Running epoch 2 / 10
-------------------
Training dataset :
Average loss : 30.99022026328774.
-------------------
Validation dataset
Average loss : 30.300324444522225
-------------------

Running epoch 3 / 10
-------------------
Training dataset :
Average loss : 29.319565432888645.
-------------------
Validation dataset
Average loss : 28.464903420181635
-------------------

Running epoch 4 / 10
-------------------
Training dataset :
Average loss : 27.567655403297262.
-------------------
Validation dataset
Average loss : 26.642856191119876
-------------------

Running epoch 5 / 10
-------------------
Training dataset :
Average loss : 25.832834390493538.
-------------------
Validation dataset
Average loss : 24.82515669546986
-------------------

Running epoch 6 / 10
-------------------
Training dataset :

### b- Evaluation: the Phone Error Rate (PER)

In order to compute the similarity between two sequences, we can use the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). This distance estimates the minimum number of insertion, deletion and addition to move from one sequence to another. If we normalize this distance by the number of characters in the reference sequence we get the Phone Error Rate (PER).

This value can be interpreted as :
\\[  PER = \frac{S + D + I}{N} \\]

Where:


*   N is the number of characters in the reference
*   S is the number of substitutiion
*   I in the number of insertion
*   D in the number of deletion

For the best possible alignment of the two sequences.



In [15]:
import numpy as np

def get_PER_sequence(ref_seq, target_seq):

  # re = g.split()
  # h = h.split()
  n = len(ref_seq)
  m = len(target_seq)

  D = np.zeros((n+1,m+1))
  for i in range(1,n+1):
    D[i,0] = D[i-1,0]+1
  for j in range(1,m+1):
    D[0,j] = D[0,j-1]+1
  
  ### TODO compute the alignment

  for i in range(1,n+1):
    for j in range(1,m+1):
      D[i,j] = min(
          D[i-1,j]+1,
          D[i-1,j-1]+1,
          D[i,j-1]+1,
          D[i-1,j-1]+ 0 if ref_seq[i-1]==target_seq[j-1] else float("inf")
      )
  return D[n,m]/len(ref_seq)
  

  #return PER

You can test your function below:

In [45]:
ref_seq = [0, 1, 1, 2, 0, 2, 2]
pred_seq = [1, 1, 2, 2, 0, 0]

expected_PER = 4. / 7.
print(get_PER_sequence(ref_seq, pred_seq) == expected_PER)

True


## c- Evaluating the PER of your model on the test dataset

Evaluate the PER on the validation dataset. Please notice that you should usually use a separate dataset, called the dev dataset, to perform this operation. However for the sake of simplicity we will work with validation data in this exercise.

In [16]:
import progressbar
from multiprocessing import Pool

def cut_data(seq, sizeSeq):
    maxSeq = sizeSeq.max()
    return seq[:, :maxSeq]


def prepare_data(data):
    seq, sizeSeq, phone, sizePhone = data
    seq = seq.cuda()
    phone = phone.cuda()
    sizeSeq = sizeSeq.cuda().view(-1)
    sizePhone = sizePhone.cuda().view(-1)

    seq = cut_data(seq.permute(0, 2, 1), sizeSeq).permute(0, 2, 1)
    return seq, sizeSeq, phone, sizePhone


def get_per(test_dataloader,
            cpc_model,
            phone_classifier):

  downsampling_factor = 160
  cpc_model.eval()
  phone_classifier.eval()

  avgPER = 0
  nItems = 0 

  print("Starting the PER computation through beam search")
  bar = progressbar.ProgressBar(maxval=len(test_dataloader))
  bar.start()

  for index, data in enumerate(test_dataloader):

    bar.update(index)

    with torch.no_grad():
      
        seq, sizeSeq, phone, sizePhone = prepare_data(data)
        c_feature, _, _ = cpc_model(seq.to(device),phone.to(device))
        sizeSeq = sizeSeq / downsampling_factor
        predictions = torch.nn.functional.softmax(
        phone_classifier(c_feature), dim=2).cpu()
        phone = phone.cpu()
        sizeSeq = sizeSeq.cpu()
        sizePhone = sizePhone.cpu()

        bs = c_feature.size(0)
        data_per = [(predictions[b].argmax(1),  phone[b]) for b in range(bs)]
        # data_per = [(predictions[b], sizeSeq[b], phone[b], sizePhone[b],
        #               "criterion.module.BLANK_LABEL") for b in range(bs)]

        with Pool(bs) as p:
            poolData = p.starmap(get_PER_sequence, data_per)
        avgPER += sum([x for x in poolData])
        nItems += len(poolData)

  bar.finish()

  avgPER /= nItems

  print(f"Average PER {avgPER}")
  return avgPER


In [91]:
get_per(data_loader_val,cpc_model,phone_classifier)

N/A% (0 of 27) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

Starting the PER computation through beam search


100% (27 of 27) |########################| Elapsed Time: 0:10:44 Time:  0:10:44


Average PER 0.9509821691500522


0.9509821691500522

## Exercice 3 : Character error rate (CER) 

The Character Error Rate (CER) is an evaluation metric similar to the PER but with characters insterad of phonemes. Using the following data, run the functions you defined previously to estimate the CER of your model after fine-tuning.

In [23]:
# Load a dataset labelled with the letters of each sequence.
%cd /content/CPC_audio
from cpc.eval.common_voices_eval import SingleSequenceDataset, parseSeqLabels, findAllSeqs
path_train_data_cer = '/content/per_data/pack_master/1h'
path_val_data_cer = '/content/per_data/pack_master/10min'
path_letter_data_cer = '/content/per_data/pack_master/chars.txt'
BATCH_SIZE=8

letters_labels, N_LETTERS = parseSeqLabels(path_letter_data_cer)
data_train_cer, _ = findAllSeqs(path_train_data_cer, extension='.flac')
dataset_train_non_aligned = SingleSequenceDataset(path_train_data_cer, data_train_cer, letters_labels)


data_val_cer, _ = findAllSeqs(path_val_data_cer, extension='.flac')
dataset_val_non_aligned = SingleSequenceDataset(path_val_data_cer, data_val_cer, letters_labels)


# The data loader will generate a tuple of tensors data, labels for each batch
# data : size N x T1 x 1 : the audio sequence
# label : size N x T2 the sequence of letters corresponding to the audio data
# IMPORTANT NOTE: just like the PER the CER is computed with non-aligned phone data.
data_loader_train_letters = torch.utils.data.DataLoader(dataset_train_non_aligned, batch_size=BATCH_SIZE,
                                                shuffle=True)
data_loader_val_letters = torch.utils.data.DataLoader(dataset_val_non_aligned, batch_size=BATCH_SIZE,
                                              shuffle=True)

67it [00:00, 10784.76it/s]

/content/CPC_audio
Saved cache file at /content/per_data/pack_master/1h/_seqs_cache.txt



7it [00:00, 1741.82it/s]

Loaded 287 sequences in 2.68 seconds
maxSizeSeq : 273359
maxSizePhone : 300
minSizePhone : 18
Total size dataset 1.0406152430555555 hours
Saved cache file at /content/per_data/pack_master/10min/_seqs_cache.txt





Loaded 212 sequences in 2.29 seconds
maxSizeSeq : 273760
maxSizePhone : 273
minSizePhone : 29
Total size dataset 0.73 hours


In [24]:
from cpc.feature_loader import loadModel

checkpoint_path = 'checkpoint_data/checkpoint_30.pt'
cpc_model, HIDDEN_CONTEXT_MODEL, HIDDEN_ENCODER_MODEL = loadModel([checkpoint_path])
cpc_model = cpc_model.cuda()
character_classifier = PhoneClassifier(HIDDEN_CONTEXT_MODEL, N_LETTERS).to(device)

Loading checkpoint checkpoint_data/checkpoint_30.pt
Loading the state dict at checkpoint_data/checkpoint_30.pt


In [27]:
parameters = list(character_classifier.parameters()) + list(cpc_model.parameters())
LEARNING_RATE = 2e-4
optimizer = torch.optim.Adam(parameters, lr=LEARNING_RATE)

optimizer_frozen = torch.optim.Adam(list(character_classifier.parameters()), lr=LEARNING_RATE)

In [25]:
loss_ctc = torch.nn.CTCLoss()

In [35]:
run_ctc(cpc_model,character_classifier,loss_ctc,data_loader_train_letters,data_loader_val_letters,optimizer_frozen,n_epoch=10)

Running epoch 1 / 10
-------------------
Training dataset :
Average loss : 17.15224294729166.
-------------------
Validation dataset
Average loss : 16.621312869103598
-------------------

Running epoch 2 / 10
-------------------
Training dataset :
Average loss : 15.531890602378578.
-------------------
Validation dataset
Average loss : 14.840831449246519
-------------------

Running epoch 3 / 10
-------------------
Training dataset :
Average loss : 13.80899079863008.
-------------------
Validation dataset
Average loss : 12.998893340052021
-------------------

Running epoch 4 / 10
-------------------
Training dataset :
Average loss : 12.093861906678526.
-------------------
Validation dataset
Average loss : 11.24841435599666
-------------------

Running epoch 5 / 10
-------------------
Training dataset :
Average loss : 10.522326436076131.
-------------------
Validation dataset
Average loss : 9.72009142654202
-------------------

Running epoch 6 / 10
-------------------
Training dataset :


In [36]:
get_per(data_loader_val_letters,cpc_model,character_classifier)

N/A% (0 of 27) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--

Starting the PER computation through beam search


100% (27 of 27) |########################| Elapsed Time: 0:17:48 Time:  0:17:48


Average PER 0.9113886992183796


0.9113886992183796