### 1. Obtaining decoder states from Whisper

In [1]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en")
audio, samplerate = librosa.load('audio.flac', sr=16000)
input_features = processor(
    audio, sampling_rate=samplerate, return_tensors='pt'
).input_features
output = model.generate(input_features, 
    output_hidden_states=True, 
    #output_scores=True,
    return_dict_in_generate=True)

In [None]:
import torch
sm = torch.nn.Softmax(dim=-1)
print(len(output.encoder_hidden_states))
print(len(output.decoder_hidden_states))
print(torch.cat([t[0] for t in output.decoder_hidden_states], dim=1))
#print(sm(output.scores[0]))
#output.scores[0].shape

7
10
tensor([[[-0.0091, -0.0120,  0.0026,  ...,  0.0362, -0.0025, -0.0236],
         [ 0.0030,  0.0013,  0.0040,  ...,  0.0098, -0.0004,  0.0036],
         [-0.0014, -0.0302,  0.0142,  ..., -0.0119, -0.0084, -0.0049],
         ...,
         [ 0.0043, -0.0053,  0.0097,  ...,  0.0007, -0.0099, -0.0059],
         [-0.0196, -0.0066, -0.0132,  ..., -0.0010,  0.0002, -0.0069],
         [-0.0015, -0.0299,  0.0106,  ...,  0.0053, -0.0031,  0.0013]]])


In [None]:
import torch.nn as nn
lol = nn.Linear(51864, 768)

### 2. How GlowTTS's alignment works (training)

In [80]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import monotonic_align
import math

# This function is explained in addendum 2.A
def sequence_mask(length, max_length=None):
  if max_length is None:
    max_length = length.max()
  x = torch.arange(max_length, dtype=length.dtype, device=length.device)
  return x.unsqueeze(0) < length.unsqueeze(1)

def generate_path(duration, mask):
  """
  duration: [b, t_x]
  mask: [b, t_x, t_y]
  """
  device = duration.device
  
  b, t_x, t_y = mask.shape
  cum_duration = torch.cumsum(duration, 1)
  path = torch.zeros(b, t_x, t_y, dtype=mask.dtype).to(device=device)
  
  cum_duration_flat = cum_duration.view(b * t_x)
  path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
  path = path.view(b, t_x, t_y)
  path = path * ~F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:,:-1]
  path = path * mask
  return path

def convert_pad_shape(pad_shape):
  l = pad_shape[::-1]
  pad_shape = [item for sublist in l for item in sublist]
  return pad_shape

Suppose we have a set of features `x` where `x.shape == (b=2,n=4,c=5)`, 
where `b=2` is the minibatch size, `n=4` is the sequence length, and `c=5` is the feature dimension.

We wish to align this against another, longer set of features `z`, where `z.shape == (b=2,n=6,c=7)`.

To distinguish between the sequence length of `x` and `z` we will refer to them as `n_x` and `n_z` respectively. We will do the same for the feature dimension `c_x` and `c_z`.

* Note that the feature dimension needs to stay the same; we can only perform alignment on a single dimension.

* Also note: the alignment method here is formulated only such that one x-feature can be aligned to one or more z-features (i.e. there are assumed to be more z-features than x-features).

In [81]:
x = torch.Tensor([[ # (The numbers are not meaningful.)
    [0.0, 1.0, 1.0, 1.0, 0.0],
    [1.0, 0.5, 0.2, 0.3, 0.6],
    [0.4, 0.2, 0.9, 0.1, 0.1],
    [0.4, 0.6, 0.7, 0.8, 0.1],
]]) # Each row represents an individual feature; there are 4 features in x.

z = torch.Tensor([[
    [0.0, 1.0, 1.0, 1.0, 0.0, 0.2, 0.1],
    [1.0, 0.5, 0.2, 0.3, 0.6, 0.3, 0.1],
    [0.4, 0.2, 0.9, 0.1, 0.1, 0.4, 0.1],
    [0.4, 0.6, 0.7, 0.8, 0.1, 0.5, 0.1],
    [0.5, 0.7, 0.2, 0.1, 0.3, 0.1, 0.1],
    [0.9, 0.8, 0.6, 0.5, 0.3, 0.3, 0.1],
]]) # And there are 6 features in z.

x = torch.cat((x,x), dim=0)
z = torch.cat((z,z), dim=0)

n_x = x.shape[1]
n_z = z.shape[1]
c_x = x.shape[2]
c_z = z.shape[2]
print(x.shape)
print(z.shape)

torch.Size([2, 4, 5])
torch.Size([2, 6, 7])


The GlowTTS monotonic alignment method relies on modeling each feature in `z` as having been sampled from a Gaussian (aka normal) distribution, with one distribution corresponding to each feature in the input `x` (multiple `z`-features may have been sampled from the same distribution).

*Remember that a Gaussian distribution can be entirely parameterized by its mean (mu) and standard deviation (sigma).*

The GlowTTS text encoder intakes the `n_x` features from x and spits out `n_x` Gaussian distributions; that is, each individual x feature will turn into a pair of mu and sigma.

GlowTTS labels the means as `x_m`.

*For numerical stability, GlowTTS actually generates the natural logarithm of the standard deviation, which it labels `x_logs`, so we'll use that.*
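
To make the log-standard-deviation parameterization concrete, here is a minimal sketch. The values and the names `mean`, `logs`, `std`, and `value` are illustrative assumptions, not taken from GlowTTS. It shows that exponentiating any real-valued `logs` gives a strictly positive standard deviation, and that the Gaussian log-PDF can be written directly in terms of `logs`.

```python
import math
import torch

mean = torch.tensor([0.0, 1.0, -2.0])
logs = torch.tensor([-3.0, 0.0, 2.0])  # unconstrained values, as a network might output (illustrative)
std = torch.exp(logs)                  # exponentiation guarantees a strictly positive std

value = torch.tensor([0.5, 0.5, 0.5])

# Gaussian log-PDF expressed with logs instead of std:
#   -0.5*log(2*pi) - logs - 0.5 * exp(-2*logs) * (value - mean)^2
log_pdf_manual = (
    -0.5 * math.log(2 * math.pi)
    - logs
    - 0.5 * torch.exp(-2 * logs) * (value - mean) ** 2
)

# Cross-check against PyTorch's built-in Normal distribution.
log_pdf_torch = torch.distributions.Normal(mean, std).log_prob(value)

print(std)
print(torch.allclose(log_pdf_manual, log_pdf_torch, atol=1e-4))
```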

In order to sample `z` from these statistics, they must have the same feature dimension `c_z` as `z` itself. 

So `x_m` and `x_logs` both have shape `(b = 2, n = n_x = 4, c = c_z = 7)`.

In [82]:
# We simulate the "text encoder" with two linear layers for generating the correct feature dimensions.

text_encoder_mean = nn.Linear(c_x, c_z)
text_encoder_logs = nn.Linear(c_x, c_z)

#with torch.no_grad():
# I have enabled gradients here so we can see where gradients propagate in the calculation.
x_m = text_encoder_mean(x)
x_logs = text_encoder_logs(x)

print('mean:', x_m)
print('std:', x_logs)  # note: x_logs holds *log* standard deviations, not standard deviations
print('x_m.shape: ',x_m.shape)
print('x_logs.shape: ',x_logs.shape)

# Needed for following step
x_m = x_m.transpose(1,2)
x_logs = x_logs.transpose(1,2)
z = z.transpose(1,2)

mean: tensor([[[ 0.3989, -0.3484, -0.7459, -0.3408, -0.9484, -0.5320, -0.4384],
         [ 0.1561,  0.2102, -0.7977,  0.5547, -0.1628,  0.4119, -0.8767],
         [-0.0178,  0.2536, -0.7373,  0.0517, -0.3926, -0.0668, -0.2265],
         [ 0.3413, -0.1176, -0.7862,  0.0723, -0.5547, -0.2132, -0.5495]],

        [[ 0.3989, -0.3484, -0.7459, -0.3408, -0.9484, -0.5320, -0.4384],
         [ 0.1561,  0.2102, -0.7977,  0.5547, -0.1628,  0.4119, -0.8767],
         [-0.0178,  0.2536, -0.7373,  0.0517, -0.3926, -0.0668, -0.2265],
         [ 0.3413, -0.1176, -0.7862,  0.0723, -0.5547, -0.2132, -0.5495]]],
       grad_fn=<ViewBackward0>)
std: tensor([[[ 1.1922, -0.4618, -0.1391,  0.1636,  0.5802,  0.5513, -0.1889],
         [ 1.0124, -0.2609, -0.4649, -0.0640,  0.3907,  0.7953, -0.5306],
         [ 0.8080, -0.1062, -0.3231,  0.0330,  0.1573,  0.6409, -0.3014],
         [ 1.0397, -0.3062, -0.1226,  0.0615,  0.3838,  0.6420, -0.4108]],

        [[ 1.1922, -0.4618, -0.1391,  0.1636,  0.5802,  0.5513,

We now have obtained `n_x` Gaussian distributions. We somehow need to map these into our `n_z` features.

To do this, we produce a "likelihood score matrix". For each feature in `z`, we evaluate the Gaussian probability density function under ALL of the distributions obtained from the text encoder, that is, under ALL the pairs of mu and sigma. Each cell in the matrix represents the likelihood of that cell's `z` feature (its column) under the distribution given by the `x_m` and `x_logs` for that cell's row.

Since there are `n_x` Gaussian distributions and `n_z` z-features, this results in an `n_x` by `n_z` likelihood score matrix.

(Actually, it calculates the log PDF instead of the PDF directly--but again that's not very important.)

In [83]:
# The below code is a bit convoluted, but just trust that it calculates a log Gaussian PDF.
x_s_sq_r = torch.exp(-2 * x_logs) # [b, d, t]
print(x_s_sq_r.shape) # [b, d, t]
logp1 = torch.sum(-0.5 * math.log(2 * math.pi) - x_logs, [1]).unsqueeze(-1) # [b, t, 1]
print(logp1.shape)
logp2 = torch.matmul(x_s_sq_r.transpose(1,2), -0.5 * (z ** 2)) # [b, t, d] x [b, d, t'] = [b, t, t']
print(logp2.shape)
logp3 = torch.matmul((x_m * x_s_sq_r).transpose(1,2), z) # [b, t, d] x [b, d, t'] = [b, t, t']
print(logp3.shape)
logp4 = torch.sum(-0.5 * (x_m ** 2) * x_s_sq_r, [1]).unsqueeze(-1) # [b, t, 1]
print(logp4.shape)
logp = logp1 + logp2 + logp3 + logp4 # [b, t, t']
print(logp)
print(logp.shape)

torch.Size([2, 7, 4])
torch.Size([2, 4, 1])
torch.Size([2, 4, 6])
torch.Size([2, 4, 6])
torch.Size([2, 4, 1])
tensor([[[-13.5290, -10.4938, -10.8956, -11.6728, -10.6972, -11.8238],
         [-13.4338, -10.2392, -12.4779, -11.7136, -10.3365, -11.5456],
         [-11.1495,  -8.8243, -10.1350,  -9.8962,  -8.6065,  -9.7008],
         [-11.8729,  -9.5688, -10.2613, -10.4900,  -9.6226, -10.5103]],

        [[-13.5290, -10.4938, -10.8956, -11.6728, -10.6972, -11.8238],
         [-13.4338, -10.2392, -12.4779, -11.7136, -10.3365, -11.5456],
         [-11.1495,  -8.8243, -10.1350,  -9.8962,  -8.6065,  -9.7008],
         [-11.8729,  -9.5688, -10.2613, -10.4900,  -9.6226, -10.5103]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 6])
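
Rather than taking the four-term expansion on trust, we can check it against a direct evaluation of the Gaussian log-PDF. The sketch below uses random stand-in tensors (not the notebook's values) and `torch.distributions.Normal` as the reference. The names `logp_expanded` and `logp_direct` are illustrative.

```python
import math
import torch

torch.manual_seed(0)
b, d, t_x, t_z = 2, 7, 4, 6
x_m = torch.randn(b, d, t_x)           # means, [b, d, t]
x_logs = 0.1 * torch.randn(b, d, t_x)  # log standard deviations, [b, d, t]
z = torch.randn(b, d, t_z)             # target features, [b, d, t']

# Four-term expansion, as in the cell above.
x_s_sq_r = torch.exp(-2 * x_logs)
logp1 = torch.sum(-0.5 * math.log(2 * math.pi) - x_logs, [1]).unsqueeze(-1)
logp2 = torch.matmul(x_s_sq_r.transpose(1, 2), -0.5 * (z ** 2))
logp3 = torch.matmul((x_m * x_s_sq_r).transpose(1, 2), z)
logp4 = torch.sum(-0.5 * (x_m ** 2) * x_s_sq_r, [1]).unsqueeze(-1)
logp_expanded = logp1 + logp2 + logp3 + logp4  # [b, t, t']

# Direct evaluation: broadcast every distribution against every z-feature,
# then sum the per-channel log-densities over the feature dimension d.
normal = torch.distributions.Normal(
    x_m.unsqueeze(-1), torch.exp(x_logs).unsqueeze(-1)  # [b, d, t, 1]
)
logp_direct = normal.log_prob(z.unsqueeze(2)).sum(dim=1)  # [b, t, t']

print(logp_expanded.shape, logp_direct.shape)
print((logp_expanded - logp_direct).abs().max())
```

The expansion exists because it turns the pairwise computation into two batched matrix multiplications, avoiding the intermediate `[b, d, t, t']` tensor that the direct version materializes.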


Next, we use `monotonic_align.maximum_path` to find a monotonic maximum-sum path through the likelihoods, using dynamic programming (specifically the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm), which goes by many other names). This path follows a few constraints:

1. Monotonicity -- each z-feature must be assigned either the same Gaussian distribution as the previous z-feature or the distribution immediately following it. In other words, the model is only allowed to read "left to right". It's not allowed to skip input features or repeat them later, and it must always begin with the first input feature and end on the last one.
2. Maximum path score -- among all monotonic paths, this is the one whose log-likelihoods, according to the text encoder's generated distributions, sum to the maximum.
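
The two constraints above can be sketched as a small dynamic program. This is a plain NumPy illustration for a single unbatched, unmasked matrix, not the Cython code inside `monotonic_align`. The function name `maximum_path_reference` is hypothetical. The example input is the first batch item of the `logp` matrix printed earlier.

```python
import numpy as np

def maximum_path_reference(logp):
    """Return a 0/1 monotonic alignment for logp of shape [t_x, t_z] (t_x <= t_z)."""
    t_x, t_z = logp.shape
    assert t_x <= t_z, "each x-feature needs at least one z-feature"

    # q[i, j] = best total score of a monotonic path from (0, 0) to (i, j).
    q = np.full((t_x, t_z), -np.inf)
    q[0, 0] = logp[0, 0]
    for j in range(1, t_z):
        for i in range(min(j + 1, t_x)):
            stay = q[i, j - 1]                               # same distribution as before
            advance = q[i - 1, j - 1] if i > 0 else -np.inf  # move on to the next one
            q[i, j] = logp[i, j] + max(stay, advance)

    # Backtrack from the last x-feature / last z-feature to (0, 0).
    path = np.zeros((t_x, t_z), dtype=np.int64)
    i = t_x - 1
    for j in range(t_z - 1, -1, -1):
        path[i, j] = 1
        if i > 0 and (i == j or q[i - 1, j - 1] > q[i, j - 1]):
            i -= 1
    return path

logp = np.array([
    [-13.5290, -10.4938, -10.8956, -11.6728, -10.6972, -11.8238],
    [-13.4338, -10.2392, -12.4779, -11.7136, -10.3365, -11.5456],
    [-11.1495,  -8.8243, -10.1350,  -9.8962,  -8.6065,  -9.7008],
    [-11.8729,  -9.5688, -10.2613, -10.4900,  -9.6226, -10.5103],
])
path = maximum_path_reference(logp)
print(path)
```

Every column of the result contains exactly one 1 (each z-feature gets one distribution), and the row index never decreases from left to right.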

In [84]:
# This masking is explained in the addendum 2.A
x_mask = torch.unsqueeze(sequence_mask(torch.Tensor([4, 4])), 1)
z_mask = torch.unsqueeze(sequence_mask(torch.Tensor([6, 6])), 1)
attn_mask = torch.unsqueeze(x_mask, -1) * torch.unsqueeze(z_mask, 2)
print(x_mask.shape)
print(z_mask.shape)
print(attn_mask.shape)

attn = monotonic_align.maximum_path(logp, attn_mask.squeeze(1)).unsqueeze(1).detach()
print(attn)
print(attn.shape)

torch.Size([2, 1, 4])
torch.Size([2, 1, 6])
torch.Size([2, 1, 4, 6])
tensor([[[[1., 0., 0., 0., 0., 0.],
          [0., 1., 0., 0., 0., 0.],
          [0., 0., 1., 1., 1., 0.],
          [0., 0., 0., 0., 0., 1.]]],


        [[[1., 0., 0., 0., 0., 0.],
          [0., 1., 0., 0., 0., 0.],
          [0., 0., 1., 1., 1., 0.],
          [0., 0., 0., 0., 0., 1.]]]])
torch.Size([2, 1, 4, 6])


Each row index corresponds to a Gaussian distribution (which in turn corresponds to an x-feature), and each column index corresponds to a z-feature.

Now that we've obtained our maximum path, all that's left to do is to use the alignment to sub in the distribution assigned to each z-feature, giving a properly sized set of distributions (one per column of the output matrix) from which latents can be sampled. This can be done with a simple matrix multiplication.

In [85]:
z_m = torch.matmul(attn.squeeze(1).transpose(1, 2), x_m.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']
z_logs = torch.matmul(attn.squeeze(1).transpose(1, 2), x_logs.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']
print(z_m)
print(z_logs)

tensor([[[ 0.3989,  0.1561, -0.0178, -0.0178, -0.0178,  0.3413],
         [-0.3484,  0.2102,  0.2536,  0.2536,  0.2536, -0.1176],
         [-0.7459, -0.7977, -0.7373, -0.7373, -0.7373, -0.7862],
         [-0.3408,  0.5547,  0.0517,  0.0517,  0.0517,  0.0723],
         [-0.9484, -0.1628, -0.3926, -0.3926, -0.3926, -0.5547],
         [-0.5320,  0.4119, -0.0668, -0.0668, -0.0668, -0.2132],
         [-0.4384, -0.8767, -0.2265, -0.2265, -0.2265, -0.5495]],

        [[ 0.3989,  0.1561, -0.0178, -0.0178, -0.0178,  0.3413],
         [-0.3484,  0.2102,  0.2536,  0.2536,  0.2536, -0.1176],
         [-0.7459, -0.7977, -0.7373, -0.7373, -0.7373, -0.7862],
         [-0.3408,  0.5547,  0.0517,  0.0517,  0.0517,  0.0723],
         [-0.9484, -0.1628, -0.3926, -0.3926, -0.3926, -0.5547],
         [-0.5320,  0.4119, -0.0668, -0.0668, -0.0668, -0.2132],
         [-0.4384, -0.8767, -0.2265, -0.2265, -0.2265, -0.5495]]],
       grad_fn=<TransposeBackward0>)
tensor([[[ 1.1922,  1.0124,  0.8080,  0.8080,  0.

Notice that for z-features that form a straight row of "1"s in the alignment, the corresponding distributions are the same.
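
Another way to see this: multiplying by a 0/1 alignment matrix just repeats each distribution's parameters once for every z-feature assigned to it. The sketch below uses a single unbatched example with random stand-in means and compares the matrix multiplication against `torch.repeat_interleave`. The variable names are illustrative.

```python
import torch

torch.manual_seed(0)

# Alignment of shape [t_x, t_z], the same pattern as the path found above.
attn = torch.tensor([
    [1., 0., 0., 0., 0., 0.],
    [0., 1., 0., 0., 0., 0.],
    [0., 0., 1., 1., 1., 0.],
    [0., 0., 0., 0., 0., 1.],
])
x_m = torch.randn(7, 4)  # [d, t_x], stand-in means

# Matrix-multiplication form: [t_z, t_x] x [t_x, d] -> [t_z, d], transposed to [d, t_z].
z_m_matmul = torch.matmul(attn.t(), x_m.t()).t()

# Equivalent form: repeat column i of x_m as many times as row i of attn has ones.
durations = attn.sum(dim=1).long()
z_m_repeat = torch.repeat_interleave(x_m, durations, dim=1)

print(durations)
print(torch.allclose(z_m_matmul, z_m_repeat))
```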

We can also see that even though the alignment returned by the monotonic alignment algorithm is detached from the computation graph, gradients still flow through the `x_m` and `x_logs` statistics (note the `grad_fn` on `z_m` above). We can calculate a loss from our `z_m`, `z_logs`, and original `z` by taking the negative log likelihood (the negative log of the normal PDF):

In [86]:
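# Summed negative log-likelihood of z under N(z_m, exp(z_logs)).
# Note: the constant 0.5*log(2*pi) is added only once here; an exact summed NLL
# would count it once per element of z. It does not affect gradients either way.
loss = torch.sum(z_logs) + (
    0.5 * torch.sum(torch.exp(-2 * z_logs) * ((z - z_m) ** 2))) + (
    0.5*math.log(2*math.pi))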
print(loss)

tensor(49.5606, grad_fn=<AddBackward0>)
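
As a sanity check on the negative log-likelihood formula, the sketch below compares a hand-written summed NLL against `torch.distributions.Normal` on random stand-in tensors (not the notebook's values). Note that in this exact summed form the constant `0.5*log(2*pi)` is counted once per element. It also confirms that gradients reach the mean and log-std tensors.

```python
import math
import torch

torch.manual_seed(0)
z = torch.randn(2, 7, 6)
z_m = torch.randn(2, 7, 6, requires_grad=True)
z_logs = (0.1 * torch.randn(2, 7, 6)).requires_grad_()

# Hand-written summed NLL, with the constant counted once per element.
nll_manual = (
    torch.sum(z_logs)
    + 0.5 * torch.sum(torch.exp(-2 * z_logs) * (z - z_m) ** 2)
    + 0.5 * math.log(2 * math.pi) * z.numel()
)

# Reference: negative sum of per-element Gaussian log-densities.
nll_reference = -torch.distributions.Normal(z_m, torch.exp(z_logs)).log_prob(z).sum()

nll_manual.backward()
print(nll_manual.item(), nll_reference.item())
print(z_m.grad is not None, z_logs.grad is not None)
```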


The actual GlowTTS loss function has other terms relating to generative flows and also averages across the batch and sequence axes. Another component of the GlowTTS network is the duration predictor, whose inner workings are considered out of scope here.

### 2.1. Inference

At inference time, we're trying to predict `z` directly. However, in the TTS task we have no target `z` from which to compute the alignment matrix, so we don't know which of the text encoder's distributions should be used to sample which output features. GlowTTS instead generates the alignment matrix using a separately trained component called the "duration predictor", which predicts the logarithm of the number of output z-features to assign to each x-feature (`logw` here); exponentiating and rounding up gives the integer durations, denoted `w_ceil`.

In [87]:
# We simulate the duration predictor with a linear layer.
proj_w = nn.Linear(c_x, 1)

# The projection layer outputs log(w), the log duration.
logw = proj_w(x).transpose(1,2)
# We mask the exponentiated output by the input lengths.
w = torch.exp(logw) * x_mask 
# Then we take the ceiling to ensure we produce integers (we cannot assign an 
# input to a fractional number of outputs)
w_ceil = torch.ceil(w) 
print(w_ceil)

tensor([[[1., 2., 1., 2.]],

        [[1., 2., 1., 2.]]], grad_fn=<CeilBackward0>)


Because the above duration predictor is not very smart, it may output the same duration for every single x-feature. Nonetheless, our next step is to produce the alignment matrix, which can be done using the `generate_path` function.

In [88]:
# y_lengths represents the total z-feature duration of each utterance, clamped to a minimum of 1.
y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long() # long() is important, as sequence_mask expects integer inputs.
z_mask = torch.unsqueeze(sequence_mask(y_lengths), 1)
attn_mask = torch.unsqueeze(x_mask, -1) * torch.unsqueeze(z_mask, 2)

attn = generate_path(w_ceil.squeeze(1), attn_mask.squeeze(1)).unsqueeze(1).float()
print(attn)

tensor([[[[1., 0., 0., 0., 0., 0.],
          [0., 1., 1., 0., 0., 0.],
          [0., 0., 0., 1., 0., 0.],
          [0., 0., 0., 0., 1., 1.]]],


        [[[1., 0., 0., 0., 0., 0.],
          [0., 1., 1., 0., 0., 0.],
          [0., 0., 0., 1., 0., 0.],
          [0., 0., 0., 0., 1., 1.]]]])
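
To see what `generate_path` computes without the padding trick, here is a simplified single-sequence sketch. The helper name `path_from_durations` is hypothetical, and it omits batching and masking. Each x-feature `i` covers the z-positions from the cumulative duration before it up to (but not including) the cumulative duration after it.

```python
import torch

def path_from_durations(durations, t_z):
    """durations: [t_x] integer tensor; returns a 0/1 float alignment of shape [t_x, t_z]."""
    ends = torch.cumsum(durations, dim=0)  # exclusive end position of each x-feature
    starts = ends - durations              # inclusive start position of each x-feature
    frames = torch.arange(t_z).unsqueeze(0)  # [1, t_z]
    inside = (frames >= starts.unsqueeze(1)) & (frames < ends.unsqueeze(1))
    return inside.float()

durations = torch.tensor([1, 2, 1, 2])  # same durations as w_ceil above
path = path_from_durations(durations, t_z=6)
print(path)
```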


Now that we have our alignment matrix, we can use it to "sub in" the Gaussian distribution corresponding to each z-feature, as we did in training, and sample the z-features.

In [91]:
z_m = torch.matmul(attn.squeeze(1).transpose(1, 2), x_m.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']
z_logs = torch.matmul(attn.squeeze(1).transpose(1, 2), x_logs.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']
print(z_m.shape)
print(z_mask.shape)
z = (z_m + torch.exp(z_logs) * torch.randn_like(z_m)) * z_mask
z = z.transpose(1,2)
print(z)
print(z.shape)

torch.Size([2, 7, 6])
torch.Size([2, 1, 6])
tensor([[[ 1.2520, -0.0424, -0.3758, -0.2452,  0.9683, -1.5186, -1.5762],
         [-2.8571,  1.3377, -0.2475,  1.1090, -0.9839, -0.8266, -0.5174],
         [-3.6631,  0.8135,  0.5757,  1.8449, -1.3828, -0.1302, -0.4160],
         [ 0.4235,  0.2687,  0.5241, -2.3097, -0.3391,  3.0892, -0.5596],
         [-0.3883, -0.4462, -1.1454, -0.0425, -0.8873,  1.0433, -0.8623],
         [-1.6472,  0.4159, -0.6057, -0.8800, -1.3943, -3.0522, -0.4836]],

        [[ 5.8681, -0.3183,  0.0129, -1.0627, -1.0012, -1.8601, -0.1235],
         [-0.8571,  0.2519, -0.4191,  0.6520, -2.6594,  1.0875, -0.3768],
         [ 0.4435,  0.3587, -0.6444, -0.4242,  0.5319, -0.3713,  0.2347],
         [ 2.8368,  0.3967, -1.1188, -0.0618,  0.1658, -0.5675, -1.6146],
         [-2.2554, -1.0208, -0.8177, -0.6440,  1.0045, -0.3717, -1.0604],
         [ 1.4739,  0.1365, -1.5026, -1.1222, -0.0331, -0.0529, -0.5911]]],
       grad_fn=<TransposeBackward0>)
torch.Size([2, 6, 7])


### Addendum 2.A: sequence_mask and attn_mask

The `sequence_mask` function takes a 1D tensor of intended input lengths and outputs an array of binary masks, one per length, that is True for positions less than that length and False for positions greater than or equal to it. (This is so we can multiply the binary mask against quantities in calculations, to ensure the model doesn't use any "accidental information" beyond the intended length.)

As a trivial example, we produce a mask for a single length first:

In [5]:
def sequence_mask(length, max_length=None):
  if max_length is None:
    max_length = length.max()
  x = torch.arange(max_length, dtype=length.dtype, device=length.device)
  return x.unsqueeze(0) < length.unsqueeze(1)
  
sequence_mask(torch.Tensor([4]))

tensor([[True, True, True, True]])

Next, let's look at producing a mask for multiple lengths:

In [50]:
sequence_mask(torch.Tensor([4,1]))

tensor([[ True,  True,  True,  True],
        [ True, False, False, False]])

The first row (corresponding to the input "4") is filled completely,
while the second row has only the first cell marked True.

Now what about attn_mask?

In [79]:
x_mask = torch.unsqueeze(sequence_mask(torch.Tensor([4, 1])), 1) # [2, 1, 4]
z_mask = torch.unsqueeze(sequence_mask(torch.Tensor([6, 5])), 1) # [2, 1, 6]
attn_mask = torch.unsqueeze(x_mask, -1) * torch.unsqueeze(z_mask, 2) # [2, 1, 4, 1] x [2, 1, 1, 6]
print(attn_mask)

tensor([[[[ True,  True,  True,  True,  True,  True],
          [ True,  True,  True,  True,  True,  True],
          [ True,  True,  True,  True,  True,  True],
          [ True,  True,  True,  True,  True,  True]]],


        [[[ True,  True,  True,  True,  True, False],
          [False, False, False, False, False, False],
          [False, False, False, False, False, False],
          [False, False, False, False, False, False]]]])


attn_mask is fed into the alignment algorithm to ensure the model doesn't try to find an alignment for non-existent distributions or z-features.

It uses a broadcasted elementwise multiplication (effectively an outer product) to generate a pairwise attention mask from each `x_mask` and `z_mask`. In the first mask, corresponding to an x-length of 4 and a z-length of 6, all cells are True. In the second mask, corresponding to an x-length of 1 and a z-length of 5, only the first row contains True values, and only in its first 5 columns.
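
The sketch below makes the pairwise structure explicit by building the same mask two ways: with broadcasting, and as a per-batch outer product via `torch.einsum`. The helper `lengths_to_mask` is a hypothetical stand-in for `sequence_mask` that takes integer lengths, and the singleton channel dimension is dropped for brevity.

```python
import torch

def lengths_to_mask(lengths):
    """lengths: [b] integer tensor; returns a [b, max_len] boolean mask."""
    positions = torch.arange(int(lengths.max()))
    return positions.unsqueeze(0) < lengths.unsqueeze(1)

x_mask = lengths_to_mask(torch.tensor([4, 1]))  # [2, 4]
z_mask = lengths_to_mask(torch.tensor([6, 5]))  # [2, 6]

# Broadcasting: [2, 4, 1] & [2, 1, 6] -> [2, 4, 6]
attn_mask = x_mask.unsqueeze(-1) & z_mask.unsqueeze(1)

# The same thing written as an outer product for each batch item.
outer = torch.einsum('bi,bj->bij', x_mask.float(), z_mask.float()).bool()

print(attn_mask[1])
print(torch.equal(attn_mask, outer))
```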