# Listen, Attend and Spell

Implementation of the paper by W. Chan et al., submitted to [arXiv](https://arxiv.org/abs/1508.01211) on 5 Aug 2015.
The final paper was presented at ICASSP 2016 and is available on the [author's site](http://williamchan.ca/papers/wchan-icassp-2016.pdf).

## Abstract
> We present Listen, Attend and Spell (LAS), a neural network that learns to transcribe speech utterances to characters. Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits characters as outputs. The network produces character sequences without making any independence assumptions between the characters. This is the key improvement of LAS over previous end-to-end CTC models. On a subset of the Google voice search task, LAS achieves a word error rate (WER) of 14.1% without a dictionary or a language model, and 10.3% with language model rescoring over the top 32 beams. By comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0%. 

## 1. Introduction

State-of-the-art speech recognizers of today are complicated systems comprising various components: acoustic models, language models, pronunciation models and text normalization. Each of these components makes assumptions about the underlying probability distributions it models. For example, n-gram language models and Hidden Markov Models (HMMs) make strong Markovian independence assumptions between words/symbols in a sequence. Connectionist Temporal Classification (CTC) and DNN-HMM systems assume that neural networks make independent predictions at different times and use HMMs or language models (which make their own independence assumptions) to introduce dependencies between these predictions over time [[1], [2], [3]]. _End-to-end_ training of such models attempts to mitigate these problems by training the components jointly [4, 5, 6]. In these models, acoustic models are updated based on a WER proxy, while the pronunciation and language models are rarely updated [7], if at all.

In this paper we introduce Listen, Attend and Spell (LAS), a neural network that learns to transcribe an audio sequence signal to a word sequence, one character at a time, without using explicit language models, pronunciation models, HMMs, etc. LAS does not make any independence assumptions about the nature of the probability distribution of the output character sequence, given the input acoustic sequence. This method is based on the sequence-to-sequence learning framework with attention [8, 9, 10, 11, 12, 13]. It consists of an encoder Recurrent Neural Network (RNN), which is named the _listener_, and a decoder RNN, which is named the _speller_. The listener is a pyramidal RNN that converts speech signals into high level features. The speller is an RNN that transduces these higher level features into output utterances by specifying a probability distribution over the next character, given all of the acoustics and the previous characters. At each step the RNN uses its internal state to guide an attention mechanism [10, 11, 12] to compute a "context" vector from the high level features of the listener. It uses this context vector and its internal state both to update its internal state and to predict the next character in the sequence. The entire model is trained jointly, from scratch, by optimizing the probability of the output sequence using a chain rule decomposition. We call this an _end-to-end model_ because all the components of a traditional speech recognizer are integrated into its parameters and optimized together during training, unlike _end-to-end training_ of conventional models, which attempts to adjust acoustic models to work well with the other fixed components of a speech recognizer.

Our model was inspired by [11, 12] that showed how end-to-end recognition could be performed on the TIMIT phone recognition task. We note a recent paper from the same group that describes an application of these ideas to WSJ [14]. Our paper independently explores the challenges associated with the application of these ideas to large scale conversational speech recognition on a Google voice search task. We defer a discussion of the relationship between these and other methods to section 5.

[1]: https://arxiv.org/abs/1303.5778 "Speech Recognition with Deep Recurrent Neural Networks"
[2]: https://arxiv.org/abs/1701.02720 "Towards End-to-End Speech Recognition with Recurrent Neural Networks"
[3]: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups"

## 2. Model

In this section, we formally describe LAS. Let $\mathbf{x} = (x_1,...,x_T)$ be the input sequence of filter bank spectra features and $\mathbf{y} = (\langle \text{sos} \rangle,y_1,...,y_S,\langle \text{eos} \rangle)$, $y_i \in \{a,\ldots,z,0,\ldots,9,\langle \text{space} \rangle, \langle \text{comma} \rangle, \langle \text{period} \rangle, \langle \text{apostrophe} \rangle, \langle \text{unk} \rangle \}$ be the output sequence of characters. Here $\langle \text{sos} \rangle$ and $\langle \text{eos} \rangle$ are the special start-of-sentence and end-of-sentence tokens, respectively, and $\langle \text{unk} \rangle$ stands for unknown tokens such as accented characters.

LAS models each character output $y_i$ as a conditional distribution given the previous characters $y_{<i}$ and the input signal $\mathbf{x}$, using the chain rule for probabilities:

$$
P(\mathbf{y} \mid \mathbf{x}) = \prod_{i}{P(y_i \mid \mathbf{x}, y_{<i})}
\tag{1}\label{eq:1}
$$

This objective makes the model a discriminative, end-to-end model, because it directly predicts the conditional probability of character sequences, given the acoustic signal. 
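
As a small illustration of Eq. 1, the sketch below sums per-step log-probabilities to obtain the log-probability of a whole sequence. The function name `sequence_log_prob` and the toy distributions are hypothetical; in the real model each distribution would come from the speller.

```python
import math


def sequence_log_prob(step_distributions, targets):
    """Return sum_i log P(y_i | x, y_<i), i.e. the log of Eq. 1."""
    total = 0.0
    for dist, y in zip(step_distributions, targets):
        total += math.log(dist[y])
    return total


# Hypothetical per-step distributions over a 3-symbol alphabet.
steps = [
    {"a": 0.5, "b": 0.25, "<eos>": 0.25},
    {"a": 0.1, "b": 0.8, "<eos>": 0.1},
    {"a": 0.25, "b": 0.25, "<eos>": 0.5},
]
log_p = sequence_log_prob(steps, ["a", "b", "<eos>"])
print(math.exp(log_p))  # equals the product 0.5 * 0.8 * 0.5 up to rounding
```

Working in log space avoids numerical underflow when the product runs over long character sequences.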

LAS consists of two sub-modules: the listener and the speller. The listener is an acoustic model encoder that performs an operation called $\operatorname{Listen}$. The $\operatorname{Listen}$ operation transforms the original signal $\mathbf{x}$ into a high level representation $\mathbf{h} = (h_1,...,h_U)$ with $U \le T$. The speller is an attention-based character decoder that performs an operation we call $\operatorname{AttendAndSpell}$. The $\operatorname{AttendAndSpell}$ operation consumes $\mathbf{h}$ and produces a probability distribution over character sequences:

$$
\mathbf{h} = \operatorname{Listen}(\mathbf{x}) 
\tag{2}\label{eq:2}
$$
$$
P(y_i \mid \mathbf{x}, y_{<i}) = \operatorname{AttendAndSpell}(y_{<i}, \mathbf{h})
\tag{3}\label{eq:3}
$$

Figure 1 depicts these two components. We provide more details of these components in the following sections.

![Figure 1](./images/las1.png "Figure 1: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding our input sequence x into high level features h, the speller is an attention-based decoder generating the y characters from h.")


### 2.1. Listen 

The $\operatorname{Listen}$ operation uses a Bidirectional Long Short Term Memory RNN (BLSTM) [15, 16, 2] with a pyramidal structure. This modification is required to reduce the length $U$ of $\mathbf{h}$, from $T$, the length of the input $\mathbf{x}$, because the input speech signals can be hundreds to thousands of frames long. A direct application of BLSTM for the operation $\operatorname{Listen}$ converged slowly and produced results inferior to those reported here, even after a month of training time. This is presumably because the operation $\operatorname{AttendAndSpell}$ has a hard time extracting the relevant information from a large number of input time steps.

We circumvent this problem by using a pyramidal BLSTM (pBLSTM). In each successive stacked pBLSTM layer, we reduce the time resolution by a factor of 2. In a typical deep BLSTM architecture, the output at the $i$-th time step, from the $j$-th layer is computed as follows: 

$$
h^j_i = \operatorname{BLSTM}(h^j_{i-1}, h^{j-1}_i)
\tag{4}\label{eq:4}
$$

In the pBLSTM model, we concatenate the outputs at consecutive steps of each layer before feeding it to the next layer, i.e.: 

$$
h^j_i = \operatorname{pBLSTM}(h^j_{i-1}, \left[ h^{j-1}_{2i} , h^{j-1}_{2i+1} \right])
\tag{5}\label{eq:5}
$$

In our model, we stack 3 pBLSTMs on top of the bottom BLSTM layer to reduce the time resolution $2^3 = 8$ times. This allows the attention model (described in the next section) to extract the relevant information from a smaller number of time steps. In addition to reducing the resolution, the deep architecture allows the model to learn nonlinear feature representations of the data. See Figure 1 for a visualization of the pBLSTM.
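
The time reduction alone can be sketched without any recurrent layer. The helper `pyramid_step` below is a hypothetical name that only performs the concatenation $[h_{2i}, h_{2i+1}]$ from Eq. 5. Because no BLSTM sits between the steps in this toy version, the feature width doubles at each level, whereas in the real model the pBLSTM maps it back to a fixed size.

```python
def pyramid_step(frames):
    """Concatenate frames 2i and 2i+1, halving the sequence length."""
    if len(frames) % 2 != 0:
        raise ValueError("sequence length must be even")
    return [frames[i] + frames[i + 1] for i in range(0, len(frames), 2)]


# 440 toy frames with a single feature each.
frames = [[float(t)] for t in range(440)]

level = frames
for _ in range(3):  # three pyramidal layers
    level = pyramid_step(level)

print(len(level), len(level[0]))  # 55 frames, each holding 8 original frames
```

This is also why the input length has to be a multiple of $2^3$: each of the three steps needs an even number of frames.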

The pyramidal structure also reduces the computational complexity. The attention mechanism in the speller has a computational complexity of $\mathcal O(US)$, where $U$ is the length of $\mathbf{h}$ and $S$ is the length of the output sequence. Thus, reducing $U$ speeds up learning and inference significantly. Other neural network architectures have been described in the literature with similar motivations, including the hierarchical RNN [17], clockwork RNN [18] and CNN [19].

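```python
from keras.layers import Bidirectional, LSTM, Reshape
from keras.models import Sequential


def Listen(units, timesteps, features):
    """Listener: a bottom BLSTM followed by 3 pyramidal BLSTM (pBLSTM) layers."""
    return Sequential([
        # Bottom BLSTM on the full-resolution input; 'concat' yields 2 * units features per step.
        Bidirectional(LSTM(units, return_sequences=True), merge_mode='concat', input_shape=(timesteps, features)),
        # Each Reshape concatenates two consecutive steps (Eq. 5), halving the time resolution.
        Reshape((-1, units * 4)),
        Bidirectional(LSTM(units, return_sequences=True), merge_mode='concat'),
        Reshape((-1, units * 4)),
        Bidirectional(LSTM(units, return_sequences=True), merge_mode='concat'),
        Reshape((-1, units * 4)),
        Bidirectional(LSTM(units, return_sequences=True), merge_mode='concat')
    ])

# units = 256        -> 256 nodes per direction, 512 outputs per BLSTM layer
# timesteps = 55*8   -> 440 input frames (about 4.4 s if one frame is computed every 10 ms);
#                       must be a multiple of 2^3 because each of the 3 pyramidal layers halves the length
# features = 40      -> 40-dimensional filter bank features per frame

l = Listen(256, 55*8, 40)
print(l.output)
l.summary()
```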


```text
Composite(bidirectional_25_input: Tensor[440,40]) -> Tensor[55,512]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_25 (Bidirectio (None, 440, 512)          608256    
_________________________________________________________________
reshape_19 (Reshape)         (None, 220, 1024)         0         
_________________________________________________________________
bidirectional_26 (Bidirectio (None, 220, 512)          2623488   
_________________________________________________________________
reshape_20 (Reshape)         (None, 110, 1024)         0         
_________________________________________________________________
bidirectional_27 (Bidirectio (None, 110, 512)          2623488   
_________________________________________________________________
reshape_21 (Reshape)         (None, 55, 1024)          0         
_________________________________________________________________
bidirect
```

### 2.2. Attend and Spell 

The $\operatorname{AttendAndSpell}$ function is computed using an attention-based LSTM transducer [10, 12]. At every output step, the transducer produces a probability distribution over the next character conditioned on all the characters seen previously. The distribution for $y_i$ is a function of the decoder state $s_i$ and context $c_i$. The decoder state $s_i$ is a function of the previous state $s_{i-1}$, the previously emitted character $y_{i-1}$ and context $c_{i-1}$. The context vector $c_i$ is produced by an attention mechanism. Specifically:

$$
c_i = \operatorname{AttentionContext}(s_i, \mathbf{h})
\tag{6}\label{eq:6}
$$
$$
s_i = \operatorname{RNN}(s_{i-1}, y_{i-1}, c_{i-1})
\tag{7}\label{eq:7}
$$
$$
P(y_i \mid \mathbf{x}, y_{<i}) = \operatorname{CharacterDistribution}(s_i, c_i)
\tag{8}\label{eq:8}
$$

where $\operatorname{CharacterDistribution}$ is an MLP with softmax outputs over characters, and where RNN is a 2 layer LSTM. 

At each time step $i$, the attention mechanism $\operatorname{AttentionContext}$ generates a context vector $c_i$ encapsulating the information in the acoustic signal needed to generate the next character. The attention model is content-based: the contents of the decoder state $s_i$ are matched to the contents of $h_u$, representing time step $u$ of $\mathbf{h}$, to generate an attention vector $\alpha_i$. The vectors $h_u$ are linearly blended using $\alpha_i$ to create $c_i$.

Specifically, at each decoder time step $i$, the $\operatorname{AttentionContext}$ function computes the scalar energy $e_{i,u}$ for each time step $u$, using vector $h_u \in \mathbf{h}$ and $s_i$. The scalar energies $e_{i,u}$ are converted into a probability distribution over time steps (or attention) $\alpha_i$ using a softmax function. The softmax probabilities are used as mixing weights for blending the listener features $h_u$ into the context vector $c_i$ for output time step $i$:

$$
e_{i,u} = \langle \phi(s_i), \psi(h_u) \rangle
\tag{9}\label{eq:9}
$$
$$
\alpha_{i,u} = \frac {\exp(e_{i,u})} {\sum_{u'} \exp(e_{i,u'})}
\tag{10}\label{eq:10}
$$
$$
c_i = \sum_u {\alpha_{i,u} h_u}
\tag{11}\label{eq:11}
$$

where $\phi$ and $\psi$ are MLP networks. After training, the $\alpha_i$ distribution is typically very sharp and focuses on only a few frames of $\mathbf{h}$; $c_i$ can be seen as a continuous bag of weighted features of $\mathbf{h}$. Figure 1 shows the LAS architecture.
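
A minimal sketch of Eqs. 9 to 11 in plain Python follows. As a simplifying assumption, $\phi$ and $\psi$ are taken to be identity maps instead of MLPs, so the energy is a plain dot product; the function name `attention_context` and the toy vectors are hypothetical.

```python
import math


def attention_context(s, h):
    """Return (alpha, context) for decoder state s and listener features h."""
    # Eq. 9: scalar energy per listener step (phi and psi assumed to be identity).
    energies = [sum(a * b for a, b in zip(s, h_u)) for h_u in h]
    # Eq. 10: softmax over listener steps, shifted by the max for numerical stability.
    top = max(energies)
    exps = [math.exp(e - top) for e in energies]
    total = sum(exps)
    alpha = [x / total for x in exps]
    # Eq. 11: context vector as the alpha-weighted sum of listener features.
    dim = len(h[0])
    context = [sum(alpha[u] * h[u][d] for u in range(len(h))) for d in range(dim)]
    return alpha, context


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy listener output, U = 3
s = [2.0, 0.0]                            # toy decoder state
alpha, c = attention_context(s, h)
print(alpha, c)
```

In this toy case the first and third listener steps receive the same, larger weight because they match the decoder state equally well.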

### 2.3. Learning

We train the parameters of our model to maximize the log probability of the correct sequences. Specifically:

$$
\tilde{\theta} = \underset{\theta}{\operatorname{arg\,max}} \sum_i \log P(y_i \mid \mathbf{x}, \tilde{y}_{<i}; \theta)
\tag{12}\label{eq:12}
$$

where $\tilde{y}_{i-1}$ is either the ground truth previous character or, with 10% probability, a character randomly sampled from the model, i.e. from $\operatorname{CharacterDistribution}(s_{i-1}, c_{i-1})$, using the procedure from [20].
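
The choice of the fed-back character can be sketched as below. This is only an illustration of the sampling rule described above, not the training code; `previous_token`, `sample_prob` and the toy distribution are hypothetical names and values.

```python
import random


def previous_token(ground_truth, model_dist, sample_prob=0.1, rng=random):
    """Pick the character fed back to the decoder at the next step.

    With probability `sample_prob` the character is drawn from the model's
    own distribution; otherwise the ground truth character is used.
    """
    if rng.random() < sample_prob:
        symbols = list(model_dist)
        weights = [model_dist[sym] for sym in symbols]
        return rng.choices(symbols, weights=weights, k=1)[0]
    return ground_truth


rng = random.Random(0)
dist = {"a": 0.7, "b": 0.2, "<eos>": 0.1}  # hypothetical CharacterDistribution output
fed_back = [previous_token("a", dist, rng=rng) for _ in range(1000)]
# Only a small share differs from the ground truth "a": sampling is rare,
# and a sampled character is often "a" anyway.
print(sum(tok != "a" for tok in fed_back))
```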

### 2.4. Decoding and Rescoring

During inference we want to find the most likely character sequence given the input acoustics:

$$
\hat{y} = \underset{\mathbf{y}}{\operatorname {arg\,max}}\,\log P(\mathbf{y} \mid \mathbf{x})
\tag{13}\label{eq:13}
$$ 

We use a simple left-to-right beam search similar to [8]. We can also apply language models trained on large external text corpora alone, similar to conventional speech systems [21]. We simply rescore our beams with the language model. We find that our model has a small bias for shorter utterances so we normalize our probabilities by the number of characters $\left| \mathbf{y} \right|_c$ in the hypothesis and combine it with a language model probability $P_\text{LM}(\mathbf{y})$:

$$
s(\mathbf{y} \mid \mathbf{x}) = \frac{\log P(\mathbf{y} \mid \mathbf{x})}{\left| \mathbf{y} \right|_c} + \lambda \log P_\text{LM}(\mathbf{y})
\tag{14}\label{eq:14}
$$

where $\lambda$ is our language model weight and can be determined by a held-out validation set.
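
The rescoring step can be sketched as follows, dividing the LAS log-probability by the number of characters to apply the length normalization described above and then adding the weighted language model term. The `rescore` helper, the hypotheses and their scores are all made up for illustration, and counting characters with `len(text)` (spaces included) is an assumption.

```python
def rescore(beams, lm_weight):
    """Sort (text, log_p_las, log_p_lm) hypotheses by the combined score, best first."""
    def score(beam):
        text, log_p_las, log_p_lm = beam
        return log_p_las / len(text) + lm_weight * log_p_lm
    return sorted(beams, key=score, reverse=True)


# Hypothetical beams: (hypothesis, log P(y|x) from LAS, log P_LM(y)).
beams = [
    ("call mom", -4.0, -6.0),
    ("call tom", -3.6, -9.0),
]

print(rescore(beams, lm_weight=0.0)[0][0])  # acoustic score alone prefers "call tom"
print(rescore(beams, lm_weight=0.1)[0][0])  # with the LM term, "call mom" ranks first
```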


## 3. Experiments

We used a dataset with three million Google Voice Search utterances (representing 2000 hours of data) for our experiments. Approximately 10 hours of utterances were randomly selected as a held-out validation set. Data augmentation was performed using a room simulator, adding different types of noise and reverberations; the noise sources were obtained from YouTube and environmental recordings of daily events [22]. This increased the amount of audio data by 20 times with an SNR between 5dB and 30dB [22]. We used 40-dimensional log-mel filter bank features computed every 10ms as the acoustic inputs to the listener. A separate set of 22K utterances representing approximately 16 hours of data was used as the test data. A noisy test set was also created using the same corruption strategy that was applied to the training data. All training sets are anonymized and hand-transcribed, and are representative of Google’s speech traffic.
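
To make the SNR range concrete, the sketch below mixes a noise signal into speech at a chosen SNR, using $\text{SNR}_\text{dB} = 10 \log_{10}(P_\text{speech} / P_\text{noise})$. It covers additive noise only and is not the room simulator of [22]; `mix_at_snr` and the toy signals are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_p_noise / p_noise)
    return [x + gain * n for x, n in zip(speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]   # toy signal with power 1.0
noise = [0.5, 0.5, -0.5, -0.5]    # toy noise with power 0.25
mixed = mix_at_snr(speech, noise, snr_db=20.0)
print(mixed)
```

At 20 dB the added noise carries one hundredth of the speech power; at 5 dB, the noisiest end of the range, it carries roughly a third.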

The text was normalized by converting all characters to lower case English alphanumerics (including digits). The punctuation marks space, comma, period and apostrophe were kept, while all other tokens were converted to the unknown $\langle \text{unk} \rangle$ token. As mentioned earlier, all utterances were padded with the start-of-sentence $\langle \text{sos} \rangle$ and the end-of-sentence $\langle \text{eos} \rangle$ tokens. The state-of-the-art model on this dataset is a CLDNN-HMM system that was described in [22]. The CLDNN system achieves a WER of 8.0% on the clean test set and 8.9% on the noisy test set. However, we note that the CLDNN uses unidirectional LSTMs and would certainly benefit from the use of a BLSTM architecture. Additionally, the LAS model does not use convolutional filters which have been reported to yield 5-7% WER relative improvement [22].
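
The normalization rules above can be sketched in a few lines. The function name `normalize` and the string spellings of the special tokens are assumptions made for illustration.

```python
import string

# Characters kept as-is: a-z, 0-9, space, comma, period and apostrophe.
KEEP = set(string.ascii_lowercase + string.digits + " ,.'")


def normalize(text):
    """Lower-case the text, map unsupported characters to <unk>, pad with <sos>/<eos>."""
    tokens = ["<sos>"]
    for ch in text.lower():
        tokens.append(ch if ch in KEEP else "<unk>")
    tokens.append("<eos>")
    return tokens


print(normalize("Don't stop: 42!"))  # ':' and '!' become <unk>
```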

For the $\operatorname{Listen}$ function we used 3 layers of 512 pBLSTM nodes (i.e., 256 nodes per direction) on top of a BLSTM that operates on the input. This reduced the time resolution by a factor of $2^3 = 8$. The $\operatorname{Spell}$ function used a two layer LSTM with 512 nodes each. The weights were initialized with a uniform distribution $\mathcal{U}(-0.1, 0.1)$. Asynchronous Stochastic Gradient Descent (ASGD) was used for training our model [23]. A learning rate of 0.2 was used with a geometric decay of 0.98 per 3M utterances (i.e., 1/20th of an epoch). We used the DistBelief framework [23] with 32 replicas, each with a minibatch of 32 utterances. In order to further speed up training, the sequences were grouped into buckets based on their frame length [8]. The model was trained until the results on the validation set stopped improving, taking approximately two weeks. The model was decoded using N-best list decoding with beam size of $N = 32$.
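
The learning rate schedule can be written as a one-line function. The text does not say whether the decay is applied in discrete steps or continuously; the sketch below assumes discrete steps, and the function and parameter names are hypothetical.

```python
def learning_rate(utterances_seen, base_lr=0.2, decay=0.98, interval=3_000_000):
    """Geometric decay: multiply by `decay` once per `interval` utterances (stepwise, an assumption)."""
    return base_lr * decay ** (utterances_seen // interval)


print(learning_rate(0))           # initial rate, 0.2
print(learning_rate(3_000_000))   # after the first decay step
print(learning_rate(60_000_000))  # after 20 decay steps, roughly 0.13
```

With the 20-fold augmentation of three million utterances, 20 decay steps correspond to one pass over the augmented data, which matches the "1/20th of an epoch" remark.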



## References

[1] A. Graves, A. Mohamed, and G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013. 

[2] A. Graves and N. Jaitly, “Towards End-to-End Speech Recognition with Recurrent Neural Networks,” in International Conference on Machine Learning, 2014. 

[3] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Signal Processing Magazine, Nov. 2012. 

[4] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” in INTERSPEECH, 2013.

[5] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, “Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks,” in INTERSPEECH, 2014.

[6] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding,” in http://arxiv.org/abs/1507.08240, 2015.

[7] Y. Kubo, T. Hori, and A. Nakamura, “Integrating Deep Neural Networks into Structured Classification Approach based on Weighted Finite-State Transducers,” in INTERSPEECH, 2012.

[8] I. Sutskever, O. Vinyals, and Q. Le, “Sequence to Sequence Learning with Neural Networks,” in Neural Information Processing Systems, 2014.

[9] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” in Conference on Empirical Methods in Natural Language Processing, 2014.

[10] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in International Conference on Learning Representations, 2015.

[11] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results,” in Neural Information Processing Systems: Workshop Deep Learning and Representation Learning Workshop, 2014.

[12] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-Based Models for Speech Recognition,” in Neural Information Processing Systems, 2015.

[13] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in International Conference on Machine Learning, 2015. 

[14] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in http://arxiv.org/abs/1508.04395, 2015.

[15] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.

[16] A. Graves, N. Jaitly, and A. Mohamed, “Hybrid Speech Recognition with Bidirectional LSTM,” in Automatic Speech Recognition and Understanding Workshop, 2013.

[17] S. Hihi and Y. Bengio, “Hierarchical Recurrent Neural Networks for Long-Term Dependencies,” in Neural Information Processing Systems, 1996. 

[18] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, “A Clockwork RNN,” in International Conference on Machine Learning, 2014. 

[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 11 Nov. 1998.

[20] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks,” in Neural Information Processing Systems, 2015. 

[21] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi Speech Recognition Toolkit,” in Automatic Speech Recognition and Understanding Workshop, 2011.

[22] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015. 

[23] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, “Large Scale Distributed Deep Networks,” in Neural Information Processing Systems, 2012. 

[24] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” in http://arxiv.org/abs/1412.5567, 2014.

[25] A. Maas, Z. Xie, D. Jurafsky, and A. Ng, “Lexicon-free conversational speech recognition with neural networks,” in North American Chapter of the Association for Computational Linguistics, 2015. 

[26] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk, “Learning acoustic frame labeling for speech recognition with recurrent neural networks,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2015. 

[27] H. Sak, A. Senior, K. Rao, and F. Beaufays, “Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition,” in INTERSPEECH, 2015.
