### FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

## **Dissertation Preparation**

José Pedro Castro Fonseca

WORKING VERSION



Integrated Masters in Electrical and Copmuter Engineering

Supervisor: João Canas Ferreira

Second Supervisor: Ivo Timóteo

February 11, 2016

# **Contents**

| 1  | Intr                | oduction                                   | 1  |  |  |  |  |  |  |
|----|---------------------|--------------------------------------------|----|--|--|--|--|--|--|
|    | 1.1                 | Background                                 | 1  |  |  |  |  |  |  |
|    | 1.2                 | Motivation                                 | 2  |  |  |  |  |  |  |
|    | 1.3                 | Objectives                                 | 2  |  |  |  |  |  |  |
|    | 1.4                 | People Involved                            | 3  |  |  |  |  |  |  |
| 2  | Prob                | Problem Characterization 5                 |    |  |  |  |  |  |  |
|    | 2.1                 | Theoretical Background                     | 5  |  |  |  |  |  |  |
|    |                     | 2.1.1 Basic Concepts of Machine Learning   | 5  |  |  |  |  |  |  |
|    |                     | 2.1.2 Artificial Neural Networks           | 6  |  |  |  |  |  |  |
|    |                     | 2.1.3 Recurrent Neural Networks            | 10 |  |  |  |  |  |  |
|    |                     | 2.1.4 Long Short-Term Memory Networks      | 11 |  |  |  |  |  |  |
|    | 2.2                 | Proposed Solution                          | 14 |  |  |  |  |  |  |
| 3  | State of the Art 17 |                                            |    |  |  |  |  |  |  |
|    | 3.1                 | LSTM Applications (non-FPGA related)       | 17 |  |  |  |  |  |  |
|    | 3.2                 | Hardware Implementations of LSTM           | 18 |  |  |  |  |  |  |
|    | 3.3                 | •                                          |    |  |  |  |  |  |  |
|    | 3.4                 | Final Overview                             | 20 |  |  |  |  |  |  |
| 4  | Work Plan 21        |                                            |    |  |  |  |  |  |  |
|    | 4.1                 | Approach and Main Tasks                    | 21 |  |  |  |  |  |  |
|    |                     | 4.1.1 Task Description                     | 21 |  |  |  |  |  |  |
|    |                     | 4.1.2 Tentative Task Scheduling            | 23 |  |  |  |  |  |  |
|    | 4.2                 | Software and Hardware resources to be used | 26 |  |  |  |  |  |  |
|    |                     | 4.2.1 Hardware Resources                   | 26 |  |  |  |  |  |  |
|    |                     | 4.2.2 Software Resources                   | 26 |  |  |  |  |  |  |
| 5  | Earl                | ly Work                                    | 27 |  |  |  |  |  |  |
| A  | Lore                | en Ipsum                                   | 29 |  |  |  |  |  |  |
|    |                     | •                                          | 29 |  |  |  |  |  |  |
|    |                     | De onde Vem o Loren?                       | 29 |  |  |  |  |  |  |
| Re | feren               | ices                                       | 31 |  |  |  |  |  |  |

ii CONTENTS

# **List of Figures**

| 2.1 | In Figure 2.1a. each input feature $x_i$ is weighted by its corresponding weight                            |    |
|-----|-------------------------------------------------------------------------------------------------------------|----|
|     | $w_i$ . During the training procedure, these weights are adjusted so that the output                        |    |
|     | y approaches the target value. In Figure 2.1b, we see the diagram of an actual                              |    |
|     | multi polar neuron. The dendrites, where the stimuli are received, plays a role                             |    |
|     | similar to that of the input nodes. The axon transmits the signal to the synaptic                           |    |
|     | terminals, that are similar to the y output. Source: Wikipedia                                              | 7  |
| 2.2 | Three different activation functions. As you can see, the hyperbolic tangent has the                        |    |
|     | same extreme value as the sign step function, but has a smooth transition between                           |    |
|     | them, which can be interpreted as a soft decision in the more ambiguous middle                              |    |
|     | region, reflecting the degree of uncertainty on the decision. On the other hand, the                        |    |
|     | sigmoid function goes from zero to one, and is also smooth like the hyperbolic                              |    |
|     | tangent                                                                                                     | 8  |
| 2.3 | A three layer ANN. We have omitted some of the connections in the hidden layer,                             |    |
|     | for simplification purposes. $\mathbf{w}_1$ represents the weight matrix of the input layer, $\mathbf{w}_2$ |    |
|     | the weight matrix of the connections between the input layer and the hidden layer,                          |    |
|     | and $\mathbf{w}_3$ the weight connections between the hidden and the output layer. $f_{ij}(\dots)$          |    |
|     | is the activation function of the $j$ -th neuron of the $i$ -th layer. Since they can be                    |    |
|     | different, I chose different indexes to each                                                                | 9  |
| 2.4 | A complete LSTM neuron, with all the features as described in [1]. Source: [2] .                            | 12 |

iv LIST OF FIGURES

# **Abbreviations and Symbols**

ANN Artificial Neural Networks
BPTT Backpropagation Through Time
CNN Convolutional Neural Network
CPU Central Processing Unit
FPGA Field-Programmable Gate Array

LSTM Long Short-Term Memory
RNN Recursive Neural Networks

SPSA Simultaneous Perturbation Stochastic Approximation

### **Chapter 1**

### Introduction

Before start reading this report, let me provide you with a quick overview of its mains parts.

In Chapter 1, I will present a conceptual background to the problem at hand, exposing the background that surrounds it 1.1, the motivation for the desired solution 1.2, the objectives of it 1.3 and, lastly, the people that will help me achieve it 1.4.

In Chapter 2, the theoretical foundations will be layed out. Section 2.1.1 presents the basic concepts of Machine Learning for the unsuspicious reader that is not versed in that area of knowledge, and in the following sections, the theoretical details of ANNs, RNNs and LSTMs are presented, in consecutive order. Section 2.1.4.2 provides a quick explanation of the training algorithm that I will use in the final solution, that will be outlined in Section 2.2.

In Chapter 3, the state of the art of LSTMs, their applications 3.1, their hardware implementations 3.2 and the current work regarding the chosen training algorithm 3.3 are presented.

In Chapter 4, the work plan for the Dissertation work is exhaustively detailed in *tasks* in Section 4.1.1, and their temporal arrangement is pictured in a Gantt Chart in Section 4.1.2. Furthermore, the description of the Hardware and Software resources that I will be utilizing during the course of my work is done in Section 4.2

Finally, in Chapter 5 I will present some of the work that I have already done to prepare this thesis.

### 1.1 Background

The Artificial Neural Networks are one of the most popular learning structures in the field of Machine Learning. As the name, itself, suggests, their operation is inspired by the operation of the building blocks of our brain, the **neurons**. In spite of its high performance, one of their shortcomings is the fact that they cannot retain temporal dependences among incoming data samples, thus not beign suitable to process time-serie data, such as audio, video or other kinds of time-varying data streams, where current inputs have a high temporal dependence with previous and future inputs.

2 Introduction

To address this issue, several algorithms have been used, such as Hidden Markov Models (HMM) or Recurrent Neural Networks (RNN), but both of these methods fail to recall temporal dependences that extend over an extended period of time, for the reasons that we will understand in Section **faltacitação**. Thus, in 1997 Hochreiter et al. proposed a novel RNN structure [3], the Long Short-Term Network (LSTM), where a memory cell was introduced, and the input/output/read/write access are controlled by individual gates that are activated both by the incoming data samples, but also by the output from the previous time-step (it is an RNN after all). They are one of the state of the art methods in Deep Learning nowadays, as we can attest in Section 3.1.

### 1.2 Motivation

Hitherto, and to the best of my knowledge, all of the published applications that use LSTM are software based, but the parallel nature of the structure hollers for a dedicated hardware realization, that can dramatically increase its performance, something that, hitherto, has only recently been done once [4] (see Section 3.2 for further details), and although it improves the processing time when compared to a naïve software solution, it still lacks the ability to perform on-chip learning, and the learning process still needs to be performed offline, in a normal CPU, when it also could be speeded up in a dedicated hardware structure.

All these techniques are generally implemented in mainstream processors, making use of general high-level or low-level languages, where all the real parallelism is limited to the number of simultaneous threads that we can run on each physical core, which up to the date generally have between 2-8 cores (mobile devices and general use personal computers),

In order to parallelize the computations to the fullest extent possible, a solution is to port it both to a Graphics Processing Unit (GPU) or even a Field-Programmable Gate Array (FPGA), but the portin process is not entierely automatic, and to have the least performance drop possible, it has to be explicitly programmed in CUDA/OpenCL for GPUs, and a Hardware Description Language for FPGAs, with an increasing level of complexity and low-level details. This way, it is necessary to provide frameworks that can allow an FPGA to quickly reconfigure itself to run these kind of networks on demand, for a particular task that requires them, achieving a supposedly lower computation time, and unburdening the CPU from running it, and saving performance room for running other tasks related, for instance, with the Operating System. Furthermore, when processing an incoming stream of highly dimensional data, or with high throughput, a CPU solution might not be scalable, and could benefit greatly from a dedicated hardware implementation.

### 1.3 Objectives

Taking into account the considerations done in 1.2, I propose to develop a hardware implementation of an LSTM Network, with on-chip learning, improving the performance, capability and flexibility of the existing solution of [4]. Moreover, I also propose to benchmark this solution and

1.4 People Involved 3

try to compare it with the software performance results of [5] and, as a secondary objective, try to use my developed structure to replicate the some of the applications portrayed in 3.1. Lastly, to make justice to the name of the thesis, I will work to make this structure reconfigurable on the go, minimizing the reconfiguration time.

### 1.4 People Involved

Besides me, the candidate to the Master's Degree, there are two more people involved, namely

- **Supervisor** The Professor João Canas Ferreira, auxiliary professor at the Faculty of Engineering of the University of Porto.
- **Second Supervisor Ivo Timóteo**, a Computer Science PhD candidate at Cambridge University, UK, in the field of Artificial Inteligence.

4 Introduction

### Chapter 2

### **Problem Characterization**

### 2.1 Theoretical Background

### 2.1.1 Basic Concepts of Machine Learning

Machine Learning is a field of Computer Science that studies the development of mathematical techniques that allow software to learn autonomously, without an explicit description of each rule of operation. Its goal is to extract latent features from the data that allow an immediate classification of each input data into a particular class – the catch is that there is no previous rule formulation, but instead we have an adaptive model that adjusts is parameters accordingly with the input data it receives, improving the estimates it yields as it receives new input samples.

Let us consider a practical example. For instance, suppose we want to build a program that given an input audio waveform representation of a spoken word, it matches it into a particular word of a dictionary. We could, of course, devise a set of rules and exceptions for each word analysing some of its features (perhaps the Fourier representation of each one, and, from it, manually finding the appropriate rules for each), but apart from it being a very complex task, it wouldn't be a scalable solution, given the enormous number of words in each language. The approach taken by Machine Learning is diametrically different, and instead of manually processing each waveform, we build a large dataset, of size N, containing the waveforms of several words  $[\mathbf{x}_1 \ \mathbf{x}_2 \ \mathbf{x}_2 \ \dots \mathbf{x}_N]$  – we call this dataset the **training dataset** – and we feed it to our model. Each of the i-th data point was previously labelled, and in fact we feed each training data point  $\mathbf{x}_i$  along with its corresponding label  $t_i$ , so that the model can adapt its parameters accordingly to the *target value* it is supposed to classify. This set of labels  $\mathbf{t} = [t_1 \ t_2 \ t_3 \ \dots \ t_N]$  is called the **target vector**.

We are, then, left with the following question: how can the model quantitatively evaluate the quality of its current set of parameters? That could be achieved in a number of ways, but the most usual is using a **Cost Function** that, as the name suggests, measures the cost of each wrong classification of the model. The model then evolves in a way that minimizes the cost function. A usual choice for the cost function is the **sum of squares error**. Mathematically, if  $y_{\theta}^{i} = y_{\theta}(x_{i})$  is

the prediction for the input data point  $x_i$  with label  $t_i$ , given the current set of parameters  $\theta$ , the cost function using this metric is given by

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{N} (y_{\theta}^{i} - t_{i})^{2}.$$
 (2.1)

Sometimes, instead of applying the raw data to the model, we can apply some sort of *preprocessing* to the data to extract the relevant features from it. For instance, instead of just feeding a raw image, we can perform several operations like edge detection or low-pass filtering, and apply them in parallel. In cases of highly dimensional data (i.e. each data vector has a very high number of features), we can apply techniques like **Principal Component Analysis** to reduce the feature space to a smaller dimension one, where the previous features were combined into two or three new features that pose themselves as the most relevant.

The problems described above are, in fact, a subset of the problems that Machine Learning tries to address. These problems are called **classification problems**, because for each input data point, our model tries to fit it into the most appropriate class. But we can also address **regression problems** where the output is not limited to a discrete set of values but rather a continuous interval. On the other hand, the Neural Network that this work will implement addresses a special kind of classification problem, where the classification decision is influenced not only by the current input sample, but also by a given *window of samples* that trail the current sample.

In summary, the most typical setting for a Machine Learning problem is having a large *input* dataset which we use to train our model (i.e. allowing him to dynamically adapt its set of parameters  $\theta$ ), in order to produce an output label  $y_i$  for each of them that minimizes a quality metric, the Cost Function, which can be chosen to the sum of squared differences, the log-loss, or any other appropriate mathematical relationship between the estimate  $y_i$  and the correct label  $t_i$ .

Now that the basic Machine Learning concepts have been presented, I will discuss, henceforth, one of the most important algorithms that address the supervised classification problems, the *Artificial Neural Networks* that will be discussed in Section 2.1.2 as somewhat of a contextual introduction to the main theme of the thesis, which will be Recurrent Neural Networks (Section 2.1.3), namely the **Long Short-Term Memory Networks** (Section 2.1.4), both of which are improvements over the initial formulation of the ANNs. These two last networks branch even further from these set of problems, and are usually employed in *Deep Learning* tasks, where we try to extract even higher level information from data at the expense of increased model complexity.

#### 2.1.2 Artificial Neural Networks

Artificial Neural Networks (ANN) are mathematical structures that, as the name suggests, try to mimic the Human Brain. ANN's building blocks, like their biological counterpart, model the high-level behaviour of biological neurons, in the sense that they neglect unnecessary biological aspects (such as modeling all the voltages across the neuron and all its electromagnetic interactions), and only retain its fundamental underlying mathematical function, which is a weighted linear

combination of its inputs subject to a *activation function*, i.e. a function that outputs a decision value depending on its inputs. Mathematically, we have

$$y = f(\mathbf{w}^T \mathbf{x} + b_0) \tag{2.2}$$

where  $\mathbf{w} = [w_1 \ w_2 \ w_3 \ \dots \ w_n]$  is the input weight vector,  $\mathbf{x} = [x_1 \ x_2 \ x_3 \ \dots \ x_n]$  the input vector,  $b_0$  is the bias factor and  $f(\cdot)$  is the chosen activation function. Furthermore, we call the scalar quantity  $a = \mathbf{w}^T \mathbf{x} + b_0$  the **activation**, since its value determines how the activation function will behave. Figure 2.1 exemplifies the roles of these variables within our neuron model, and compares each part of it with the biological counterpart.



Figure 2.1: In Figure 2.1a. each input feature  $x_i$  is weighted by its corresponding weight  $w_i$ . During the training procedure, these weights are adjusted so that the output y approaches the target value. In Figure 2.1b, we see the diagram of an actual multi polar neuron. The dendrites, where the stimuli are received, plays a role similar to that of the input nodes. The axon transmits the signal to the synaptic terminals, that are similar to the y output. Source: Wikipedia

As far as the activation function is concerned, we can have several types. An immediate choice would be the **Binary Step Function** that outputs -1 if the activation is **below** a given threshold and 1 otherwise. There can also be **real valued activation functions**, whose output is not binary, but rather that of a continuously differentiable function, such as the logistic sigmoid  $\sigma(a) = \frac{1}{1+e^{-a}}$  or the hyperbolic tangent tanh(a). This aspect will prove useful for the usual training methods, that involve the computation of derivatives. In Figure 2.2 these activation functions are plotted.



Figure 2.2: Three different activation functions. As you can see, the hyperbolic tangent has the same extreme value as the sign step function, but has a smooth transition between them, which can be interpreted as a *soft decision* in the more ambiguous middle region, reflecting the degree of uncertainty on the decision. On the other hand, the sigmoid function goes from zero to one, and is also smooth like the hyperbolic tangent

.

A neuron by itself can be thought of as a simple linear regression, where we optimize the weight of each feature according to a target value, or function. While important in some applications, the main interest in ANN is to evaluate increasingly more complex models, and not a simple linear regression. This is achieved by *chaining* nodes to one another, connecting the output of a given node to one of the inputs of another. We call *layers* to a group of these nodes that occupy the same hierarchical position. There can be any number of layers, with any number of nodes, but most implementations generally have 3 layers: the *initial* layer, the *hidden* layer (in the middle) and the *output* layer. Figure 2.3 suggests a possible structure for a 3 layer ANN.



Figure 2.3: A three layer ANN. We have omitted some of the connections in the hidden layer, for simplification purposes.  $\mathbf{w}_1$  represents the weight matrix of the input layer,  $\mathbf{w}_2$  the weight matrix of the connections between the input layer and the hidden layer, and  $\mathbf{w}_3$  the weight connections between the hidden and the output layer.  $f_{ij}(\dots)$  is the activation function of the *j*-th neuron of the *i*-th layer. Since they can be different, I chose different indexes to each.

Regarding the training of ANNs, is is performed through a two-step process: first, a **feed-forward** step where the input is applied, and the activations are evaluated in succession up to the output neurons; then, the perform the **backpropagation** step, where we calculate the errors in each of the nodes, but now from the output to the input: the weights are updated and optimized using an iterative method called *Gradient Descent*, where if  $\tau$  is the current time step, the next update on the weight matrix  $\mathbf{w}^{(\tau+1)}$  is given by

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)}) \tag{2.3}$$

where  $E(\cdot)$  is the error function. As we can see, the weight matrix is moved in the direction that minimizes the error function the most, and  $\eta$  controls how fast this is achieved, being the reason why it is called the **learning rate**.

The computation of gradient of the error function comprises the evaluation of its derivatives with respect to each weight of all network connections,  $w_{ij}$ . They are

$$\frac{\partial E}{\partial w_{ii}} = \delta_j f(a_i) \tag{2.4}$$

where  $f(\cdot)$  is the activation function of the neuron and

$$\delta_j = f'(a_j) \sum_k w_{kj} \delta_k. \tag{2.5}$$

The interpretation of these equations is simple. If  $w_{ji}$  is the weight of the connection between the neuron j we are considering and a neuron i in a previous layer, then the sum over k relates to all the neurons in the *next* layer to which j connects: this way, since the update on  $w_{ji}$ , according to 2.3, is given by

$$w_{ji}^{(\tau+1)} = w_{ji}^{(\tau)} - \eta \frac{\partial E}{\partial w_{ji}}$$

$$(2.6)$$

we see that, from 2.4 it simply is the product of the error of the current neuron,  $\delta_j$ , with the output of the previous neuron  $f(a_i)$ . In turn, from 2.5, we see the recurrence relationship between it and the weighted sum of all posterior neurons that connect to it. Hence, the name backpropagation is now clear: we are, in fact, propagating the errors backwards into the neuron of interest, weighted by the corresponding weight, but now *backwards* instead of forward, as before. For the output units, the  $\delta_j$  is simply the difference between the produced output and the corresponding label for that sample. This two-step process is performed for every data point in out dataset. For a complete proof of the above formulas, see [6, chap. 5.3.1].

#### 2.1.3 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is, essentially, a regular ANN where some neurons (especially in the hidden layer) have *feedback connections* to themselves, i.e. their outputs are fed as inputs. The relevance of this different structure is the possibility to retain *sequence* information about the data. Before, each incoming data point only contributed to the training of the network, but no information about the correlation between themselves and the data points that preceded them did not influence the training step. They were regarded as if no temporal relationship existed, and therefore each data point is conditionally independent of any other. This is obviously not necessarily truth, and in fact there are many cases, where the correlation between data points is high for those closely spaced in time, in which it is actually completely false, as in video signals, audio signals, or other kinds of *temporal sequences* of data. Therefore, the feedback connection of the neuron to himself acts as a kind of *memory element* that takes into account in the present decision, the history of decisions previously taken, and hence the previous data. Figure ?? suggests a possible structure for a neuron of a hidden layer in an RNN, and also an alternate representation, where the structure is unfold through time.

The training of an RNN is usually performed using a variation of Backpropagation, called **Backpropagation Through Time** (BPTT), that as the name implies, performs the same backpropagation procedure discussed for the ANNs, but now taking into account the unfolding of the

network through a fixed training epoch T like Figure ??. Due to this very fact, this training procedure is memory and performance consuming, and so it will not be used in my final work, but instead a novel approach, the **Simulatneous Perturbation Stochastic Approximation**.

Even though RNNs outperform static ANNs in sequence recognition problems [7], they fail to retain long-term dependencies. Of course that the weight training process is itself a form of memory, but the problem is that the weight update is much slower than the activations [8], and therefore this memory only retains short-term dependencies. This is because of the so-called **Vanishing Gradients Problem** [9, 8], where the error decays exponentially through time, and the impact of previous incoming data points on the training of the weights, and thus the current decision, quickly decreases.

### 2.1.4 Long Short-Term Memory Networks

To overcome the issue of failing to remember long-term dependencies, Hochreiter and Schmidhüber proposed, in 1997, a novel approach to the RNNs called the *Long-Short Term Network* [3]. This section explains the main idea of this approach 2.1.4.1, and also how it is trained2.1.4.2, serving as a support basis for the work of this thesis. Although originally formulated in 1997, its formulation has been incrementally updated in [10] and [11], and the most current version is the one in [1]. One of the inital proposers of LSTM, Prof. Jürgen Schmidhüber, did a survey on the most common variations of the structure [2] last year, and this will the basis of this short theoretical presenation, as well as the work that will be developed in this thesis.

#### 2.1.4.1 Structure, Operation and Equations

A single LSTM neuron is presented in Figure 2.4. As we can see from the picture, we still have the recurrent connections from the regular RNNs, but now there are multiple entry points that control the flow of information through the network. Although ommitted from the picture, all the gates are biased, as is suggested in Equations ??. The main components, their role and relevance, are explained as follows



Figure 2.4: A complete LSTM neuron, with all the features as described in [1]. Source: [2]

- Input Gate this is the input gate, where the importance of each feature of the input vector at time t,  $\mathbf{x}^t$ , and the output vector at the previous time step  $\mathbf{y}^{t-1}$  is weighed in, producing an output  $\mathbf{i}^t$ .
- Block Input Gate as the name implies, this gate controls the flow of information from the input gate to the memory cell. It also receives the input vector and the previous output vector as inputs, but it does not have peephole connections and its dynamics are controlled by a different set of weights. The activation function of this gate can be any, but the most common choice is the Hyperbolic Tangent.
- Forget Gate its role is to control the contents of the Memory Cell, either to set them or reset them, using the *Hadamard Elementwise* matrix multiplication of its output at time t,  $\mathbf{c}^{(t)}$ , with the contents of the memory unit at the previous time step,  $\mathbf{c}^{(t-1)}$ . The activation function of this gate is **always sigmoid**.
- Output Block Gate this gate has a role very similar to that of the Block Input Gate, but now it controls the information flow *out* of the LSTM neuron, namely the activated Memory Cell output.
- **Memory Cell** the cornerstone of the LSTM neuron. This is the memory element of the neuron, where the previous state is kept, and updated accordingly to the dynamics of the gates that connect to it. Also, this is where the peephole connections come from:
- Output Activation the output of the Memory Cell goes through this activation function that, as the gate activation function, can be any, but the *hyperbolic tangent* is the most common choice.
- **Peepholes** direct connections *from* the memory cell that allow for gates to 'peep' the states of the memory cell. They were added after the initial 1997 formulation, and their absence was proven to have a minimal performance impact [2].

After these small conceptual definitions, that allow us to grasp some intuition on the operation of a *single* LSTM cell, I can present the overview of a *layer* of LSTM neurons and their formal mathematical formulation, that will be needed both for the high-level model and the HDL description. The operation of each set of gates of the layer is given by the following set of equations

$$\mathbf{z}^{(t)} = g(\mathbf{W}_z \mathbf{x}^{(t)} + \mathbf{R}_z \mathbf{y}^{(t-1)} + \mathbf{b}_z)$$
(2.7)

$$\mathbf{i}^{(t)} = \sigma(\mathbf{W}_i \mathbf{x}^{(t)} + \mathbf{R}_i \mathbf{v}^{(t-1)} + \mathbf{p}_i \odot \mathbf{c}^{(t-1)} + \mathbf{b}_i)$$
(2.8)

$$\mathbf{f}^{(t)} = \sigma(\mathbf{W}_f \mathbf{x}^{(t)} + \mathbf{R}_f \mathbf{y}^{(t-1)} + \mathbf{p}_f \odot \mathbf{c}^{(t-1)} + \mathbf{b}_f)$$
(2.9)

$$\mathbf{c}^{(t)} = \mathbf{i}^{(t)} \odot \mathbf{z}^{(t)} + \mathbf{f}^{(t)} \odot \mathbf{c}^{(t-1)}$$
(2.10)

$$\mathbf{o}^{(t)} = \sigma(\mathbf{W}_o \mathbf{x}^{(t)} + \mathbf{R}_o \mathbf{y}^{(t-1)} + \mathbf{p}_o \odot \mathbf{c}^{(t)} + \mathbf{b}_o)$$
(2.11)

$$\mathbf{c}^{(t)} = \mathbf{o}^{(t)} \odot h(\mathbf{z}^{(t)}) \tag{2.12}$$

The *i*-th element of the previous vectors in bold corresponds to the value of the gate of the *i*-th neuron of the layer, which is a very convenient and compact representation of the whole layer. Furthermore, if the layer has N LSTM neurons and M inputs (i.e. the size of the layer that precedes this), we see that the input weight matrices  $\mathbf{W}_*$  and the recurrent weight matrices  $\mathbf{R}_*$  all have size  $N \times M$ , and that the bias weight matrices  $\mathbf{b}_*$ , the peephole weight matrices  $\mathbf{p}_*$  and the matrices  $\mathbf{z}^{(t)}$  through  $\mathbf{c}^{(t)}$  have size  $N \times 1$ . Although useful for a high-level description in a programming language, the matrix representation may not be suitable for a direct HDL port. In that case, in order to represent the operation of the *i-single neuron*, all the above equations still hold, but instead of *vectors* of outputs, we will have a single scalar output, and we only use the appropriate row in the weight matrices  $\mathbf{W}_*$ ,  $\mathbf{R}_*$  and the remaining.

#### 2.1.4.2 Training – SPSA

Since this work aims to perform on-chip learning, it is important to find a suitable learning scheme. Since I am aiming at an hardware implementation, and although the memory resources of nowadays FPGAs are abundant, one must find an algorithm that uses the less memory as possible (at the smallest performance cost possible), in order to use the additional memory, for instance, to add additional LSTM cells that can improve the performance of our system. That being said, we see that the usual training algorithms for LSTM cells – i.e. Real-Time Recurrent Learning (RTRL), Truncated Backpropagation Through Time (BPTT) or a mixture of both [1] [3] [2] – usually involve the storage of the deltas of each layer for every time instant in the training epoch (from 0 to T), which is a highly non-scalable solution both in terms of memory and performance. A most efficient approach to training times series dependant structures like LSTM is the use of **Simultaneous Perturbation Stochastic Approximation** (SPSA) [12]. The weight update for the i-th

weight at the time step t is given by

$$\Delta w_i^{(t)} = \frac{J(\mathbf{w}^{(t)} + \beta \mathbf{s}^{(t)}) - J(\mathbf{w}^{(t)})}{\beta s_i^{(t)}}$$
(2.13)

where  $\beta$  is the magnitude of the perturbation to be introduced,  $\mathbf{s}^{(t)}$  is a *sign vector* whose *i*-th element,  $s_i^{(t)}$ , is either -1 or 1. This way, we see that every weight is **randomly** incremented either by  $-\beta$  or  $\beta$ , and we only need to keep a duplicate of the weight matrix with the perturbations, and we only need to evaluate the cost function twice per incoming sample. As for the *update rule* is concerned, we have

$$w_i^{(t+1)} = w_i^{(t)} - \eta \Delta w_i^{(t)}$$
 (2.14)

where  $\eta$  is the learning rate, and  $\Delta w_i^{(t)}$  is the update for the *i*-th weight evaluated in 2.13. According to the analysis performed in [13], performing a second-order Taylor expansion at  $\mathbf{w} = \mathbf{w}^{(t)}$ , and taking the expected value of the equation, we get that

$$E(\Delta w_i^{(t)}) = \frac{\partial J(\mathbf{w}^{(t)})}{\partial w_i^{(t)}}$$
(2.15)

that is, the weight update approximates the gradient of the cost function relative to that weight, and so the learning rule is a special form of *Stochastic Gradient Descent*.

One last comment to be made concerning the update rule, is the hipothetical need for **limits** to the weight values, when the update rule exceeds  $|w_{\text{max}}|$  (in that case, we set  $w_i^{(t+1)} = \pm w_{\text{max}}$ ), which sometimes might be needed if the behaviour of the weight update is not appropriate [13].

### 2.2 Proposed Solution

Given the previous discusson of how LSTM neurons work, their benefits when compared to other RNN architectures, this work aims to conceive an efficient dedicated FPGA implementation with on-chip learning that, hopefully, surpasses the training/testing time of an equivalent CPU implementation in a general high-level language or from a framework.

As discussed in Chapter 3, hitherto there is only one FPGA implementation of LSTM cells [4], but there is no on-chip learning on the architecture (the weights are learned offline, and loaded in the FPGA before operation, and they are not changed while the FPGA system is operating), and my proposed work will implement an efficient LSTM implementation *with* the ability to learn on-chip, which is an innovative and useful feature for the LSTM architecture on a dedicated hardware implementation. I will achieve that using the SPSA method described in 2.1.4.2, and inspired in the previous work of [13].

This solution will combine the benefits of a dedicated hardware implementation, where the inherent parallelism so obviously visible in the network architecture can be fully exploited something that, predominantly, yields a much higher performance in terms of computation time than a

15

naïve CPU implementation, as in [13] where a  $21\times$  speed-up against an ARM CPU was acheived.

### **Chapter 3**

### State of the Art

Over the course of this chapter, I am going to present an overview of the most recent developments related to the work of this thesis, both in terms of existing dedicated hardware implementations 3.2, the most relevant work in adapting suitable training algorithms to hardware 3.3 and also some of the most relevant applications of LSTM 3.1, which are not FPGA-based, but demonstrate how LSTM is useful by itself, and how well it competes with other Machine Learning algorithms in terms of long time-series dependences in data.

### 3.1 LSTM Applications (non-FPGA related)

LSTM Networks are nowadays one of the state of the art algorithms in deep-learning, and their performance is superior to that of other kinds of RNNs and Hidden Markov Models, both of which are generally used to address the same set of problems where LSTM are employed, namely predicting and producing classification decisions from time-series data **citation needed!**. A very comprehensive description of applications can be found in one of the initial authors webpage dedicated to the subject <sup>1</sup>. I will now enumerate some of the bleeding edge applications of LSTM.

- Handwriting Recognition and LSTM-based network [14], submitted by a team of the Technische Universität München, won the 2009 ICDAR Handwriting Recognition Contest, achieveing a recognition rate of up to 91% [15]. LSTMs have also proven to surpass HMM based models in terms of optical character recognition in printed text, as [16] suggests.
- Speech Recognition an architecture[17] proposed by Graves et al. in 2013 achieved an astonishing 17.7% of accuracy on the TIMT Phoneme Recognition Benchmark, which up to the date is a new record. Furthermore, it has also been used for large scale acoustic modelling of speech [18].
- Handwriting Synthesis a comprehensive study by Graves shows, among other sequence generation tasks such as Text Prediction, the use of LSTM to produce synthetic handwriting, that looks incredibly similar to human-produced handwriting [19].

<sup>&</sup>lt;sup>1</sup>http://people.idsia.ch/ juergen/rnn.html

• Translation – LSTM was used by Sutskever et al. (Google) to perform sequence-to-sequence translation on the WMT'14 dataset, translating English to French with a close to state of the art score [20].

- Biomedical Applications this network architecture was used in a protein homology detection scheme [21] using the SCOP 1.53 benchmark dataset, displaying a staggering performance when compared to other methods. Similarly, a recent article from 2015 by Sønderby et al. suggested the use of standalone LSTM and also Convolutional LSTM to perform subcellular localization of proteins, given solely their protein sequence, with an astounding accuracy of 0.902 [22]
- Music Analysis and Composition Lehner et al. proposed a low-latecy solution based on LSTM, suitable for real-time detection of a singing voice in music recordings [23] whose performance surpassed other baseline methods, with lower latency. In terms of music transcription from audio data, there is a study that proposes the use of LSTM cells to perform a transcription of piano recordings to musical notes [24], in order to automate music transcription. The model was trained with recordings of both acoustic pianos and synthesized pianos, and the labelling was performed using an associated MIDI file for each piece that was used in the training, showing promising results. [25] suggests the use of LSTM to perform autonomous computer music composition, and Eck and Schmidhüber proposed LSTM to perform Blues improvisation in [26].
- Video and Image Analysis Vinyals et al., at Google, proposed and LSTM network for image captioning, preceded by a Convolutional Neural Network to apply preprocessing to the images [27]. This way, the LSTM works as *sentence generator*, that captions the images, with state of the art performance. Besides image captioning, video captioning is also an interesting topic. Venugopalan et al. recently proposed a CNN + LSTM architecture to translate video sequences to natural language [28], using the Microsoft Research Video Description Corpus as a dataset. There are other similar studies that combine image and video captioning [29].

### 3.2 Hardware Implementations of LSTM

Hitherto, there is but one actual implementation of an LSTM network in hardware, published recently (November 2015) by Chang et al. [4] in the Computing Research Repository (CoRR). It consists on the proposal of an LSTM cell architecture for dedicated hardware, targeting a Xilinx® Zedboard implementation. It uses a character-level language model from Andrej Karpathy, written in Lua using the Torch7 framework (the Lua calls are implemented in C, so no performance is lost).

Although the article is not clear on whether there is active learning by the ARM CPU – the authors refer that the CPU loads the weights before operation, and that it changes them during

operation, although how and why that change is done is not even clearly explained, neither mathematically nor conceptually –, **there is no on-chip learning module** in the FPGA according to the description provided.

The implementation itself makes an extensive use of the Multiply-and-Accumulate units (MAC) of the FPGA, which since they are limited in number in the target platform, limits the number of neurons that we can deploy in parallel. The authors report a nearly 75% usage for only 8 LSTM neurons. Although an apparently excessive number, I am left to scrutinize whether the usage of MACs is compulsory to obtain a better performance than a CPU or if it can be discarded, allowing for smaller cells. This way, we can have a layer of more neurons occupying the same area and, hopefully, the same, or even less, resources within the FPGA.

### 3.3 Training Algorithms and Implementation

As stated in 3.2, the work of [4] does not feature on-chip learning at FPGA level, although there are a handful proposed solutions for it in recent literature. I will particularly look into the ones that use SPSA (see Section 2.1.4.2), since that is the training algorithm of choice for my proposed solution, and also to the ones that particularly apply SPSA to the training of Neural Networks.

The SPSA algorithm was initially published by Spall in [12], and its theoretical details are outlined in Section 2.1.4.2. As of applications of it to the training of general Neural Networks, the earliest examples come from 1995 and 1996 on [30, 31] where SPSA is used to train a VLSI *Analog* Neural Networks, a time where the memory resources of digital circuitry were limited, and so most of these structures were analog-based. Its adequacy was also established for **control problems**, such as those proposed in [32], where a Time Delay Neural Network is used to control an unknown plant in a linear feedback system.

In 2005, Maeda and Wakamura published a proposed SPSA hardware implementation [13] to train an Hopfield Neural Network in an FPGA (and thus a digital system), achieving promising results in an Altera ® FPGA. The article carefully delineates the approach taken, and also the hardware architecture designed, so it is a very good reference for the design that I will have to implement.

Furthermore, a 2013 article by Tavear et al. [5] proposes, for the first time, using SPSA to train LSTM Neurons, although the article focuses on proving the suitability of SPSA to LSTM, and no factual hardware implementation is done or proposed. The authors simply demonstrate the suitability using conceptual arguments and by building a software model of an SPSA-trained LSTM network, and by comparing both the performance and computing speed of their model with the results achieved by Hochreiter et al. in [21]. Since the forward phase in both regular LSTM and SPSA-trained LSTM is the same, the computation time suffers no performance penalty whatsoever and the learning ability is preserved to a high degree, showing that SPSA is a valid alternative do BPTT and other similar and more common training schemes.

State of the Art

#### 3.4 Final Overview

As we could attest from this small literature survey, although there is already an hardware implementation os LSTM, there is still a good deal of room for improvement by adding on chip-learning to the system, and also to restric it to a smaller use of the FPGA resources, allowing it to accomodate a more complex network. Furthermore, [5] shows that, at least for that particular case, the LSTM network doesn't suffer a great performance impact from using SPSA training, as opposed to the more common BPTT, and [13] showed that an SPSA hardware implementation is feasible. These three conclusions combined indicate that the idea of a hardware LSTM network with on-chip learning using SPSA is possible.

Besides the proposed hardware implementation, it would be advantageous to perform *bench-marks* to see how well it compares with software solutions, and Section 3.1 suggests a handful of examples to test my final solution.

### **Chapter 4**

### Work Plan

### 4.1 Approach and Main Tasks

Dwight Eisenhower, United States' President from 1953 to 1961 and Army Officer, said that 'no battle was ever won according to plan, but no battle was ever won without one'. Hence, although this initial planning draft might not be religiously followed in all its extent, it is important to break down the work in **tasks** so that I and my supervisors can keep track of what has to be done, and establishing deadlines for the small achievements that, together, will build up to the successful completion of the proposed work of this thesis. The detailed explanation of each task is provided in Section 4.1.1 and the **schedule** for them, where their approximate begin and end weeks are marked, is represented by a *Gantt Chart* in Section 4.1.2.

#### 4.1.1 Task Description

The work will be partitioned in *five* large groups of tasks, and each of these groups will be further partitioned in smaller subtaskswhich are the actual assignments that I will be completing. We can think of the group of task as a sort of *logical* grouping of the subtasks, according both to their relatedness and also to their precedence relationships. The task groups, and their related subtasks, are as follows

- T1 Preparation This stage is the initial phase of the work, where I will essentially setup my working environment and finish the Python modules for the LSTM network that will be used in the validation of my Verilog modules
  - T1.1 Complete Python Class with SPSA here I will finish the Python modules that I have already developed as a preparation for this work, whose details are presented in Section 5, by adding the SPSA training algorithm, as opposed to the current Backpropagation Throgh Time version.
  - T1.2 Reproduce the Benchmark Results with the Python Class after developing and testing the Python Class with SPSA, I will try to reproduce with it the results

22 Work Plan

of [5], in order for me to have a high-level model whose performance is comparable to [5, 21], and thus provide a good performance baseline for this and other future benchmark tests that I would like to run in order to compare my FPGA solution with a software solution.

- T1.3 Setup FPGA Development Board this is a farily technical step, where I setup my working environment, and make sure that everything is working correctly and is properly configured. As will be pointed out in Section 4.2, the FPGA development board of choice will be Xilinx ®Zedboard.
- T1.4 Familiarize with the Zedboard before any actual use, I should synthesize and run some sample designs, and perhaps write a small Verilog design, and go through the full design flow, from behavioural simulation, to synthesis. This will allow me to know exactly what steps to take when I will need to develop and test my Dissertation related modules, and focus more on the solving the problem at hand rather than solving technical details.
- **T2 HDL Module Development** this is where I will port the ideas from Section 2 to fully functional HDL blocks, using Verilog and targeting a Zedboard technology.
  - T2.1 Design Planning before performing and HDL coding, it is favourable to first assemble the system design strategy in paper. This is where, together with my supervisor, I will outline the internal organization of my system, and make the fundamental design decisions. Of course, this task will prolong during all the T2 extent, since some design changes to the original plan will have to be performed as I find bugs in my early Verilog implementations that force a design correction.
  - T2.2 Design SPSA Module this task comprises the Verilog description of the module that will update the network weights using SPSA, and also the development of a testbench to validate my description. Furthermore, I will compare the output of this model with the golden output produced by a standalone SPSA Python Class previously developed in T1.
  - T2.3 Design LSTM Neuron/Layer Module similar to the previous task, but now I am aiming to implement the actual LSTM Neuron or a layer of LSTM Neurons. This is a design decision that will need to be set in T2.1, to asses whether it is justifiable to have dedicated multiplication units assigned to each neuron, or we can perform matrix-like batch multiplications and treat the LSTM as a batch of Neurons, i.e. as a layer. Like in the previous task, a testbench will also be developed, and the module validate both with it and the Python golden output.
  - T2.4 Integrate and Validate both LSTM and SPSA modules here, after the modules designed in T2.2 and T2.3 are validated individually, I will integrate them together in a higher level module, and perform a behavioural simulations of the new

integrated module, as well as a verification procedure where the results from [5] will hopefully be reproduced and compared against the Python golden output.

- T3 FPGA Synthesis of the Verilog Code after having Verilog code that passed all the validation tests and whose correct functionality is, therefore, asserted, I will synthesize the Verilog modules to FPGA RTL blocks, and perform a new simulation on these RTL syntesized code, that unlike the previous behavioural simulation, now has timing information resulting from the FPGA block mapping, and the place and routing of those mapped blocks.
  - T3.1 Synthesis of the SPSA Module the module developed in T2.2 is now syet bar height=.6thesized and validated
  - T3.2 Synthesis of the LSTM Neuron/Layer Module same as the previous task, but now on the module of T2.3
  - T3.3 Integration of the Modules and Validation the modules synthesized in the previous tasks are integrated and validated using the same testbench as in T2.4.
- T4 Benchmarking the Design at this stage, we have a fully functional FPGA design of an LSTM network with SPSA on-chip training. Henceforth, we will need to benchmark its performance by comparing it to [5, 21], and prove its usefulness by using it in one or two usecases from 3, which are interesting uses of Deep Learning techniques such as these.
  - T4.1 Benchmark the Performance/Computing Time perform a side-by-side comparison of our model with the software counterpart of [5, 21] and with the only hardware implementation of LSTM of [4].
  - T4.2 Reproduce some of the state of the art applications of LSTM in order to be able to show real-life, user-oriented, applications, I understand it is a good idea to try to reproduce some of the state of the art usages of LSTM, outlined in Section ??. It is too early to commit myself to which of these applications I will use, and how many, because that will depend on how well the work on tasks T2 and T3 is flowing. Nevertheless, I intend to at least imperment one or two of them.
- T5 Write the Dissertation Final Report while all the stages of work are developing, this task will obviously be active, by documenting all the design decisions, milestones acheived and logging all the results that come from the implementation. A couple of weeks before the deadline, this will be the only active task, since producing a clear and well-written document requires concentration and undivided attention.

### 4.1.2 Tentative Task Scheduling

Here is the Gantt Chart that depicts the temporal extension of the tasks described in Section 4.1.1.

Work Plan



26 Work Plan

#### 4.2 Software and Hardware resources to be used

#### 4.2.1 Hardware Resources

The main hardware resource that I will use for my solution is the **Xilinx® Zedboard**, whose core is a Zynq-7000 System-on-Chip that features both an FPGA core and an ARM ® Cortex<sup>TM</sup> -A9 processor. It is an FPGA development board that features several I/O interfaces such as Etherner, HDMI output, VGA output, Audio IOs, USB and UART, among other capabilites, making it easy to prototype additional features to the core design (i.e. for building the note transcription system of [24], and to make it operate in real-time, I would need an audio input, which I can easily get from the Audio IO, without investing time in connecting the codec and programming the FPGA to interface it).

#### 4.2.2 Software Resources

In terms of software resources related to the FPGA development, I will need an **HDL simulation tool** to simulate and verify the Verilog code that I will develop, and a **Synthesis tool** to synthesize the Verilog files into RTL code, and to map that description into the resources available in the FPGA. The simulation tool used will be Questasim<sup>TM</sup>, the successor of the well-known Modelsim<sup>TM</sup>, which is an industry-standard tool in the area. For the synthesis tool, I will use Xilinx®Vivado Suite, also an industry-standard tool, and that is the only alternative, since the Zedboard is also produced by Xilinx®. These two tools are proprietary, but there are licenses available at FEUP. Furthermore, I am planning to use FloPoCo <sup>1</sup> to generate some of the modules that perform arithmetic operations, to hasten the developing effort/time, and to ensure that they are the most efficient possible, and don't pose a performance bottleneck to my design; this software is freeware, and a standard in Academia.

For the Python Classes, I will make use of the 3.3 version of the Python distribution, the high-performance numerical computation library Numpy<sup>2</sup>, which is a standard in scientific computing, and the Matplotlib<sup>3</sup> plotting library. All of this software is also free to use, and open-source. I may also need to use the PyBrain<sup>4</sup> framework, since it has an LSTM implementation.

To write the final Disseration report, I will use LATEX(this report itself was also written in LATEX), and mostly use Tikz, Inkscape and Matplotlib to produce figures and drawings. Once again, all of this is free to use.

<sup>1</sup> http://flopoco.gforge.inria.fr/

<sup>&</sup>lt;sup>2</sup>http://www.numpy.org/

<sup>&</sup>lt;sup>3</sup>http://matplotlib.org/

<sup>&</sup>lt;sup>4</sup>http://pybrain.org/

# Chapter 5

# Early Work

28 Early Work

## Appendix A

# **Loren Ipsum**

Depois das conclusões e antes das referências bibliográficas, apresenta-se neste anexo numerado o texto usado para preencher a dissertação.

- A.1 O que é o Loren Ipsum?
- A.2 De onde Vem o Loren?

30 Loren Ipsum

### References

- [1] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. *Neural Networks*, pages 5–6, 2005.
- [2] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. *CoRR*, abs/1503.04069, 2015.
- [3] S. Hochreiter and J. Schmidhuber. Long short-term memory. *Neural Computation*, 9(8):1735 80, 1997.
- [4] Andre Xian Ming Chang, Berin Martini, and Eugenio Culurciello. Recurrent neural networks hardware implementation on FPGA. *CoRR*, abs/1511.05552, 2015.
- [5] R. Tavear, J. Dedic, D. Bokal, and A. Zemva. Transforming the 1stm training algorithm for efficient fpga -based adaptive control of nonlinear dynamic systems. *Informacije MIDEM*, 43(2):131 8, 2013.
- [6] Christopher M. Bishop. *Pattern Recognition and Machine Learning (Information Science and Statistics)*. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
- [7] Yoshua Bengio. *Artificial Neural Networks and Their Application to Sequence Recognition*. PhD thesis, McGill University, Montreal, Que., Canada, Canada, 1991. UMI Order No. GAXNN-72116 (Canadian dissertation).
- [8] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. 2001.
- [9] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. *IEEE Transactions on Neural Networks*, 5(2):157 166, 1994. Gradient based learning algorithms;Information latching;Input/output sequences;Learning algorithms;Parametric dynamical system;Recurrent neural network training;Temporal contingencies;.
- [10] F.A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: continual prediction with lstm. *Neural Computation*, 12(10):2451 71, 2000. LSTM;learning algorithms;recurrent neural networks;LSTM networks;continual input streams;forget gate;.
- [11] F.A. Gers and J. Schmidhuber. Recurrent nets that time and count. volume vol.3, pages 189 94, Los Alamitos, CA, USA, 2000. recurrent neural networks;RNN;timing;counting;time intervals;sequential tasks;motor control;rhythm detection;long short-term memory;LSTM;peephole connections;internal cells;multiplicative gates;discrete time steps;stable sequences;highly-nonlinear precisely-timed spike sequences;.

32 REFERENCES

[12] J.C. Spall. Adaptive stochastic approximation by the simultaneous perturbation method. volume vol.4, pages 3872 – 9, Piscataway, NJ, USA, 1998. stochastic approximation;loss functions;stochastic search;Newton-Raphson algorithm;Hessian matrix;iterative method;optimization;root-finding;parameter estimation;adaptive estimation;.

- [13] Y. Maeda and M. Wakamura. Simultaneous perturbation learning rule for recurrent neural networks and its fpga implementation. *IEEE Transactions on Neural Networks*, 16(6):1664 72, 2005. perturbation learning rule; recurrent neural network; FPGA; dynamic information processing; feedforward neural network; recursive learning scheme; correlation learning; analog learning; oscillatory solution; hardware implementation; Hopfield neural network; field-programmable gate array;
- [14] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. *Pattern Analysis and Machine Intelligence, IEEE Transactions on*, 31(5):855–868, May 2009.
- [15] E. Grosicki and H. El Abed. Icdar 2009 handwriting recognition competition. In *Document Analysis and Recognition*, 2009. ICDAR '09. 10th International Conference on, pages 1398–1402, July 2009.
- [16] Thomas M. Breuel, Adnan Ul-Hasan, Mayce Ali Al-Azawi, and Faisal Shafait. High-performance ocr for printed english and fraktur using 1stm networks. In *Proceedings of the 2013 12th International Conference on Document Analysis and Recognition*, ICDAR '13, pages 683–687, Washington, DC, USA, 2013. IEEE Computer Society.
- [17] A. Graves, A.-R. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In *Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on*, pages 6645–6649, May 2013.
- [18] Hasim Sak, Andrew W. Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In *INTERSPEECH 2014*, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 338–342, 2014.
- [19] Alex Graves. Generating sequences with recurrent neural networks. *CoRR*, abs/1308.0850, 2013.
- [20] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. volume 4, pages 3104 3112, Montreal, QC, Canada, 2014.
- [21] Sepp Hochreiter, Martin Heusel, and Klaus Obermayer. Fast model-based protein homology detection without alignment. *BIOINFORMATICS*, 23(14):1728–1736, JUL 15 2007.
- [22] Soren Kaae Sonderby, Casper Kaae Sonderby, Henrik Nielsen, and Ole Winther. Convolutional lstm networks for subcellular localization of proteins. volume 9199, pages 68 80, Mexico City, Mexico, 2015.
- [23] B. Lehner, G. Widmer, and S. Bock. A low-latency, real-time-capable singing voice detection method with lstm recurrent neural networks. pages 21 5, Piscataway, NJ, USA, 2015.
- [24] Sebastian Bock and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. pages 121 124, Kyoto, Japan, 2012.

REFERENCES 33

[25] Andres E. Coca, Debora C. Correa, and Liang Zhao. Computer-aided music composition with 1stm neural network and chaotic inspiration. pages International Neural Network Society (INNS); IEEE Computational Intelligence Society (IEEE–CIS) –, Dallas, TX, United states, 2013.

- [26] D. Eck and J. Schmidhuber. Finding temporal structure in music: Blues improvisation with lstm recurrent networks. volume 2002-January, pages 747 756, Martigny, Switzerland, 2002.
- [27] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. *CoRR*, abs/1411.4555, 2014.
- [28] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks. *CoRR*, abs/1412.4729, 2014.
- [29] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. *CoRR*, abs/1411.4389, 2014.
- [30] Yutaka Maeda, Hiroaki Hirano, and Yakichi Kanata. A learning rule of neural networks via simultaneous perturbation and its hardware implementation. *Neural Networks*, pages 251–259, 1995.
- [31] G. Cauwenberghs. An analog vlsi recurrent neural network learning a continuous-time trajectory. *Neural Networks, IEEE Transactions on*, 7(2):346–361, Mar 1996.
- [32] Y. Maeda and Rui J.P. de Figueiredo. Learning rules for neuro-controller via simultaneous perturbation. *Neural Networks, IEEE Transactions on*, 8(5):1119–1130, 1997.