Question 1: Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN, and a vector-to-sequence RNN?

A few applications of a sequence-to-sequence RNN are speech recognition, machine translation, video captioning, and question answering.

For a sequence-to-vector RNN: classifying music samples by music genre, analyzing the sentiment of a book review, predicting what word an aphasic patient is thinking of based on readings from brain implants, predicting the probability that a user will want to watch a movie based on her watch history (this is one of many possible implementations of collaborative filtering).

For a vector-to-sequence RNN: image captioning, creating a music playlist based on an embedding of the current artist, generating a melody based on a set of parameters, locating pedestrians in a picture (e.g., a video frame from a self-driving car’s camera).

Question 2: How many dimensions must the inputs of an RNN layer have? What does each dimension represent? What about its outputs?

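The inputs of an RNN layer must have three dimensions: the batch size, the number of time steps, and the number of input features per time step. The outputs are also three-dimensional when the layer returns the full sequence: the first two dimensions are the same, and the last one is the number of neurons in the layer. If the layer returns only the output of the last time step, the outputs are two-dimensional: the batch size and the number of neurons.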
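The shapes can be checked with a minimal NumPy sketch of a basic tanh RNN layer. The function `simple_rnn` and its weight names are hypothetical, written only for illustration; the `return_sequences` argument mimics the Keras flag of the same name rather than calling Keras itself.

```python
import numpy as np

def simple_rnn(x, w_x, w_h, b, return_sequences=False):
    """Run a basic tanh RNN layer over x of shape [batch, steps, features]."""
    batch_size, n_steps, _ = x.shape
    n_units = w_h.shape[0]
    h = np.zeros((batch_size, n_units))          # initial hidden state
    outputs = []
    for t in range(n_steps):
        # Each step sees the current features and the previous state.
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h + b)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)         # [batch, steps, units]
    return h                                     # [batch, units]

rng = np.random.default_rng(0)
batch_size, n_steps, n_features, n_units = 5, 10, 2, 32
x = rng.normal(size=(batch_size, n_steps, n_features))
w_x = rng.normal(scale=0.1, size=(n_features, n_units))
w_h = rng.normal(scale=0.1, size=(n_units, n_units))
b = np.zeros(n_units)

all_steps = simple_rnn(x, w_x, w_h, b, return_sequences=True)
last_step = simple_rnn(x, w_x, w_h, b, return_sequences=False)
print(all_steps.shape)  # (5, 10, 32)
print(last_step.shape)  # (5, 32)
```

The last time step of the full output sequence is the same array that the layer returns when it outputs only its final state.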

Question 3: If you want to build a deep sequence-to-sequence RNN, which RNN layers should have return_sequences=True? What about a sequence-to-vector RNN?

To build a deep sequence-to-sequence RNN, set return_sequences=True on every RNN layer. For a sequence-to-vector RNN, set return_sequences=True on every RNN layer except the top one, which should return only its last output.

A related issue is handling variable-length sequences. For variable-length input sequences, the simplest option in TensorFlow 1.x is to set the sequence_length parameter when calling the static_rnn() or dynamic_rnn() functions. Another option is to pad the shorter inputs (e.g., with zeros) to make them the same size as the longest input (this may be faster than the first option if the input sequences all have very similar lengths).

For variable-length output sequences, if you know in advance the length of each output sequence, you can use the sequence_length parameter (for example, consider a sequence-to-sequence RNN that labels every frame in a video with a violence score: the output sequence will be exactly the same length as the input sequence). If you don’t know in advance the length of the output sequence, you can use the padding trick: always output the same size sequence, but ignore any outputs that come after the end-of-sequence token (by ignoring them when computing the cost function).
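The padding trick, together with ignoring padded steps in the cost function, can be sketched in NumPy as follows. The helpers `pad_sequences` and `masked_mse` are hypothetical names defined here for illustration, and the all-ones predictions are a stand-in for real model outputs.

```python
import numpy as np

def pad_sequences(sequences, pad_value=0.0):
    """Pad 1D sequences to the longest length; also return the true lengths."""
    lengths = np.array([len(s) for s in sequences])
    padded = np.full((len(sequences), lengths.max()), pad_value)
    for i, s in enumerate(sequences):
        padded[i, :len(s)] = s
    return padded, lengths

def masked_mse(y_true, y_pred, lengths):
    """Mean squared error over real time steps only; padded steps are ignored."""
    steps = np.arange(y_true.shape[1])
    mask = steps[None, :] < lengths[:, None]     # True where the step is real
    squared_error = (y_true - y_pred) ** 2
    return squared_error[mask].mean()

targets, lengths = pad_sequences([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])
predictions = np.ones_like(targets)              # stand-in model outputs
print(targets.shape)                             # (3, 3)
print(masked_mse(targets, predictions, lengths))
```

Only the six real values contribute to the loss here; the three padded positions are masked out before averaging.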

Question 4: Suppose you have a daily univariate time series, and you want to forecast the next seven days. Which RNN architecture should you use?

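Use a sequence-to-vector RNN: stack RNN layers (LSTM or GRU layers work well) with return_sequences=True on all but the top RNN layer, and finish with an output layer of seven neurons, one per forecast day. Alternatively, use a sequence-to-sequence RNN in which every time step predicts the next seven days; this provides an error signal at every time step during training.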
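Whatever architecture is chosen, the training data has to pair a window of past days with the seven days that follow it. A minimal sketch, assuming a hypothetical helper `make_windows` and an arbitrary 30-day input window (the series below is a placeholder, not real data):

```python
import numpy as np

def make_windows(series, n_inputs, n_ahead=7):
    """Split a 1D series into (inputs, targets) pairs for multi-step forecasting."""
    n_samples = len(series) - n_inputs - n_ahead + 1
    x = np.stack([series[i:i + n_inputs] for i in range(n_samples)])
    y = np.stack([series[i + n_inputs:i + n_inputs + n_ahead]
                  for i in range(n_samples)])
    # Add a trailing feature axis: [batch, time steps, 1 feature].
    return x[..., np.newaxis], y

series = np.arange(100, dtype=float)   # placeholder for real daily values
x, y = make_windows(series, n_inputs=30)
print(x.shape)  # (64, 30, 1)
print(y.shape)  # (64, 7)
```

Each row of `y` holds the seven target values, which matches an output layer with seven neurons.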

Question 5: What are the main difficulties when training RNNs? How can you handle them?

RNNs are hard to train mainly because they are not feedforward neural networks.

In a feedforward network, signals move in one direction only, from the input layer to the output layer. By contrast, recurrent neural networks are “feedback” networks: signals can travel both forward and back, and the network contains loops in which values are fed back in. These loops are what give recurrent neural networks their memory.

Because the same weights are applied at every time step, signals and gradients can grow or decay as they are propagated; this is the exploding and vanishing gradients problem. It gets worse with long sequences and more time steps. A second difficulty is that the network's memory is short-term: information from early time steps tends to fade.

Careful weight initialization may help, but these challenges are built into the recurrent design. Other common remedies are gradient clipping, a smaller learning rate, a saturating activation function such as tanh, and gated cells such as LSTM or GRU, which preserve information over longer sequences.
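The growth or decay of signals over many time steps can be illustrated with a toy NumPy sketch. It assumes a deliberately simplified recurrent weight matrix (a scaled identity, with no activation function), and it also shows gradient clipping, one common mitigation for exploding gradients. Both function names are hypothetical.

```python
import numpy as np

def signal_norm_after(n_steps, scale, size=8):
    """Norm of a vector after n_steps multiplications by scale * identity."""
    w = scale * np.eye(size)        # toy recurrent weight matrix
    h = np.ones(size)
    for _ in range(n_steps):
        h = w @ h                   # same weights reused at every time step
    return np.linalg.norm(h)

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

print(signal_norm_after(50, scale=0.9))   # shrinks toward zero
print(signal_norm_after(50, scale=1.1))   # grows rapidly
clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=1.0)
print(clipped)                            # [0.6 0.8]
```

Clipping keeps the direction of the gradient and only limits its size, so a single very large update cannot derail training.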

Question 6: Can you sketch the LSTM cell’s architecture?

An LSTM cell consists of three gates: the forget gate, the input gate, and the output gate.

Forget Gate:

The forget gate is responsible for removing information from the cell state. Information that the LSTM no longer needs, or that is of less importance, is removed by element-wise multiplication with a filter. This is required for optimizing the performance of the LSTM network.

Input Gate:

The input gate is responsible for adding information to the cell state. This is a three-step process, as shown in the diagram below:

1. Regulating which values need to be added to the cell state by applying a sigmoid function. This is very similar to the forget gate and acts as a filter for all the information from h_t-1 and x_t.
2. Creating a vector containing all possible values that can be added (as perceived from h_t-1 and x_t) to the cell state. This is done using the tanh function, which outputs values from -1 to +1.
3. Multiplying the regulatory filter (the sigmoid gate) by the created vector (the tanh output) and then adding this useful information to the cell state.

Output Gate:

The functioning of the output gate can again be broken down into three steps:

1. Creating a vector by applying the tanh function to the cell state, thereby scaling the values to the range -1 to +1.
2. Making a filter using the values of h_t-1 and x_t, so that it can regulate which values are output from the vector created above. This filter again uses a sigmoid function.
3. Multiplying this regulatory filter by the vector created in step 1, and sending the result out as the output and also as the hidden state for the next time step.

<img src = "image10.png">
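The three gates described above can be sketched as a single time step in NumPy. This is a minimal illustration, assuming one combined weight matrix `w` for all four pre-activations and a gate ordering (forget, input, candidate, output) chosen here for readability; real libraries may lay out their weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM time step. w maps [h_prev, x_t] to the four gate pre-activations."""
    n_units = h_prev.shape[-1]
    z = np.concatenate([h_prev, x_t], axis=-1) @ w + b
    f = sigmoid(z[..., 0 * n_units:1 * n_units])   # forget gate
    i = sigmoid(z[..., 1 * n_units:2 * n_units])   # input gate (filter)
    g = np.tanh(z[..., 2 * n_units:3 * n_units])   # candidate values in [-1, 1]
    o = sigmoid(z[..., 3 * n_units:4 * n_units])   # output gate (filter)
    c_t = f * c_prev + i * g                       # forget, then add new information
    h_t = o * np.tanh(c_t)                         # filtered view of the cell state
    return h_t, c_t

rng = np.random.default_rng(42)
batch_size, n_features, n_units = 4, 3, 5
w = rng.normal(scale=0.1, size=(n_units + n_features, 4 * n_units))
b = np.zeros(4 * n_units)
h = np.zeros((batch_size, n_units))
c = np.zeros((batch_size, n_units))
for x_t in rng.normal(size=(6, batch_size, n_features)):  # 6 time steps
    h, c = lstm_step(x_t, h, c, w, b)
print(h.shape, c.shape)  # (4, 5) (4, 5)
```

The line computing `c_t` combines the forget gate and the input gate, and the line computing `h_t` is the output gate applied to the squashed cell state.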

Question 7: Why would you want to use 1D convolutional layers in an RNN?

1D convolutional layers extract local 1D patches (subsequences) from a sequence and can identify local patterns within the convolution window. Because the same transformation is applied to every patch, a pattern learned at one position can also be recognized at a different position, which makes 1D convolutional layers translation invariant.

Another interesting use case is to combine 1D convolutional layers with RNNs. Suppose you have a sequence so long that it cannot realistically be processed by an RNN. In such cases, a 1D convolutional layer can be used as a preprocessing step: it shortens the sequence through downsampling while extracting higher-level features, which can then be passed to the RNN as input.
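The downsampling effect can be seen in a minimal NumPy sketch of a strided 1D convolution. The function `conv1d`, the kernel width of 4, the stride of 2, and the 16 filters are all assumptions chosen for illustration, with random untrained kernels.

```python
import numpy as np

def conv1d(x, kernels, stride=1):
    """Valid 1D convolution. x: [steps, features]; kernels: [width, features, filters]."""
    width, _, n_filters = kernels.shape
    n_out = (x.shape[0] - width) // stride + 1
    out = np.empty((n_out, n_filters))
    for j in range(n_out):
        patch = x[j * stride:j * stride + width]          # local subsequence
        # The same kernels are applied to every patch.
        out[j] = np.tensordot(patch, kernels, axes=([0, 1], [0, 1]))
    return out

rng = np.random.default_rng(0)
long_sequence = rng.normal(size=(1000, 1))                # 1000 steps, 1 feature
kernels = rng.normal(size=(4, 1, 16))                     # width 4, 16 filters
shorter = conv1d(long_sequence, kernels, stride=2)
print(shorter.shape)  # (499, 16)
```

With a stride of 2 the RNN that follows would see roughly half as many time steps, each described by 16 features instead of one.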

Question 8: Which neural network architecture could you use to classify videos?

To classify videos based on the visual content, one possible architecture could be to take (say) one frame per second, then run each frame through a convolutional neural network, feed the output of the CNN to a sequence-to-vector RNN, and finally run its output through a softmax layer, giving you all the class probabilities. For training you would just use cross entropy as the cost function. If you wanted to use the audio for classification as well, you could convert every second of audio to a spectrogram, feed this spectrogram to a CNN, and feed the output of this CNN to the RNN (along with the corresponding output of the other CNN).
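The data flow of the visual branch can be traced with a shape-level NumPy sketch. It is not a trained model: a single flatten-and-project step stands in for the CNN (an assumption made to keep the example short), the weights are random, and `classify_video` and all the sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_video(frames, w_cnn, w_x, w_h, w_out):
    """frames: [n_frames, height, width, channels] -> class probabilities."""
    # Stand-in for a CNN: flatten each frame and project it to a feature vector.
    features = np.tanh(frames.reshape(len(frames), -1) @ w_cnn)
    h = np.zeros(w_h.shape[0])
    for f_t in features:                 # sequence-to-vector RNN over frames
        h = np.tanh(f_t @ w_x + h @ w_h)
    return softmax(h @ w_out)            # one probability per class

rng = np.random.default_rng(1)
n_frames, height, width, channels = 10, 8, 8, 3
n_features, n_units, n_classes = 16, 12, 4
frames = rng.random((n_frames, height, width, channels))
w_cnn = rng.normal(scale=0.05, size=(height * width * channels, n_features))
w_x = rng.normal(scale=0.1, size=(n_features, n_units))
w_h = rng.normal(scale=0.1, size=(n_units, n_units))
w_out = rng.normal(scale=0.1, size=(n_units, n_classes))
probs = classify_video(frames, w_cnn, w_x, w_h, w_out)
print(probs.shape)  # (4,)
```

Only the final hidden state reaches the softmax layer, which is what makes the recurrent part sequence-to-vector.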