The most important type of sequential data is time series data, which is a series of data points listed in time order. Sequential data is key for applications such as speech recognition, sentiment analysis, language translation, and so on.

The field of genomics, which deals with perhaps the most natural language of all, a sequence of nucleotides (A, G, C, and T), is very well suited for RNN applications, such as predicting proteins from DNA sequences, predicting the binding domains of proteins, predicting the interaction between enhancers and promoters, predicting structural motifs, predicting base calls from sequencing instruments, optimizing coding sequences for increased protein production, predicting function, and so on. In this chapter, you will learn what RNNs are, how they differ from FNNs and CNNs, and why they are better suited for sequential data. By the end of this chapter, you will understand why RNNs are important in DL, the different types of RNN architectures and when to use each, and the different RNN applications in genomics.

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781804615447/files/image/B18958_06_001.jpg)

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781804615447/files/image/B18958_06_002.jpg)

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781804615447/files/image/B18958_06_003.jpg)

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781804615447/files/image/B18958_06_004.jpg)

Another good way of illustrating how RNNs work is with an example. Imagine you give a standard FNN a DNA sequence (ATGCGAG) and it processes one nucleotide at a time. By the time it reaches the last nucleotide (in this example, ‘G’), it has forgotten everything about the preceding nucleotides (‘A’, ‘T’, ‘G’, ‘C’, ‘G’, ‘A’), so the FNN can't predict which nucleotide would come next. This information is important for sequential data such as DNA sequences because there is a structure to the sequence.
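The contrast between a stateless network and a recurrent one can be sketched in a few lines of NumPy. This is a minimal illustration rather than the chapter's model: the weight names (`W_xh`, `W_hh`, `b_h`), the hidden size of 3, the random initialization, and the second sequence `CCCCCCG` are all assumptions made for the example.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # column order of the one-hot encoding (an arbitrary choice)

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) array of 0s and 1s."""
    encoded = np.zeros((len(seq), len(NUCLEOTIDES)))
    for position, base in enumerate(seq):
        encoded[position, NUCLEOTIDES.index(base)] = 1.0
    return encoded

# Hypothetical, untrained weights; the names and sizes are illustrative only.
rng = np.random.default_rng(0)
hidden_size = 3
W_xh = rng.normal(size=(4, hidden_size))            # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden (the "memory")
b_h = np.zeros(hidden_size)

def stateless_last_output(seq):
    """Process only the current nucleotide: nothing is carried between positions."""
    x_last = one_hot(seq)[-1]
    return np.tanh(x_last @ W_xh + b_h)

def rnn_last_hidden(seq):
    """Fold every nucleotide into a hidden state that is passed along the sequence."""
    h = np.zeros(hidden_size)
    for x_t in one_hot(seq):
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h

seq_a = "ATGCGAG"
seq_b = "CCCCCCG"  # different history, same final nucleotide

# The stateless layer sees only the final 'G', so both sequences look identical.
same_without_memory = np.allclose(stateless_last_output(seq_a),
                                  stateless_last_output(seq_b))
# The recurrent update depends on everything that came before the final 'G'.
same_with_memory = np.allclose(rnn_last_hidden(seq_a), rnn_last_hidden(seq_b))
print(same_without_memory, same_with_memory)
```

The only difference between the two functions is the `h @ W_hh` term, which is what lets earlier nucleotides influence the representation of the last one.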

##### Understanding RNNs through Transcription Factor Binding Site (TFBS) predictions

Transcription factors (TFs) play a key role in gene regulation, particularly during transcription, where they bind to promoter regions and initiate the process of transcription. Transcription Factor Binding Sites (TFBSs) are short DNA sequences in gene regulatory regions (such as promoters) and typically range in size from 5 bp to 20 bp. Each TF binds to a different TFBS and controls gene regulation in the cell. Thus, identifying TFBSs is key to understanding cellular and molecular processes. Several experimental methods, such as ChIP-Seq, can identify TFBSs, and databases such as ENCODE have made TFBS information available to researchers. However, ChIP-Seq technologies are expensive, slow, and laborious, and cannot find patterns in the identified TFBSs. Computational methods have therefore become the go-to approach for this very important problem. Given a particular sequence, predicting whether it is a TFBS or not is a core task in bioinformatics.

In the following toy example, let’s see how we can use RNNs to predict a TFBS from DNA sequences. TFBS prediction can be thought of as a binary classification problem: whether a TFBS can be found in a DNA sequence or not, which we represent as 1 or 0, respectively. The input to the RNN model is the DNA sequences and their targets, which have labels of 1 or 0. The goal here is to build a highly accurate classification model using an RNN that can be used to

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781804615447/files/image/B18958_06_015.jpg)

As shown in Figure 6.15, samples 1 to 3 are positive (label=1), meaning that their DNA sequences contain a TFBS, whereas samples 4 and 5 are negative (label=0):

- The first thing we must do is convert each DNA sequence into a one-hot encoded matrix. To refresh your memory, one-hot encoding converts each nucleotide of the DNA sequence into a binary vector of 0s and 1s, with a single 1 marking which nucleotide is present.
- After the input is fed into the RNN, it produces another matrix. As we just learned, at each time step, the RNN takes an input vector and the previous hidden state vector and produces the new hidden state recursively. In this case, each position in the sequence is a time step. After the whole sequence has been processed, the RNN produces an output vector at the last time step of the input sequence (Figure 6.16).
- Then, the output vector from the RNN is fed into a softmax activation function in the last layer of the network, which learns the mapping between the hidden space and the target label (0 or 1). The final output is a probability that indicates whether the DNA sequence is a TFBS or a non-TFBS.

As with other FNNs and RNNs, we calculate the loss (cross-entropy loss), and the model is trained until the loss is acceptably low. This minimization of the loss function is achieved using the backpropagation through time (BPTT) algorithm. We can use dropout as a regularization method for the model to prevent overfitting:

![](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781804615447/files/image/B18958_06_016.jpg)
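The three steps above (one-hot encoding, the recurrent update, and a softmax output scored with cross-entropy) can be sketched as a forward pass in NumPy. This is a hedged illustration only: the two sequences are made-up placeholders rather than the samples in Figure 6.15, the parameter names and hidden size are assumptions, and the weights are random and untrained, so the printed probabilities carry no biological meaning. Training with BPTT and dropout is not shown.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # column order of the one-hot encoding (an arbitrary choice)

def one_hot(seq):
    """Return a (len(seq), 4) matrix with a single 1 per row marking the nucleotide."""
    encoded = np.zeros((len(seq), len(NUCLEOTIDES)))
    for position, base in enumerate(seq):
        encoded[position, NUCLEOTIDES.index(base)] = 1.0
    return encoded

def softmax(logits):
    """Numerically stable softmax over a 1D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def init_params(hidden_size, rng):
    """Random, untrained weights; names and scales are illustrative assumptions."""
    return {
        "W_xh": rng.normal(scale=0.5, size=(4, hidden_size)),
        "W_hh": rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
        "b_h": np.zeros(hidden_size),
        "W_hy": rng.normal(scale=0.5, size=(hidden_size, 2)),
        "b_y": np.zeros(2),
    }

def predict_proba(seq, p):
    """Run the RNN over the sequence and return [P(non-TFBS), P(TFBS)]."""
    h = np.zeros(p["b_h"].shape[0])
    for x_t in one_hot(seq):  # each position is one time step
        h = np.tanh(x_t @ p["W_xh"] + h @ p["W_hh"] + p["b_h"])
    # Only the hidden state from the last time step feeds the classifier.
    return softmax(h @ p["W_hy"] + p["b_y"])

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return float(-np.log(probs[label]))

params = init_params(hidden_size=8, rng=np.random.default_rng(42))

# Hypothetical (sequence, label) pairs, not taken from the book's figure.
toy_samples = [("ATGCGAGT", 1), ("GGCTAACC", 0)]

losses = []
for seq, label in toy_samples:
    probs = predict_proba(seq, params)
    losses.append(cross_entropy(probs, label))
    print(seq, label, probs.round(3), round(losses[-1], 3))

mean_loss = float(np.mean(losses))  # the quantity that training would minimize
```

In a real model, the gradient of `mean_loss` with respect to every weight would be propagated back through all time steps, which is what BPTT does.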

- The human reference genome was first divided into 200 bp non-overlapping segments.
- For each of the 690 ChIP-seq experiments, a segment was classified as positive (label=1) if 100 bp to 200 bp of it (that is, at least half) belonged to a peak; otherwise, it was classified as negative (label=0).
- An additional 800 bp of flanking sequence (400 bp on either side) was added to the 200 bp segment to create a 1,000 bp input sequence.
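The binning and labeling scheme described above can be sketched for a single experiment as follows. This is an illustration under stated assumptions, not the original preprocessing code: the function names, the half-open `(start, end)` peak format, the "at least 100 bp of overlap" threshold, and the peak coordinates are all hypothetical. The real dataset repeats the labeling for each of the 690 experiments, giving a 690-element label vector per segment.

```python
BIN_SIZE = 200      # length of each non-overlapping segment
FLANK = 400         # extra context added on each side of the segment
MIN_OVERLAP = 100   # assumption: at least half of the segment must lie in a peak

def overlap_with_peaks(bin_start, bin_end, peaks):
    """Number of bases in [bin_start, bin_end) covered by peaks.

    Peaks are half-open (start, end) intervals and are assumed not to overlap
    each other.
    """
    return sum(
        max(0, min(bin_end, peak_end) - max(bin_start, peak_start))
        for peak_start, peak_end in peaks
    )

def make_examples(sequence_length, peaks):
    """Return (window_start, window_end, label) for every usable 200 bp segment."""
    examples = []
    for bin_start in range(0, sequence_length - BIN_SIZE + 1, BIN_SIZE):
        bin_end = bin_start + BIN_SIZE
        window_start = bin_start - FLANK
        window_end = bin_end + FLANK
        # Skip segments whose 1,000 bp window would run off the sequence.
        if window_start < 0 or window_end > sequence_length:
            continue
        label = int(overlap_with_peaks(bin_start, bin_end, peaks) >= MIN_OVERLAP)
        examples.append((window_start, window_end, label))
    return examples

# Hypothetical 2,000 bp region with one made-up ChIP-seq peak.
peaks = [(450, 620)]
examples = make_examples(2000, peaks)
for window_start, window_end, label in examples:
    print(window_start, window_end, label)
```

Each window is `BIN_SIZE + 2 * FLANK = 1000` bases long, which matches the `(1000, 4)` input shape used by the model once the window is one-hot encoded.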

- After feeding input to the network through the input layer, the next layer will be a CNN. You might be wondering why we are using a CNN since the goal is to leverage an RNN for a genomics problem. This is because the CNN layer acts as a motif scanner, as we learned in the previous chapter.
- The output from the CNN is fed into the BiLSTM layer. The output from the BiLSTM layer is then flattened and fed into a fully connected layer.
- In the output layer of the network, a sigmoid function is applied. The final output is a 690-dimensional vector, where each element corresponds to one ChIP-seq experiment.

In [1]:
import numpy as np
from sklearn import metrics
import pandas as pd

In [2]:
X_train = np.load('../Chapter06/data/X_train.npy.zip')['X_train']
y_train = np.load('../Chapter06/data/y_train.npy.zip')['y_train']

In [3]:
X_train.shape

(10000, 1000, 4)

In [4]:
y_train.shape

(10000, 690)

In [5]:
X_test = np.load('../Chapter06/data/X_test.npy.zip')['X_test']
y_test = np.load('../Chapter06/data/y_test.npy.zip')['y_test']

In [6]:
X_test.shape

(1000, 1000, 4)

In [7]:
y_test.shape

(1000, 690)

In [8]:
from keras.models import Sequential
from keras.models import Model
from keras.layers import Dense, Dropout, Activation, Flatten, Layer, Input
from keras.layers import Conv1D, MaxPooling1D
from keras.layers import LSTM
from keras.layers import Bidirectional
from keras.callbacks import ModelCheckpoint, EarlyStopping


In [18]:
input_data = Input(shape=(1000,4))

In [19]:
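# Convolution layer: 320 filters of width 26 scan the one-hot input for motifs
output = Conv1D(320, kernel_size=26, activation='relu')(input_data)
# Downsample along the sequence axis (default pool_size=2), then apply dropout
output = MaxPooling1D()(output)
output = Dropout(0.2)(output)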

In [20]:
# BiLSTM reads the motif features in both directions; return_sequences=True keeps every time step
output = Bidirectional(LSTM(320, return_sequences=True))(output)
output = Dropout(0.5)(output)

In [21]:
flat_output = Flatten()(output)

In [22]:
FC_output = Dense(695)(flat_output)
FC_output = Activation('relu')(FC_output)

In [23]:
# Output layer: one sigmoid unit per ChIP-seq experiment (690 independent binary labels)
output = Dense(690)(FC_output)
output = Activation('sigmoid')(output)

In [24]:
model = Model(inputs=input_data, outputs=output)

In [29]:
model.compile(loss='binary_crossentropy', optimizer='adam')

In [30]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1000, 4)]         0         
                                                                 
 conv1d_3 (Conv1D)           (None, 975, 320)          33600     
                                                                 
 max_pooling1d_2 (MaxPooling  (None, 487, 320)         0         
 1D)                                                             
                                                                 
 dropout_2 (Dropout)         (None, 487, 320)          0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 487, 640)         1640960   
 nal)                                                            
                                                                 
 dropout_3 (Dropout)         (None, 487, 640)          0     
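The output shapes and parameter counts in the summary can be checked by hand. The sketch below reproduces them with plain arithmetic; the helper function names are made up for this example, and the formulas assume the Keras defaults used in the notebook (`padding='valid'` and a stride of 1 for `Conv1D`, `pool_size=2` for `MaxPooling1D`, and a standard LSTM cell with four gates and a bias).

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Sequence length after a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

def conv1d_params(in_channels, filters, kernel_size):
    """One kernel_size x in_channels weight block per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def max_pool_output_length(input_length, pool_size=2):
    """Sequence length after pooling with stride equal to pool_size and 'valid' padding."""
    return (input_length - pool_size) // pool_size + 1

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

conv_length = conv1d_output_length(1000, 26)   # 975
conv_weights = conv1d_params(4, 320, 26)       # 33600
pool_length = max_pool_output_length(conv_length)  # 487
# A bidirectional wrapper holds two independent LSTMs and concatenates their outputs.
bilstm_weights = 2 * lstm_params(320, 320)     # 1640960
bilstm_width = 2 * 320                         # 640

print(conv_length, conv_weights, pool_length, bilstm_weights, bilstm_width)
```

Because `return_sequences=True`, the BiLSTM keeps all 487 time steps, so the `Flatten` layer that follows produces a vector of 487 × 640 values per input sequence.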

In [31]:
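# Save the model after every epoch (save_best_only=False keeps the latest, not the best)
checkpoints = ModelCheckpoint(filepath='./model/bilstm_model.hdf5', verbose=1, save_best_only=False)
# Stop if val_loss has not improved for 10 epochs; with epochs=2 below, this never triggers
earlystopper = EarlyStopping(monitor='val_loss', patience=10, verbose=1)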

In [32]:
history = model.fit(X_train, y_train, batch_size=100, 
                    epochs=2, shuffle=True, verbose=1, validation_split=0.1, 
                    callbacks=[checkpoints,earlystopper])

Epoch 1/2
Epoch 1: saving model to ./model/bilstm_model.hdf5
Epoch 2/2
Epoch 2: saving model to ./model/bilstm_model.hdf5
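The notebook imports `sklearn.metrics`, but this extract ends before any evaluation step. As a hedged sketch of how a multi-label model of this kind might be scored, the code below computes a ROC AUC per label and averages over the labels that can be scored. Everything here is an assumption for illustration: the function names are hypothetical, the arrays are tiny synthetic stand-ins for `y_test` and `model.predict(X_test)` rather than real results, and the AUC is implemented directly in NumPy to keep the example self-contained (`sklearn.metrics.roc_auc_score` computes the same quantity).

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns NaN when a label has only one class present,
    because the AUC is undefined in that case.
    """
    positives = y_score[y_true == 1]
    negatives = y_score[y_true == 0]
    if len(positives) == 0 or len(negatives) == 0:
        return np.nan
    diff = positives[:, None] - negatives[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def per_label_auc(y_true, y_pred):
    """Score each output column (one ChIP-seq experiment each) independently."""
    return np.array([roc_auc(y_true[:, j], y_pred[:, j])
                     for j in range(y_true.shape[1])])

# Synthetic stand-ins with 4 samples and 3 labels; not real model output.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 0]])
y_pred = np.array([[0.9, 0.2, 0.1],
                   [0.3, 0.8, 0.2],
                   [0.7, 0.4, 0.3],
                   [0.1, 0.6, 0.4]])

aucs = per_label_auc(y_true, y_pred)
mean_auc = float(np.nanmean(aucs))  # ignore labels that could not be scored
print(aucs, mean_auc)
```

Skipping single-class labels matters for data like this: with 690 experiments and a small test set, some labels may have no positive examples at all.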


RNNs are a special type of neural network that is well suited for sequential data such as time series, audio, video, and text. Research has shown that RNNs improve performance on sequential data compared to other architectures such as FNNs and CNNs. The key to an RNN is the sequence memory state, which helps it store information from previously analyzed inputs; this is good for sequential signal analysis and predictive analysis. In this chapter, we learned how RNNs are different from FNNs and CNNs. We understood the different types of RNNs and what makes them good for sequential data analysis by looking at a few examples. RNNs, as you may have noticed, are good for mapping a fixed or variable-sized input sequence to a fixed or variable-sized output; we have seen several examples to understand this.

We also looked at how RNNs can help with genomics tasks and understood the different architectural types of RNNs. Bidirectional RNNs, LSTMs, and GRUs are variants of RNNs that can capture long-term dependencies, thereby retaining information across very long sequences, which are very common in genomics.

You were also introduced to the different RNN types and their applications in various domains, such as image captioning, language translation, and others. Finally, we looked at how RNNs are used to solve some of the key problems in genomics, such as TF binding site detection, miRNA-mRNA sequence modeling, gene expression analysis, histone modifications, base calling, and more. In the next chapter, we will look at another exciting neural network architecture called autoencoders, which has a lot of potential applications in genomics.