## DNA Sequence Classification [CNN + GRU]
* https://www.kaggle.com/code/zakarii/dna-sequence-classification-cnn-gru/notebook

### About :
In this project, we explore bioinformatics through deep learning. A promoter is a short region of DNA (100–1,000 bp) where transcription of a gene by RNA polymerase begins. It is typically located directly upstream of, or at the 5′ end of, the transcription initiation site. Promoter defects have been linked to many human diseases, including diabetes, cancer, and Huntington's disease. Classifying promoters has therefore become an active problem that has attracted many researchers in the bioinformatics field. We will try to classify DNA sequences as promoter or non-promoter using machine learning and neural networks.

#### It includes :
* Importing data from the repository
* Converting text inputs to numerical data
* Building and training classification algorithms
* Comparing and contrasting classification algorithms

### Step 1: Importing the Dataset
The following code cells import the necessary libraries and load the dataset from the repository as a Pandas DataFrame.

In [1]:
import pandas as pd
import numpy as np
import os

import joblib

#from keras import utils
#from keras import utils as np_utils
from keras.utils import np_utils
from sklearn.model_selection import train_test_split, GridSearchCV
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv1D, MaxPooling1D, MaxPooling2D, Conv2D, LSTM, GRU, Bidirectional
from keras import regularizers
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.wrappers.scikit_learn import KerasClassifier
import keras

### Step 2: Preprocessing the Dataset
The data is not in a usable form; as a result, we will need to process it before using it to train our algorithms.

In [2]:
DATA_PATH = "../data/bioinformatics/kaggle/DNA-Sequence-Classification-CNN-plus-GRU/"
!ls -al $DATA_PATH

total 9888
drwxrwxr-x 2 dave dave    4096 Sep 29 12:52 .
drwxrwxr-x 4 dave dave    4096 Sep 28 22:14 ..
-rw-rw-r-- 1 dave dave 3896353 Apr 20  2021 NonPromoterSequence.txt
-rw-rw-r-- 1 dave dave 1154447 Sep 28 22:15 NonPromoterSequence.txt.zip
-rw-rw-r-- 1 dave dave     108 Sep 28 22:17 notes.data_bioinformatics_kaggle_DNA-Sequence-Classification-CNN-plus-GRU
-rw-rw-r-- 1 dave dave 3885053 Apr 20  2021 PromoterSequence.txt
-rw-rw-r-- 1 dave dave 1155693 Sep 28 22:12 PromoterSequence.txt.zip
-rw-rw-r-- 1 dave dave     341 Sep 29 12:51 results.zip
-rw-rw-r-- 1 dave dave    4450 Sep 29  2022 test_predictions.csv


In [3]:
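# Splitting on '>' puts sequence lines in the first column ('Unnamed: 0') and header text in the second
df = pd.read_csv(DATA_PATH + 'NonPromoterSequence.txt', sep = '>')
# Header rows have no value in the first column, so this keeps only the sequence rows
df.dropna(subset=['Unnamed: 0'], how='all', inplace=True)
df.reset_index(inplace = True)
df.drop(['EP 1 (+) mt:CoI_1; range -400 to -100.', 'index'], axis = 1, inplace=True) #data cleaning after error found
df.rename(columns={'Unnamed: 0': "sequence"}, inplace = True)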
df['label'] = 0
display(df)
display(df.shape)

|  | sequence | label |
|---|---|---|
| 0 | TAATTACATTATTTTTTTATTTACGAATTTGTTATTCCGCTTTTAT... | 0 |
| 1 | ATTTTTACAAGAACAAGACATTTAACTTTAACTTTATCTTTAGCTT... | 0 |
| 2 | AGAGATAGGTGGGTCTGTAACACTCGAATCAAAAACAATATTAAGA... | 0 |
| 3 | TATGTATATAGAGATAGGCGTTGCCAATAACTTTTGCGTTTTTTGC... | 0 |
| 4 | AGAAATAATAGCTAGAGCAAAAAACAGCTTAGAACGGCTGATGCTC... | 0 |
| ... | ... | ... |
| 11295 | TGGTAAAAAATTGTACACCTAACTAGTGCCTTCATGTATACCACCA... | 0 |
| 11296 | AGTGCAACTGGAGCCGTGCCGTGACCCACAGAGATCGCCCACTCGA... | 0 |
| 11297 | GCATGGATTTCATATTATCTTAATCGACTTGCTTTTATAAAATAGG... | 0 |
| 11298 | GTGACCAGGTTTTGCTCTAATGCGAAGTACGGATTGGGTAGAGATA... | 0 |


(11300, 2)
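
The `sep = '>'` trick above relies on each record being a header line that starts with `>` followed by its sequence. As a rough alternative, the sketch below parses that layout explicitly with the standard library. It assumes a FASTA-like file; the helper name `read_fasta_like` and the two sample records are hypothetical and not part of the notebook.

```python
import io

def read_fasta_like(handle):
    """Yield (header, sequence) pairs from a FASTA-like text stream.

    Assumes header lines start with '>' and that a sequence may span
    one or more of the lines that follow its header.
    """
    header, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record sample standing in for NonPromoterSequence.txt
sample = io.StringIO(">rec 1\nTAATTACA\nTTATT\n>rec 2\nATTTTTAC\n")
records = list(read_fasta_like(sample))
```

The resulting list of pairs could be passed to `pd.DataFrame(records, columns=["header", "sequence"])` to obtain a frame similar to the one shown above.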

In [4]:
df2 = pd.read_csv(DATA_PATH + 'PromoterSequence.txt', sep = '>', )
df2.dropna(subset=['Unnamed: 0'], how='all', inplace=True)
df2.reset_index(inplace = True)
df2.drop(['EP 1 (+) mt:CoI_1; range -100 to 200.', 'index'], axis = 1, inplace=True)
df2.rename(columns={'Unnamed: 0': "sequence"}, inplace = True)
df2['label'] = 1

display(df2)
display(df2.shape)

|  | sequence | label |
|---|---|---|
| 0 | TTAATTTGTCCTTATTTGATTAAGAAGAATAAATCTTATATATAGA... | 1 |
| 1 | ATAGCTCAAATTGCTTTATTAGTATTAGAATCAGCTGTAGCTATAA... | 1 |
| 2 | AAGCTTCCCTTTAATGTGCTCCTTGTGAATACAGCATTACAATGCC... | 1 |
| 3 | TATGTAGAATCTGTACAAGTATCTGTGTTTGGACAATGGCATGTGT... | 1 |
| 4 | ACATATTACTGCATACAGGTCTCAAATTATAAAATGACACTCGTGG... | 1 |
| ... | ... | ... |
| 11295 | CGACAAAGTTTGATCCATGTGCATTCTTGGCGCCTTATCGATAGCT... | 1 |
| 11296 | CATATCTACATCTCGCTTGCTCCTTCCCTTTCGCTGCGTGTGTGTG... | 1 |
| 11297 | ATACCGCGGAAGCGCAAAAGTACCAGAATTTCCCTGGTATCGCGCT... | 1 |
| 11298 | ATTATTCCGAATTCTTTTATCAGATTTAAATATGGGAAACACTTTA... | 1 |


(11300, 2)

In [5]:
df = pd.concat([df, df2], axis = 0 )
df.shape

(22600, 2)

In [6]:
for seq in df['sequence']:
    if 'N' in seq:
        display(df.loc[df['sequence'] == seq])

|  | sequence | label |
|---|---|---|
| 1822 | NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGAATTC... | 0 |


In [7]:
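# Note: the concat above kept each frame's original index, so index 1822 occurs
# twice (once per class) and this call drops both rows (22600 -> 22598).
df.drop([1822], inplace = True)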
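
Because `pd.concat` above keeps each frame's original index, the index value `1822` appears once per class in the combined frame, and dropping by that value removes both rows. That would account for the 22598 rows reported later instead of 22599. The sketch below reproduces the behaviour on toy data and shows one possible way to remove only the rows that contain `N`. The frames `a` and `b` are hypothetical.

```python
import pandas as pd

a = pd.DataFrame({"sequence": ["ACGT", "NNGT", "TTGA"], "label": 0})
b = pd.DataFrame({"sequence": ["GGCA", "CCAT", "ATAT"], "label": 1})

# Default concat keeps the original indices, so 0, 1 and 2 each appear twice.
stacked = pd.concat([a, b], axis=0)
dropped_by_index = stacked.drop([1])   # removes BOTH rows whose index is 1

# With ignore_index=True every row gets a unique index, and a boolean
# mask removes only the sequences that actually contain 'N'.
unique = pd.concat([a, b], axis=0, ignore_index=True)
cleaned = unique[~unique["sequence"].str.contains("N")].reset_index(drop=True)
```

In the toy example, `dropped_by_index` loses a valid class-1 row along with the `N` row, while `cleaned` keeps it.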

In [8]:
for seq in df['sequence']:
    if 'N' in seq:
        display(df.loc[df['sequence'] == seq])
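
The loop above checks only for `N`. Since the one-hot encoder used later looks up each character in a dictionary with keys `A`, `T`, `C` and `G`, any other symbol would raise a `KeyError`. A minimal sketch of a broader check follows; the helper name `invalid_symbols` and the example sequences are hypothetical.

```python
VALID_BASES = set("ACGT")

def invalid_symbols(seq):
    """Return the set of characters in seq that are not A, C, G or T."""
    return set(seq) - VALID_BASES

# Hypothetical sequences for illustration
seqs = ["ACGTAC", "NNNNGAATTC", "ACGRYT"]
report = {s: sorted(invalid_symbols(s)) for s in seqs if invalid_symbols(s)}
```

Here `report` maps each offending sequence to the unexpected symbols it contains, which makes it easier to decide whether to drop or mask them.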

In [10]:
sequence = list(df.loc[:, 'sequence'])
encoded_list = []
sequence
#encoded_list

['TAATTACATTATTTTTTTATTTACGAATTTGTTATTCCGCTTTTATAATAAATTATTTTGAAAATAATTGAATCATAAAGATAAATATAAATAGTATTAATTATAATATATATATAATTATAACTTTTTTTTCAATTTTTGGATTATTTTTAATTTCTTTATTTTATTTTATATTTTAAGGCTTTAAGTTAATAAAACTAATAACCTTCAAAGCTATAAATAAAGAAATTTCTTTAAGCCTTAGTAAAACTTACTCCTTCAAAATTGCAGTTTGATATCATTATTGACTATAAGACCTAAT',
 'ATTTTTACAAGAACAAGACATTTAACTTTAACTTTATCTTTAGCTTTACCTTTATGATTATGTTTTATATTATATGGATGAATTAATCATACACAACATATATTTGCTCATTTAGTTCCTCAAGGAACACCCGCTATTCTTATACCTTTTATAGTATGTATTGAAACTATTAGAAATATTATTCGACCTGGAACATTAGCTGTTCGATTAACTGCTAATATAATTGCTGGACATTTATTATTAACTCTTTTAGGAAATACAGGATCTTCTATATCTTATATATTAATAACATTTTTATTAA',
 'AGAGATAGGTGGGTCTGTAACACTCGAATCAAAAACAATATTAAGATAAAAATAGCGCGCACGGCAAGTGTTGCATGGAAGAAGATGAGATCAATTTAGATTCTTTGGAGATTGCTCTTTTTAACGCGACTACCATTTCATTGATATTATTTTACAAAAATGTTCCTGGAACATTTTAGACTCCATCGGTGGTGTCTTCTTTCTTTTTTCTTTTAACATTAGCCAATTGATTGGATGTGGAATCAGAACTGAAAACATTTAAACGATATCTACATAAATACTTCCGAGGTTTTTAATGGTA',
 'TATGTATATAGAGATAGGCGTTGCCAATAACTTTTGCGTTTTTTGCTTAAAAATAATATTGTATCGCCGAGGACAAAAAT

In [19]:
def encode_seq(s):
    Encode = {'A':[1,0,0,0],'T':[0,1,0,0],'C':[0,0,1,0],'G':[0,0,0,1]}
    return [Encode[x] for x in s]

for i in sequence:
    x = encode_seq(i)
    encoded_list.append(x)

X = np.array(encoded_list)
X.shape

(22598, 301, 4)
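
The encoding above builds nested Python lists one character at a time. As an illustration only, the same A, T, C, G channel order can be produced with a NumPy lookup table. This sketch makes one assumption that differs from the notebook: unknown symbols become all-zero rows instead of being removed beforehand. The names `LOOKUP` and `one_hot` are hypothetical.

```python
import numpy as np

# Same channel order as the Encode dictionary: A, T, C, G
ALPHABET = "ATCG"

# 256-entry lookup table: byte value -> channel index, -1 for anything else
LOOKUP = np.full(256, -1, dtype=np.int64)
for channel, base in enumerate(ALPHABET):
    LOOKUP[ord(base)] = channel

def one_hot(seq):
    """Return a (len(seq), 4) float32 array; unknown symbols give all-zero rows."""
    codes = LOOKUP[np.frombuffer(seq.encode("ascii"), dtype=np.uint8)]
    out = np.zeros((len(seq), 4), dtype=np.float32)
    known = codes >= 0
    out[np.nonzero(known)[0], codes[known]] = 1.0
    return out

# Two short hypothetical sequences of equal length
X_demo = np.stack([one_hot(s) for s in ["ATCG", "GGNA"]])
```

Stacking works only when all sequences have the same length, which matches the fixed length of 301 reported in the shape above.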

# X = X.reshape(X.shape[0], 301, 4, 1)  # 4-D reshape (e.g. for Conv2D); not applied, since the Conv1D model below expects (samples, 301, 4)

In [21]:
y = df['label']
y.shape

(22598,)

In [22]:
X.shape

(22598, 301, 4)

### Step 3: Training and Testing Neural Networks
Now that we have preprocessed the data, we can split it into training and testing sets and start to deploy different convolutional neural network architectures. It is relatively easy to test multiple models using grid search; as a result, we will compare and contrast the performance using GridSearchCV over many values.



In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [30]:
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
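
`to_categorical` turns the integer labels 0 and 1 into two-column one-hot rows, which is what the two-unit output layer used below expects. A NumPy sketch of the same transformation, with a hypothetical helper name `to_one_hot`:

```python
import numpy as np

def to_one_hot(labels, num_classes=2):
    """NumPy stand-in for to_categorical: integer labels -> one-hot rows."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.eye(num_classes, dtype=np.float32)[labels]

y_demo = to_one_hot([0, 1, 1, 0])
recovered = y_demo.argmax(axis=1)   # back to integer class labels
```

Taking `argmax` along axis 1 reverses the encoding, which is useful later when comparing predictions with the true labels.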

In [31]:
params = {
    'first_node': [128, 64],
    'second_node': [32, 64],
    'alpha': [0.001, 0.01],
    'first_filter': [9, 16, 32], 
    'dropout': [0.1, 0.2, 0.5]
}
#used for GridSearchCV
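
In the cells shown here the dictionary above is only defined, not searched. To get a feel for the cost of a full grid search over it, the sketch below enumerates every combination with the standard library. The 3-fold cross-validation count is an assumption made for illustration, not a setting taken from the notebook.

```python
from itertools import product

params = {
    "first_node": [128, 64],
    "second_node": [32, 64],
    "alpha": [0.001, 0.01],
    "first_filter": [9, 16, 32],
    "dropout": [0.1, 0.2, 0.5],
}

# Cartesian product of all value lists, one dict per candidate configuration
grid = [dict(zip(params, combo)) for combo in product(*params.values())]

# Assuming 3-fold cross-validation (hypothetical choice), each candidate is trained 3 times
n_fits = len(grid) * 3
```

With 2 x 2 x 2 x 3 x 3 = 72 candidates, an exhaustive search trains many models, which is worth weighing against a 115-epoch training budget per model.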

In [32]:
gru_model = Sequential()

gru_model.add(Conv1D(filters = 27, kernel_size = (4), activation = 'relu', input_shape = (301, 4)))
gru_model.add(MaxPooling1D(pool_size= (3)))
gru_model.add(Dropout(0.2))

gru_model.add(Conv1D(filters = 14, kernel_size = (2), activation = 'relu', padding = 'same'))
#cnn_model.add(MaxPooling1D(pool_size= (1)))
#cnn_model.add(Dropout(0.2))



gru_model.add(Bidirectional(GRU(128, activation = 'relu')))
gru_model.add(Dropout(0.2))
gru_model.add(Dense(128, activation = 'relu'))
gru_model.add(Dense(64, activation = 'relu'))
gru_model.add(Dense(64, activation = 'relu'))
gru_model.add(Dense(16, activation = 'relu', kernel_regularizer = regularizers.l2(0.01)))
gru_model.add(Dense(2, activation = 'sigmoid'))

gru_model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

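# Note: early_stop is created here but is not passed to fit() (there is no callbacks=[early_stop]),
# so training runs for all 115 epochs. The test split also serves as validation data.
early_stop = keras.callbacks.EarlyStopping(monitor = 'val_accuracy', min_delta = 0.0005, patience=8, 
                                           restore_best_weights=True )
history = gru_model.fit(X_train, y_train, batch_size = 128, validation_data=(X_test, y_test), 
                        epochs=115)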
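
To see what the recurrent layer actually receives, it helps to trace the sequence length through the convolution and pooling layers defined above. The sketch below applies the usual output-length formulas for `valid` and `same` padding. The helper names `conv1d_len` and `maxpool1d_len` are hypothetical, and pooling is assumed to use a stride equal to the pool size, which is the Keras default.

```python
def conv1d_len(n, kernel, padding="valid", stride=1):
    """Output length of a 1D convolution under 'valid' or 'same' padding."""
    if padding == "same":
        return -(-n // stride)            # ceil(n / stride)
    return (n - kernel) // stride + 1

def maxpool1d_len(n, pool, stride=None):
    """Output length of 1D max pooling with 'valid' padding."""
    stride = pool if stride is None else stride
    return (n - pool) // stride + 1

steps = conv1d_len(301, kernel=4)                     # first Conv1D, 27 filters
steps = maxpool1d_len(steps, pool=3)                  # MaxPooling1D
steps = conv1d_len(steps, kernel=2, padding="same")   # second Conv1D, 14 filters
gru_input_shape = (steps, 14)                         # (timesteps, features) fed to the GRU
bigru_output_units = 2 * 128                          # forward and backward states concatenated
```

Under these assumptions the 301-base input is reduced to 99 timesteps of 14 features before the bidirectional GRU, whose concatenated output has 256 units.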



Epoch 1/115
...
Epoch 115/115
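
The log runs through epoch 115/115, which is what one would expect given that `early_stop` is created but not handed to `fit` through a `callbacks` argument. For reference, the rule that a patience-based early stop applies can be sketched in plain Python. This is a simplified illustration using the `patience` and `min_delta` values from the cell above, not the Keras implementation, and the validation accuracies are invented.

```python
def early_stop_epoch(val_scores, patience=8, min_delta=0.0005):
    """Return (stop_epoch, best_epoch), 1-indexed, for a higher-is-better metric.

    Training stops once `patience` consecutive epochs fail to beat the best
    score by more than `min_delta`. stop_epoch is None if that never happens.
    """
    best, best_epoch, wait = float("-inf"), None, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score - min_delta > best:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation accuracies: improvement stalls after epoch 3
scores = [0.70, 0.80, 0.85] + [0.85] * 10
stop_epoch, best_epoch = early_stop_epoch(scores)
```

With these invented scores the rule would stop at epoch 11 and point back to epoch 3 as the best one, which is the epoch whose weights `restore_best_weights=True` is meant to recover.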


In [33]:
pred = gru_model.predict(X_test)  # assumed argument: the held-out test split
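
The model ends in two sigmoid units, so each prediction is a pair of scores that need not sum to 1. One common way to turn such scores into class labels is to take the index of the larger score per row. The sketch below uses invented arrays in place of real model output, with the column order [non-promoter, promoter] taken from the 0/1 labels assigned earlier.

```python
import numpy as np

# Hypothetical model output: one row per sequence, columns = [non-promoter, promoter]
pred_demo = np.array([[0.1, 0.9], [0.8, 0.3], [0.6, 0.4], [0.2, 0.7]])
# Hypothetical one-hot ground truth in the same column order
true_demo = np.array([[0, 1], [1, 0], [0, 1], [0, 1]])

pred_labels = pred_demo.argmax(axis=1)    # predicted class index per row
true_labels = true_demo.argmax(axis=1)    # undo the one-hot encoding
accuracy = float((pred_labels == true_labels).mean())
```

In this toy case three of the four rows agree, giving an accuracy of 0.75.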

In [37]:
df = pd.read_csv(DATA_PATH + 'test_predictions.csv' ) #loading full test set
df.head()

|  | 0.000000000000000000e+00 |
|---|---|
| 0 | 1.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
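
The column name `0.000000000000000000e+00` in the output above suggests that the file has no header row and that `read_csv` used the first prediction as the header. That is an inference from the output, not something the notebook states. Assuming this is the case, passing `header=None` keeps the first value as data. The in-memory text below is a hypothetical stand-in for `test_predictions.csv`, and the column name `prediction` is also hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for test_predictions.csv: one value per line, no header row
raw = "0.000000000000000000e+00\n1.000000000000000000e+00\n0.000000000000000000e+00\n"

# Default behaviour: the first line is consumed as the column name
default_read = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data; names= supplies a readable column name
explicit_read = pd.read_csv(io.StringIO(raw), header=None, names=["prediction"])
```

The default read returns one row fewer than the file contains, while the explicit read keeps all of them.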
