## DNA Sequence Classification [CNN + GRU] - dhole vs wolf
* https://www.kaggle.com/code/zakarii/dna-sequence-classification-cnn-gru/notebook

### About :
In this project, we explore bioinformatics through deep learning. A promoter is a short region of DNA (100–1,000 bp) where transcription of a gene by RNA polymerase begins. It is typically located directly upstream of, or at the 5′ end of, the transcription initiation site. Promoter mutations have been linked to many human diseases, including diabetes, cancer, and Huntington's disease. Classifying promoters has therefore attracted a lot of attention in the bioinformatics field. This notebook adapts the promoter-classification approach from the notebook linked above to a different task: classifying coding sequences (CDS) as dhole or wolf with a neural network that combines convolutional and GRU layers.

#### It includes :
* Importing data from the repository
* Converting text inputs to numerical data
* Building and training classification algorithms
* Comparing and contrasting classification algorithms

### Step 1: Importing the Dataset
The following code cell imports the necessary libraries. The dataset itself is loaded from the repository as a Pandas DataFrame in Step 2.

In [1]:
import pandas as pd
import numpy as np
import os

import joblib

from bioinformatics import FASTADataset as fads
from bioinformatics import KmerVectors as kvec


#from keras import utils
#from keras import utils as np_utils
from keras.utils import np_utils
from sklearn.model_selection import train_test_split, GridSearchCV
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv1D, MaxPooling1D, MaxPooling2D, Conv2D, LSTM, GRU, Bidirectional
from keras import regularizers
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.wrappers.scikit_learn import KerasClassifier
import keras

### Step 2: Preprocessing the Dataset
The data is not in a usable form; as a result, we will need to process it before using it to train our algorithms.
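
The `bioinformatics` helpers used below (`FASTADataset`, `KmerVectors`) are project-local, and their source is not shown here. As a rough idea of what loading and filtering a FASTA file might involve, here is a minimal sketch in plain Python. The function names `read_fasta` and `filter_records`, and the exact filtering rules (minimum length, allowed alphabet, truncation to a fixed length), are assumptions for illustration, not the actual implementation.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def filter_records(records, length_min=1000, alphabet="AGCT"):
    """Keep sequences with at least length_min bases drawn only from
    alphabet, truncated to length_min (hypothetical rules)."""
    allowed = set(alphabet)
    kept = []
    for header, seq in records:
        if len(seq) < length_min or not set(seq) <= allowed:
            continue
        kept.append((header, seq[:length_min]))
    return kept

# Tiny in-memory example with made-up records and a short length_min
demo = io.StringIO(">seq1\nATGGCT\nCCGATC\n>seq2\nATGNNN\nNNNNNN\n>seq3\nATG\n")
kept = filter_records(read_fasta(demo), length_min=6)
print(kept)
```

In this toy input, `seq2` is dropped for containing `N` and `seq3` for being too short, which is roughly the kind of bookkeeping the `skip_count_alphabet` and `skip_count_minlength` counters in the output further down suggest.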

In [2]:
#DATA_PATH = "data/bioinformatics/kaggle/DNA-Sequence-Classification-CNN-plus-GRU/"
#!ls -al $DATA_PATH
NGDC_PATH = "../data/bioinformatics/ngdc/"
IDOG_PATH = NGDC_PATH + "idog/"
#!ls -al $IDOG_PATH

dhole_cds_dataset_file = IDOG_PATH + "dhole.cds.fa"
wolfe_cds_dataset_file = IDOG_PATH + "wolf.cds.fa"
#!ls $dhole_cds_dataset_file
#!ls $wolfe_cds_dataset_file

In [3]:
dhole_cds_fads = fads.FASTADataset('dhole', dhole_cds_dataset_file)
wolfe_cds_fads = fads.FASTADataset('wolfe', wolfe_cds_dataset_file)

In [4]:
# not really using kmers here, just the object for convenience
kv_fasta = kvec.KmerVectors(['A','G','C','T'], 6, fastadatasets=[dhole_cds_fads,wolfe_cds_fads], verbose=True)
print(f'dictionary size: [{len(kv_fasta.dict)}]')
print(kv_fasta.labels)

KmerVectors Object -
alphabet [['A', 'G', 'C', 'T']]
dict: [['AAAAAA', 'AAAAAG', 'AAAAAC', 'AAAAAT']]...[['TTTTTA', 'TTTTTG', 'TTTTTC', 'TTTTTT']]
Labels: [{'dhole': 1, 'wolfe': 2}]
[dhole]
[../data/bioinformatics/ngdc/idog/dhole.cds.fa]
[wolfe]
[../data/bioinformatics/ngdc/idog/wolf.cds.fa]
dictionary size: [4096]
{'dhole': 1, 'wolfe': 2}
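
The dictionary size of 4096 reported above is 4^6: every possible 6-mer over a four-letter alphabet. The sketch below reproduces that enumeration with `itertools.product`. It only illustrates the idea; `KmerVectors` may build its dictionary differently.

```python
from itertools import product

alphabet = ['A', 'G', 'C', 'T']
k = 6

# Every length-k string over the alphabet, in the alphabet's own order
kmers = [''.join(p) for p in product(alphabet, repeat=k)]

print(len(kmers))   # 4 ** 6
print(kmers[:4])
print(kmers[-4:])
```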


In [5]:
df_fasta = kv_fasta.seqFromFASTA(base_count_max=4, length_min=1000, dataset_limit=10000, verbose=True)
df=pd.DataFrame(data=df_fasta)
df
#display(df)
#display(df.shape)

seqFromFASTA
fasta dataset: [dhole]
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 capped at [10000]
-
Total:                [17107]
Using :               [10001]
skip_count_minlength: [69]
skip_count_alphabet: [7038]
fasta dataset: [wolfe]
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 capped at [10000]
-
Total:                [19212]
Using :               [10001]
skip_count_minlength: [304]
skip_count_alphabet: [8908]


Unnamed: 0,v1,v2
0,dhole,ATGGCTCCGATCACTACCAGCCGGGAAGAATTTGATGAAATCCCCA...
1,dhole,ATGGCGGACGGGCCGCGGGCCATGGCTGCGCAGCCGGGGCGCCGGG...
2,dhole,ATGGTTTTTATCCCTCAGAAGACACGGACCCAGAAAGAAGATGGTT...
3,dhole,ATGCTGTCGAGACCAAAGCCAGGAGAGTCAGAGGTGGATTTGCTGC...
4,dhole,ATGAGCGTCCCGGCCTTCATCGACATCAGCGAGGAAGATCAGGCTG...
...,...,...
19995,wolfe,ATGCTGTGGAAGAGGTCAGGGAGCGGCAGGCTCCACATGCAGTATG...
19996,wolfe,ATGAGCTCCTCTAAGGCTATACCTCCCTTTGAATTTGCTTTCAAAG...
19997,wolfe,ATGGGAAGGTGGTGTGGACAGCTGTCCAGAGGGTCCTCTCAGCCCG...
19998,wolfe,ATGTTGTCTGCCTTTGAAGGTTTTGTGATTCTTAAGGACTTCTCCT...


In [6]:
# map species labels to integer classes; assigning back avoids an in-place replace on a column view
df['v1'] = df['v1'].map({'dhole': 0, 'wolfe': 1})
df

Unnamed: 0,v1,v2
0,0,ATGGCTCCGATCACTACCAGCCGGGAAGAATTTGATGAAATCCCCA...
1,0,ATGGCGGACGGGCCGCGGGCCATGGCTGCGCAGCCGGGGCGCCGGG...
2,0,ATGGTTTTTATCCCTCAGAAGACACGGACCCAGAAAGAAGATGGTT...
3,0,ATGCTGTCGAGACCAAAGCCAGGAGAGTCAGAGGTGGATTTGCTGC...
4,0,ATGAGCGTCCCGGCCTTCATCGACATCAGCGAGGAAGATCAGGCTG...
...,...,...
19995,1,ATGCTGTGGAAGAGGTCAGGGAGCGGCAGGCTCCACATGCAGTATG...
19996,1,ATGAGCTCCTCTAAGGCTATACCTCCCTTTGAATTTGCTTTCAAAG...
19997,1,ATGGGAAGGTGGTGTGGACAGCTGTCCAGAGGGTCCTCTCAGCCCG...
19998,1,ATGTTGTCTGCCTTTGAAGGTTTTGTGATTCTTAAGGACTTCTCCT...


In [7]:
#df = pd.read_csv(DATA_PATH + 'NonPromoterSequence.txt', sep = '>', )
#df.dropna(subset=['Unnamed: 0'], how='all', inplace=True)
#df.reset_index(inplace = True)
#df.drop(['EP 1 (+) mt:CoI_1; range -400 to -100.', 'index'], axis = 1, inplace=True) #data cleaning after error found
#df.rename(columns={'Unnamed: 0': "sequence"}, inplace = True)
#df['label'] = 0
#display(df)
#display(df.shape)

In [8]:
#df2 = pd.read_csv(DATA_PATH + 'PromoterSequence.txt', sep = '>', )
#df2.dropna(subset=['Unnamed: 0'], how='all', inplace=True)
#df2.reset_index(inplace = True)
#df2.drop(['EP 1 (+) mt:CoI_1; range -100 to 200.', 'index'], axis = 1, inplace=True)
#df2.rename(columns={'Unnamed: 0': "sequence"}, inplace = True)
#df2['label'] = 1

#display(df2)
#display(df2.shape)

In [9]:
#df = pd.concat([df, df2], axis = 0 )
df.shape

(20000, 2)

In [10]:
for seq in df['v2']:
    if 'N' in seq:
        display(df.loc[df['v2'] == seq])

Unnamed: 0,v1,v2
12410,1,ATGGGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...


In [11]:
df.drop([12410], inplace = True)
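
Dropping a hard-coded index works for this particular run, but `encode_seq` in In [14] raises a `KeyError` on any symbol outside A, C, G, T, not only `N`. A more general check might look like the sketch below. `find_invalid` is a hypothetical helper, and the example sequences are made up.

```python
VALID_BASES = set("ACGT")

def find_invalid(sequences):
    """Return the positions of sequences containing any non-ACGT symbol."""
    return [i for i, seq in enumerate(sequences) if not set(seq) <= VALID_BASES]

example = ["ATGGCT", "ATGGGGNNNN", "ATGRCT", "ATGCTG"]
bad = find_invalid(example)
print(bad)
```

With the DataFrame above, the same idea could be applied as `df.drop(df.index[find_invalid(df['v2'])])`, which removes every offending row regardless of its index label.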

In [12]:
#for seq in df['sequence']:
#    if 'N' in seq:
#        display(df.loc[df['sequence'] == seq])

In [13]:
sequence = list(df.loc[:, 'v2'])
encoded_list = []
sequence

['ATGGCTCCGATCACTACCAGCCGGGAAGAATTTGATGAAATCCCCACAGTGGTGGGGATCTTCAGTGCATTTGGCCTGGTCTTCACAGTCTCTCTATTTGCATGGATCTGCTGTCAGAGAAAATCATCCAAGTCTAACAAGACTCCTCCATATAAGTTTGTGCATGTACTAAAAGGAGTTGATATTTATCCTGAAAACCTAAGTAGCAAGAAGAAGTTCGGAGCAGATGACAAAAATGAAGTAAAGAATAAACCAGCTGTGCCAAAGAATTCATTACATCTTGACCTTGAGAAGAGAGATCTCAATGGCAATTTTCCCAAAACAAACCTCAAAGCTAGTACCCCTTCTGATCTGGAGAATGTGACCCCAAAGCTCTTTTCAGAAGGGGAGAAAGAGGCTGTTTCCCCTGATAGCTTAAAGTCCAGCACTTCCCTTTCTTCAGAAGAGAAGCAGGAGAAGCTGGGAACCCTCTTCTTCTCCTTAGAGTACAACTTTGAGAAGAAAGCGTTTGTGGTAAATATCAAGGAAGCCCGTGGCTTGCCTGCCATGGATGAGCAGTCAATGACCTCTGACCCATACATCAAAATGACGATCCTCCCAGAGAAGAAGCATAAAGTGAAAACCAGAGTTCTGAGAAAGACCTTGGACCCGGCTTTTGATGAGACCTTCACATTCTATGGGATCCTCTACACCCAGATCCAAGAGTTGGCCTTGCACTTCACAATCTTGAGTTTTGACAGGTTTTCAAGAGATGATATCATTGGAGAAGTCCTTATCCCTCTTGCAGGAATTGAATTATCTGATGGAAAAATGTTAATGAACAGAGAGATTATCAAAAGAAATGTTAGGAAGTCTTCAGGACGGGGTGAGTTACTGATCTCTCTCTGCTATCAATCCACTACAAATACTCTCACTGTGGTTGTTTTAAAAGCTCGACACCTGCCGAAATCTGATGTGTCTGGACTCTCAGATCCCTACGTGAAAGTGAACCTGTACCA

In [14]:
def encode_seq(s):
    Encode = {'A':[1,0,0,0],'T':[0,1,0,0],'C':[0,0,1,0],'G':[0,0,0,1]}
    return [Encode[x] for x in s]

for i in sequence:
    x = encode_seq(i)
    encoded_list.append(x)

X = np.array(encoded_list)
X.shape

(19999, 1000, 4)
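
`np.array(encoded_list)` yields a regular `(n, 1000, 4)` array only because every sequence has the same length. The sketch below shows one-hot encoding with an explicit fixed length, padding short sequences with all-zero rows. The `one_hot` helper and the zero-padding choice are assumptions for illustration; the cell above relies on the sequences already being equal in length.

```python
ENCODE = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0], 'C': [0, 0, 1, 0], 'G': [0, 0, 0, 1]}
PAD = [0, 0, 0, 0]

def one_hot(seq, length):
    """One-hot encode seq, truncating or zero-padding to exactly `length` rows."""
    rows = [list(ENCODE[base]) for base in seq[:length]]
    # range() is empty when the sequence is already long enough
    rows.extend(list(PAD) for _ in range(length - len(rows)))
    return rows

encoded = [one_hot(s, 5) for s in ["ATCG", "GGGGGGG"]]
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
print(encoded[0])
```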

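# This reshape appears to be left over from the original promoter notebook and is not run here:
# X has shape (19999, 1000, 4), which cannot be reshaped to (n, 301, 4, 1), and Conv1D expects 3D input.
#X = X.reshape(X.shape[0], 301, 4, 1)
#X.shape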

In [15]:
y = df['v1']
y.shape

(19999,)

In [16]:
X.shape

(19999, 1000, 4)

### Step 3: Training and Testing Neural Networks
Now that we have preprocessed the data, we can build our training and testing datasets and start trying different convolutional neural network architectures. It is relatively easy to test multiple models with a grid search, so the plan is to compare and contrast performance over many hyperparameter values using GridSearchCV.



In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [18]:
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)

In [19]:
params = {
    'first_node': [128, 64],
    'second_node': [32, 64],
    'alpha': [0.001, 0.01],
    'first_filter': [9, 16, 32], 
    'dropout': [0.1, 0.2, 0.5]
}
#used for GridSearchCV
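
The `params` dictionary is defined here but is not passed to `GridSearchCV` in the cells shown. To get a sense of the search cost, the sketch below expands the grid by hand. It only enumerates the combinations and does not train anything.

```python
from itertools import product

params = {
    'first_node': [128, 64],
    'second_node': [32, 64],
    'alpha': [0.001, 0.01],
    'first_filter': [9, 16, 32],
    'dropout': [0.1, 0.2, 0.5],
}

# Cartesian product of all value lists, one dict per candidate configuration
keys = list(params)
grid = [dict(zip(keys, values)) for values in product(*params.values())]

print(len(grid))
print(grid[0])
```

With k-fold cross-validation, each of these configurations is trained k times, so a full search multiplies the training cost accordingly.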

In [20]:
gru_model = Sequential()

gru_model.add(Conv1D(filters = 27, kernel_size = (4), activation = 'relu', input_shape = (1000, 4)))
gru_model.add(MaxPooling1D(pool_size= (3)))
gru_model.add(Dropout(0.2))

gru_model.add(Conv1D(filters = 14, kernel_size = (2), activation = 'relu', padding = 'same'))
#cnn_model.add(MaxPooling1D(pool_size= (1)))
#cnn_model.add(Dropout(0.2))



gru_model.add(Bidirectional(GRU(128, activation = 'relu')))
gru_model.add(Dropout(0.2))
gru_model.add(Dense(128, activation = 'relu'))
gru_model.add(Dense(64, activation = 'relu'))
gru_model.add(Dense(64, activation = 'relu'))
gru_model.add(Dense(16, activation = 'relu', kernel_regularizer = regularizers.l2(0.01)))
gru_model.add(Dense(2, activation = 'sigmoid'))

gru_model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

early_stop = keras.callbacks.EarlyStopping(monitor = 'val_accuracy', min_delta = 0.005, patience=8, 
                                           restore_best_weights=True )
# pass the callback to fit(); otherwise early stopping never takes effect
history = gru_model.fit(X_train, y_train, batch_size = 128, validation_data=(X_test, y_test), 
                        epochs=115, callbacks=[early_stop])

2022-09-29 16:32:36.685692: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Epoch 1/115
Epoch 2/115
Epoch 3/115
Epoch 4/115
Epoch 5/115
Epoch 6/115
Epoch 7/115
Epoch 8/115
Epoch 9/115
Epoch 10/115
Epoch 11/115
Epoch 12/115

KeyboardInterrupt: 
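
To see what sequence length reaches the GRU layer, the arithmetic for the convolution and pooling layers can be traced by hand. The sketch below uses the standard output-length formulas for `'valid'` and `'same'` padding with stride 1 for the convolutions, and the Keras default of a stride equal to `pool_size` for pooling. The helper names are hypothetical.

```python
def conv1d_len(n, kernel_size, padding='valid'):
    """Output length of a stride-1 1D convolution."""
    return n if padding == 'same' else n - kernel_size + 1

def maxpool1d_len(n, pool_size):
    """Output length of 1D max pooling with stride == pool_size and 'valid' padding."""
    return (n - pool_size) // pool_size + 1

n = 1000
after_conv1 = conv1d_len(n, 4)                    # Conv1D(filters=27, kernel_size=4)
after_pool = maxpool1d_len(after_conv1, 3)        # MaxPooling1D(pool_size=3)
after_conv2 = conv1d_len(after_pool, 2, 'same')   # Conv1D(filters=14, kernel_size=2, padding='same')
print(after_conv1, after_pool, after_conv2)
```

Under these assumptions the bidirectional GRU processes 332 time steps of 14 features each, and with the default concatenation of both directions its output has 2 × 128 = 256 units.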

In [None]:
pred = gru_model.predict(X_test)  # call predict on the held-out set
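
Because the final layer has two output units, `predict` returns one row of two scores per sequence. A minimal sketch of turning such scores into class labels and an accuracy figure follows. The numbers are made up for illustration and are not results from this model.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical scores in the shape predict() would return: (n_samples, 2)
scores = [[0.91, 0.08], [0.30, 0.72], [0.55, 0.40], [0.10, 0.95]]
# One-hot ground truth, in the layout produced by to_categorical
truth = [[1, 0], [0, 1], [0, 1], [0, 1]]

pred_labels = [argmax(r) for r in scores]
true_labels = [argmax(r) for r in truth]
accuracy = sum(p == t for p, t in zip(pred_labels, true_labels)) / len(true_labels)
print(pred_labels, accuracy)
```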

In [None]:
# NOTE: DATA_PATH is commented out in In [2]; define it before running this cell
df = pd.read_csv(DATA_PATH + 'test_predictions.csv')  # loading full test set
df.head()