# Model evaluation for the Dengue Serotype Classifier on the E protein

This notebook compares several classification algorithms for predicting Dengue serotypes from genomic sequence fragments of the **E protein**.

## Preprocessing labeled data

A second approach involves processing the sequence files for each serotype.

We import the biopython `SeqIO` module, which reads files in _fasta_ format. The trimmed files used below hold one sequence per line, so they are read as plain text instead.

In [5]:
from Bio import SeqIO
import pandas as pd

In [6]:
#fasta_files = [['../data/datos_entrenamiento/D' + str(i) + '_protE_aligned.fasta', str(i)] for i in range(1,5)]
#fasta_files = [['../data/datos_prueba_ProtE_desdegenomas/dengue' + str(i) + '_E_aln.fasta', str(i)] for i in range(1,5)]
sequences_files = [['../data/Trimmed_data/Trimmed_out[937:2421]/trimmed_protein_E_den' + str(i), str(i)] for i in range(1,5)]

In [7]:
sequences_files

[['../data/Trimmed_data/Trimmed_out[937:2421]/trimmed_protein_E_den1', '1'],
 ['../data/Trimmed_data/Trimmed_out[937:2421]/trimmed_protein_E_den2', '2'],
 ['../data/Trimmed_data/Trimmed_out[937:2421]/trimmed_protein_E_den3', '3'],
 ['../data/Trimmed_data/Trimmed_out[937:2421]/trimmed_protein_E_den4', '4']]

In [8]:
def load_sequence_files(files=None):
    '''
    Load sequences from files containing genomic sequences, one per line.

    -----
    param:
    files: list of [path, label] pairs

    -------
    returns:
    list of [sequence, label] pairs
    '''
    res = []
    for path, label in files or []:
        # Each line is one sequence; the trailing newline is kept as read.
        with open(path) as handle:
            for line in handle:
                res.append([line, label])

    return res

Let's load the data into a `pandas.DataFrame`.

In [9]:
data = load_sequence_files(sequences_files)

df = pd.DataFrame(data, columns=['sequence', 'label'])

In [10]:
df.shape

(2632, 2)

Do the sequences have the **same length**?

In [13]:
df['lenght'] = [len(x) for x in df['sequence']]

In [14]:
df['lenght'].describe()

count    2632.0
mean     1485.0
std         0.0
min      1485.0
25%      1485.0
50%      1485.0
75%      1485.0
max      1485.0
Name: lenght, dtype: float64
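The constant length of 1485 is one more than the 1484 positions implied by the `[937:2421]` slice in the directory name. A likely explanation, consistent with the `x1484_\n` feature name that appears further down, is that each line is read with its trailing newline. A minimal sketch with made-up sequences (not the notebook's data):

```python
import io

# Hypothetical stand-in for one trimmed file: two toy sequences, one per line.
fake_file = io.StringIO("acgt\nac-t\n")

lines = fake_file.readlines()
lengths = [len(line) for line in lines]                 # newline is counted
stripped = [len(line.rstrip("\n")) for line in lines]   # newline removed

print(lengths, stripped)

# Number of positions covered by the [937:2421] slice.
slice_length = len(range(937, 2421))
print(slice_length)
```

If the newline is not wanted as a feature, stripping it at load time would remove one column from every later matrix shape.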

## Feature extraction

In [15]:
import sys

sys.path.append('../utils/')

import util

In [16]:
df['sequence'].shape

(2632,)

In [17]:
result = {}

# Replace every character not included in [actg-] with '-'
data_sequences = df['sequence'].replace(to_replace=r'(?![actg\-]).', value='-', regex=True)

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for idx, seq in data_sequences.items():
    result[idx] = util.insert_separator(seq).split(',')

In [18]:
df_sequences = pd.DataFrame.from_dict(result, orient='index')

In [19]:
assert len(result) == df_sequences.shape[0]  # 2632 rows

In [20]:
df_sequences.shape

(2632, 1485)
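`util.insert_separator` is not shown in this notebook. Since its output is split on `,` and yields one column per character (1485 columns for 1485-character strings), it presumably places a comma between consecutive characters. A minimal sketch under that assumption, using a hypothetical helper `split_into_positions` and a toy sequence:

```python
import re

def split_into_positions(seq):
    """Hypothetical stand-in for util.insert_separator(seq).split(',')."""
    # Same pattern as the notebook: any character outside [actg-] becomes '-'.
    # '.' does not match a newline, so a trailing '\n' is left untouched.
    cleaned = re.sub(r'(?![actg\-]).', '-', seq)
    return list(cleaned)

positions = split_into_positions("acgtnR-a\n")
print(positions)
```

Note that the pattern is case-sensitive, so uppercase bases would also be replaced by `-`.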

In [21]:
from sklearn.preprocessing import OneHotEncoder

In [22]:
enc = OneHotEncoder() #(['-', 'a', 'c', 't', 'g'])

df_sequences_dummies = enc.fit_transform(df_sequences)

In [23]:
df_sequences_dummies.shape

(2632, 4941)

We can take a look at the feature names.

In [24]:
# On scikit-learn 1.0+ use enc.get_feature_names_out(); get_feature_names() was removed in 1.2
enc.get_feature_names()

array(['x0_-', 'x0_t', 'x1_-', ..., 'x1483_g', 'x1483_t', 'x1484_\n'],
      dtype=object)
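The encoded matrix has 4941 columns rather than 1485 × 5 = 7425 because, with default settings, `OneHotEncoder` creates a column only for the characters actually observed at each position. The sketch below mimics that behaviour in plain Python on a made-up alignment; it is an illustration of the idea, not scikit-learn's implementation:

```python
# Toy alignment: 3 sequences, 4 positions (made-up data).
rows = [list("acgt"), list("acga"), list("a-gt")]

# Observed characters per position, sorted to mirror the order of the
# feature names shown above.
categories = [sorted(set(col)) for col in zip(*rows)]

# Feature names in the 'x<position>_<category>' style.
feature_names = [f"x{i}_{c}" for i, cats in enumerate(categories) for c in cats]

# One indicator per (position, category) pair.
encoded = [
    [int(row[i] == c) for i, cats in enumerate(categories) for c in cats]
    for row in rows
]

print(feature_names)
print(encoded)
```

Four positions produce only six columns here, because two positions show a single character across all rows.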

## Feature selection

Let's remove features with zero variance using scikit-learn.

In [25]:
from sklearn.feature_selection import VarianceThreshold

In [26]:
sel = VarianceThreshold()

In [27]:
sequences_reduced = sel.fit_transform(df_sequences_dummies)

In [28]:
sequences_reduced.shape

(2632, 4792)

In [29]:
sel.get_support().shape

(4941,)

Removing zero-variance features yields only a modest dimensionality reduction: 149 of the 4941 one-hot features (about 3%) have zero variance, leaving 4792.
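With its default threshold of 0, `VarianceThreshold` drops only columns that are constant across all rows. In a one-hot matrix built from observed characters, a constant column must be 1 everywhere, which means the position shows the same character in every sequence. A plain-Python sketch of the same filter on a made-up matrix, where `support` plays the role of `sel.get_support()`:

```python
from statistics import pvariance

# Toy one-hot matrix (made-up): columns 0 and 3 are constant.
X = [
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 1],
]

# Keep a column only if its population variance exceeds the threshold (0).
support = [pvariance(col) > 0 for col in zip(*X)]
X_reduced = [[v for v, keep in zip(row, support) if keep] for row in X]

print(support)
print(X_reduced)
```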

## Model training

Now that we have a feature matrix, we can train a model.

But first, we split the data set into training and test sets.

In [30]:
y = df['label']

In [31]:
y.shape

(2632,)

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(sequences_reduced, y, test_size=0.4, random_state=0)

In [34]:
X_train.shape, y_train.shape

((1579, 4792), (1579,))

In [35]:
X_test.shape, y_test.shape

((1053, 4792), (1053,))
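The split sizes follow from `test_size=0.4` applied to 2632 rows. A quick check of the arithmetic, assuming the test count is rounded up and the remainder goes to training (an assumption that matches the shapes above):

```python
import math

n_samples = 2632
test_size = 0.4

# Assumed rounding: test count rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```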

We try different algorithms for multiclass classification:

* Naive Bayes (BernoulliNB)
* Decision trees
* Linear SVM
* Neural network (multilayer perceptron)

### Naive Bayes

In [36]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.naive_bayes import BernoulliNB

In [37]:
clf = BernoulliNB()

In [38]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [39]:
y_pred = clf.predict(X_test)

In [40]:
accuracy_score(y_test, y_pred)

0.9715099715099715

In [41]:
confusion_matrix(y_test, y_pred)

array([[308,  17,   0,   0],
       [  0, 302,   0,   0],
       [  0,   0, 234,   0],
       [  0,  13,   0, 179]])
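`classification_report` is imported but not called. Per-class precision and recall can also be read directly from the confusion matrix, where rows are true serotypes and columns are predicted serotypes (scikit-learn's convention). A small sketch using the matrix printed above:

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [308, 17, 0, 0],
    [0, 302, 0, 0],
    [0, 0, 234, 0],
    [0, 13, 0, 179],
]
labels = ["1", "2", "3", "4"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

for i, label in enumerate(labels):
    recall = cm[i][i] / sum(cm[i])                    # correct / all true
    precision = cm[i][i] / sum(row[i] for row in cm)  # correct / all predicted
    print(f"serotype {label}: precision={precision:.3f} recall={recall:.3f}")

print(f"accuracy={accuracy:.4f}")
```

All 30 errors in this matrix are serotype 1 or serotype 4 sequences predicted as serotype 2.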

### Decision trees

In [42]:
from sklearn.tree import DecisionTreeClassifier

In [43]:
clf = DecisionTreeClassifier()

In [44]:
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [45]:
y_pred = clf.predict(X_test)

In [46]:
accuracy_score(y_test, y_pred)

0.9715099715099715

In [47]:
confusion_matrix(y_test, y_pred)

array([[308,  17,   0,   0],
       [  0, 302,   0,   0],
       [  0,   0, 234,   0],
       [  0,  13,   0, 179]])

### SVM

In [48]:
from sklearn.svm import LinearSVC

In [49]:
clf = LinearSVC(multi_class="crammer_singer")

In [50]:
clf.fit(X_train, y_train)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='crammer_singer', penalty='l2', random_state=None,
          tol=0.0001, verbose=0)

In [51]:
y_pred = clf.predict(X_test)

In [52]:
accuracy_score(y_test, y_pred)

0.9715099715099715

In [53]:
confusion_matrix(y_test, y_pred)

array([[308,  17,   0,   0],
       [  0, 302,   0,   0],
       [  0,   0, 234,   0],
       [  0,  13,   0, 179]])

### Neural network

In [54]:
from sklearn.neural_network import MLPClassifier

In [55]:
clf = MLPClassifier()

In [56]:
clf.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

In [57]:
y_pred = clf.predict(X_test)

In [58]:
accuracy_score(y_test, y_pred)

0.9715099715099715

In [59]:
confusion_matrix(y_test, y_pred)

array([[308,  17,   0,   0],
       [  0, 302,   0,   0],
       [  0,   0, 234,   0],
       [  0,  13,   0, 179]])

## Conclusion

Models trained exclusively on **E protein** sequences reach a test accuracy of 0.97 on the Dengue serotype classification task for every algorithm tested in this notebook.

## Question