# Model evaluation for the Dengue Serotype Classifier

This notebook tests several classification algorithms that predict the Dengue serotype from genomic sequence fragments.

## Preprocessing labeled data

A second approach processes one _fasta_ file per serotype.

This time, we use the Biopython module to import data from files in fasta format.

In [1]:
from Bio import SeqIO
import pandas as pd

In [2]:
#fasta_files = [['../data/datos_entrenamiento/D' + str(i) + '_protE_aligned.fasta', str(i)] for i in range(1,5)]
fasta_files = [['../data/datos_prueba_ProtE_desdegenomas/dengue' + str(i) + '_E_aln.fasta', str(i)] for i in range(1,5)]

In [3]:
fasta_files

[['../data/datos_prueba_ProtE_desdegenomas/dengue1_E_aln.fasta', '1'],
 ['../data/datos_prueba_ProtE_desdegenomas/dengue2_E_aln.fasta', '2'],
 ['../data/datos_prueba_ProtE_desdegenomas/dengue3_E_aln.fasta', '3'],
 ['../data/datos_prueba_ProtE_desdegenomas/dengue4_E_aln.fasta', '4']]

In [4]:
def load_fasta_files(files=None):
    '''
    Load sequences from fasta files
    
    -----
    param:
    files list containing [path, label]
    
    -------
    returns:
    list containing [id, sequence, label]
    '''
    res = []
    # None instead of [] avoids a mutable default argument
    for path, label in (files or []):
        # SeqIO.parse accepts a file path directly
        for s in SeqIO.parse(path, 'fasta'):
            res.append([s.id, str(s.seq), label])
    
    return res

Let's load the data into a pandas.DataFrame.

In [5]:
data = load_fasta_files(fasta_files)

df = pd.DataFrame(data, columns=['id', 'sequence', 'label'])

In [6]:
df.shape

(4998, 3)

In [7]:
len(df.iloc[0]['sequence'])

1219
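Before extracting features it can help to confirm that the alignment is rectangular (every sequence has the same length) and to see how many sequences each serotype contributes. The following is a minimal sketch on a small made-up DataFrame; `toy_df` and its records are illustrative and not taken from the data set. The same two calls could be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the DataFrame built above (hypothetical records).
toy_df = pd.DataFrame(
    [
        ["seq_a", "atg-c", "1"],
        ["seq_b", "atgac", "1"],
        ["seq_c", "ttg-c", "2"],
    ],
    columns=["id", "sequence", "label"],
)

# Aligned sequences should all have the same length.
lengths = toy_df["sequence"].str.len()
is_rectangular = lengths.nunique() == 1

# Number of sequences per serotype label.
label_counts = toy_df["label"].value_counts()

print(is_rectangular)
print(label_counts)
```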

## Feature extraction

In [8]:
import sys

sys.path.append('../utils/')

import util

In [9]:
df['sequence'].shape

(4998,)

In [10]:
result = {}

# Replace every character not included in [actg-] with '-'
data_sequences = df['sequence'].replace(to_replace=r'(?![actg\-]).', value='-', regex=True)

# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
for idx, sequence in data_sequences.items():
    split_string = util.insert_separator(sequence).split(',')

    result[idx] = split_string
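The helper `util.insert_separator` is not shown in this notebook. Assuming it simply places a comma between consecutive characters, so that `.split(',')` yields one character per alignment position, the same per-position table can be sketched without it. The sequences below are made up for illustration.

```python
import pandas as pd

# Hypothetical aligned fragments; 'n' stands for an ambiguous base.
toy_sequences = pd.Series(["atg-c", "angtc", "ttgac"])

# Replace every character outside [actg-] with '-', as in the cell above.
cleaned = toy_sequences.replace(to_replace=r"(?![actg\-]).", value="-", regex=True)

# One column per alignment position, one row per sequence.
toy_positions = pd.DataFrame([list(seq) for seq in cleaned], index=cleaned.index)

print(toy_positions)
```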

In [11]:
df_sequences = pd.DataFrame.from_dict(result, orient='index')

In [12]:
assert len(result) == df_sequences.shape[0]  # 4998 rows

In [13]:
df_sequences.shape

(4998, 1219)

In [14]:
from sklearn.preprocessing import OneHotEncoder

In [15]:
enc = OneHotEncoder()  # categories are inferred per position, e.g. '-', 'a', 'c', 'g', 't'

df_sequences_dummies = enc.fit_transform(df_sequences)

In [16]:
df_sequences_dummies.shape

(4998, 5131)

We can take a look at the feature names.

In [17]:
enc.get_feature_names()  # removed in scikit-learn 1.2; newer versions use enc.get_feature_names_out()

array(['x0_-', 'x0_a', 'x0_c', ..., 'x1218_c', 'x1218_g', 'x1218_t'],
      dtype=object)
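The encoder creates one output column per symbol actually observed at each position, which is consistent with 1219 positions yielding 5131 columns rather than 1219 × 5 = 6095. A minimal sketch on a made-up table (the array `toy_positions` is hypothetical) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical table with one column per alignment position.
toy_positions = np.array(
    [
        ["a", "t", "-"],
        ["a", "c", "g"],
        ["c", "t", "g"],
    ]
)

toy_enc = OneHotEncoder()
toy_encoded = toy_enc.fit_transform(toy_positions)  # SciPy sparse matrix

# Each position contributes one output column per symbol observed there.
n_columns = sum(len(cats) for cats in toy_enc.categories_)

print(toy_encoded.shape)
print(toy_enc.categories_)
```

Here every position shows only two distinct symbols, so three positions produce six columns instead of fifteen.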

## Feature selection

Let's remove features with zero variance using scikit-learn.

In [18]:
from sklearn.feature_selection import VarianceThreshold

In [19]:
sel = VarianceThreshold()

In [20]:
sequences_reduced = sel.fit_transform(df_sequences_dummies)

In [21]:
sequences_reduced.shape

(4998, 5131)

In [22]:
sel.get_support().shape

(5131,)

Removing zero-variance features yields no dimensionality reduction here: the matrix has 5131 columns both before and after the selection, so every one-hot feature varies across the samples.
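The number of removed features can be read directly from the selector, either by comparing shapes or by counting the `False` entries of `get_support()`. A minimal sketch on a made-up matrix (`toy_X` is hypothetical and contains one constant column):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical one-hot matrix: the last column is constant.
toy_X = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 0, 1],
        [0, 1, 1],
    ]
)

toy_sel = VarianceThreshold()  # default threshold=0.0 drops constant columns
toy_reduced = toy_sel.fit_transform(toy_X)

# Boolean mask of the columns that were kept.
kept_mask = toy_sel.get_support()
n_removed = int((~kept_mask).sum())

print(toy_X.shape, "->", toy_reduced.shape)
print("removed:", n_removed)
```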

## Model training

Now that we have a feature matrix, we can train a model.

But first, we split the data set into training and test sets.

In [23]:
y = df['label']

In [24]:
y.shape

(4998,)

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
X_train, X_test, y_train, y_test = train_test_split(sequences_reduced, y, test_size=0.4, random_state=0)

In [27]:
X_train.shape, y_train.shape

((2998, 5131), (2998,))

In [28]:
X_test.shape, y_test.shape

((2000, 5131), (2000,))
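The test-set confusion matrices reported later in this notebook show unequal class sizes (for example 823 sequences of serotype 1 against 113 of serotype 4). As a hedged alternative, not what the cells above do, `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. A sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced labels: 80 samples of class '1', 20 of class '4'.
toy_y = np.array(["1"] * 80 + ["4"] * 20)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=0, stratify=toy_y
)

# With stratification the 80/20 ratio is preserved in both parts.
test_counts = dict(zip(*np.unique(y_te, return_counts=True)))
print(test_counts)
```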

We try different algorithms for multiclass classification:

* BernoulliNB
* Decision Trees
* SVM
* Neural network (multilayer perceptron)
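The sections below fit and score each model on the single train/test split. As a complementary sketch, under the assumption that cross-validation is an acceptable way to compare them, the same four model families can be evaluated in one loop. The data here is synthetic (`toy_X`, `toy_y`) and the hyperparameters are illustrative, not the notebook's.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary feature matrix: 4 informative one-hot columns plus noise.
rng = np.random.default_rng(0)
toy_y = np.repeat(np.arange(4), 30)  # 30 samples per class
informative = np.eye(4, dtype=int)[toy_y]
noise = rng.integers(0, 2, size=(len(toy_y), 6))
toy_X = np.hstack([informative, noise])

candidates = {
    "BernoulliNB": BernoulliNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "LinearSVC": LinearSVC(),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
}

# Mean 5-fold cross-validated accuracy per candidate.
mean_scores = {
    name: cross_val_score(model, toy_X, toy_y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```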

### Naive Bayes

In [29]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.naive_bayes import BernoulliNB

In [30]:
clf = BernoulliNB()

In [31]:
clf.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [32]:
y_pred = clf.predict(X_test)

In [33]:
accuracy_score(y_test, y_pred)

1.0

In [34]:
confusion_matrix(y_test, y_pred)

array([[823,   0,   0,   0],
       [  0, 664,   0,   0],
       [  0,   0, 400,   0],
       [  0,   0,   0, 113]])
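`classification_report` is imported above but not used. A minimal sketch of how it complements the confusion matrix with per-class precision and recall, on made-up labels (`toy_true` and `toy_pred` are hypothetical):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted serotype labels.
toy_true = ["1", "1", "2", "2", "3", "4", "4", "4"]
toy_pred = ["1", "1", "2", "3", "3", "4", "4", "4"]

# Rows are true labels, columns are predicted labels, in sorted label order.
toy_cm = confusion_matrix(toy_true, toy_pred)

# Per-class precision, recall and F1 as a dictionary.
toy_report = classification_report(toy_true, toy_pred, output_dict=True)

print(toy_cm)
print(classification_report(toy_true, toy_pred))
```

On the notebook's data the equivalent call would be `classification_report(y_test, y_pred)`.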

### Decision trees

In [35]:
from sklearn.tree import DecisionTreeClassifier

In [36]:
clf = DecisionTreeClassifier()

In [37]:
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [38]:
y_pred = clf.predict(X_test)

In [39]:
accuracy_score(y_test, y_pred)

0.999

In [40]:
confusion_matrix(y_test, y_pred)

array([[822,   0,   1,   0],
       [  0, 663,   1,   0],
       [  0,   0, 400,   0],
       [  0,   0,   0, 113]])

### SVM

In [41]:
from sklearn.svm import LinearSVC

In [42]:
clf = LinearSVC(multi_class="crammer_singer")

In [43]:
clf.fit(X_train, y_train)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='crammer_singer', penalty='l2', random_state=None,
          tol=0.0001, verbose=0)

In [44]:
y_pred = clf.predict(X_test)

In [45]:
accuracy_score(y_test, y_pred)

1.0

In [46]:
confusion_matrix(y_test, y_pred)

array([[823,   0,   0,   0],
       [  0, 664,   0,   0],
       [  0,   0, 400,   0],
       [  0,   0,   0, 113]])

### Neural network

In [47]:
from sklearn.neural_network import MLPClassifier

In [48]:
clf = MLPClassifier()

In [49]:
clf.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)

In [50]:
y_pred = clf.predict(X_test)

In [51]:
accuracy_score(y_test, y_pred)

1.0

In [52]:
confusion_matrix(y_test, y_pred)

array([[823,   0,   0,   0],
       [  0, 664,   0,   0],
       [  0,   0, 400,   0],
       [  0,   0,   0, 113]])

## Conclusion

The classification task for dengue serotypes reaches an accuracy score of 1.0 on the test set with the BernoulliNB, SVM and MLP algorithms, and 0.999 with the decision tree.
Feature engineering consists of creating dummy variables and removing zero-variance features.
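The feature engineering steps and a classifier can also be chained in a single scikit-learn pipeline, so that the encoder and the selector are fitted on training data only. This is a sketch of one possible arrangement, not the notebook's code. The toy table and labels are made up, and `handle_unknown="ignore"` is an assumption about how symbols unseen during fitting should be treated.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical per-position table (rows: sequences, columns: positions).
toy_positions = np.array(
    [
        ["a", "t", "g"],
        ["a", "t", "c"],
        ["c", "t", "g"],
        ["c", "t", "c"],
    ]
)
toy_labels = np.array(["1", "1", "2", "2"])

# handle_unknown="ignore" encodes symbols unseen during fitting as all zeros.
toy_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    VarianceThreshold(),
    BernoulliNB(),
)
toy_model.fit(toy_positions, toy_labels)

toy_predictions = toy_model.predict(np.array([["a", "t", "g"], ["c", "t", "c"]]))
print(toy_predictions)
```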