# Exploratory Data Analysis for the Dengue Serotype Classifier

The aim of this notebook is to identify Dengue virus serotypes from genomic sequence fragments.

This notebook is organized in the following sections:

* Data preprocessing
* Exploratory Data Analysis (ToDo)

In [17]:
import pandas as pd
import re

## Preprocessing

To feed clean data into the classifier, we start by extracting the description and the sequence of each record.

__Load the data__

In [2]:
datafile = open('../data/all_den_proteinE.fasta', 'r')

In [14]:
# Read the whole FASTA file into a single string
with open('../data/all_den_proteinE.fasta', 'r') as file:
    data = file.read()

In [23]:
print(data[:10000])

>KR919614.1 Dengue virus 1 isolate GZ2014014 envelope gene, partial cds
ATAGGCAACAGAGACTTCGTTGAAGGACTGTCAGGAGCAACATGGGTGGATGTGGTACTGGAGCATGGAA
GCTGTGTCACCACCATGGCAAAGAATAAACCAACATTGGACATTGAACTCTTGAAGACGGAGGTCACGAA
CCCTGCCGTCTTGCGCAAACTGTGCATTGAAGCTAAAATATCAAACACCACCACCGATTCAAGATGTCCA
ACACAAGGAGAAGCTACACTGGTGGAAGAACAAGACGCGAACTTTGTGTGTCGACGAACATTCGTGGACA
GAGGCTGGGGTAATGGTTGTGGACTATTCGGGAAGGGAAGCTTACTAACGTGTGCTAAGTTTAAGTGTGT
GACAAAACTTGAAGGAAAGATAGTTCAATATGAAAACTTAAAATATTCGGTGATAGTCACTGTCCACACT
GGGGACCAGCACCAGGTAGGAAATGAGACTACAGAACATGGAACAATTGCAACCATAACACCTCAAGCTC
CCACGTCGGAAATACAGCTGACTGACTACGGAGCCCTTACATTGGACTGCTCACCTAGAACAGGGCTGGA
CTTTAATGAGATGGTGCTGTTGACAATGAAAGAAAAATCATGGCTTGTCCACAAACAATGGTTTCTAGAC
TTACCATTACCTTGGACCTCGGGGGCTTCAACATCTCAAGAGACTTGGAACAGACAAGATCTGCTGGTCA
CGTTTAAGACAGCTCATGCAAAGAAGCAGGAAGTAGTCGTACTGGGGTCACAAGAAGGAGCAATGCACAC
TGCGTTGACTGGGGCGACAGAAATCCAGACGTCAGGAACGACGACAATCCTCGCAGGACACCTGAAATGT
AGACTGAAAATGGATAAACTGACTTTAAAAGGGGTGTCATATGTGATGTGCACAGGCTCATTTAAGCTAG
AGAAG

As we can see, each record consists of a description line starting with `>`, followed by the sequence itself, wrapped over several lines.

Now let's extract both the descriptions and the sequences using regular expressions.

__Extract descriptions and sequences__

In [156]:
# Extract descriptions: every line that starts a record with '>'
regex_heads = r'>.+\n'
data_heads = re.findall(regex_heads, data)

# Drop the trailing newline from each description
data_heads = [d[:-1] for d in data_heads]

In [157]:
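# Extract sequences: first remove the description lines
data_sequences = re.sub(regex_heads, '', data)

# Records are separated by a blank line. Mark those separators with a tab,
# join the wrapped lines of each sequence, then turn the tabs back into newlines
data_sequences = re.sub(r'\n\n', '\t', data_sequences)
data_sequences = re.sub(r'\n', '', data_sequences)
data_sequences = re.sub(r'\t', '\n', data_sequences)

# Each line now holds one sequence; note that this pattern only matches
# the bases A, C, T, G and requires a newline after every sequence
regex_sequences = r'[ACTG]+\n'
data_sequences = re.findall(regex_sequences, data_sequences)
data_sequences = [d[:-1] for d in data_sequences]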
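
The regular expressions above rely on blank lines between records and on a newline after the last sequence. As a cross-check, here is a minimal sketch of a line-based parser that does not depend on either. The function name `parse_fasta` and the toy records are hypothetical and not part of the original notebook.

```python
def parse_fasta(text):
    """Split FASTA text into parallel lists of description lines and sequences."""
    heads, seqs = [], []
    chunks = None  # pieces of the sequence currently being collected
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            # A new record starts: store the previous sequence, if any
            if chunks is not None:
                seqs.append(''.join(chunks))
            heads.append(line)
            chunks = []
        elif line and chunks is not None:
            # A wrapped sequence line belonging to the current record
            chunks.append(line)
    # Store the last sequence, which has no following description line
    if chunks is not None:
        seqs.append(''.join(chunks))
    return heads, seqs


# Toy input (made up for illustration), with and without a blank separator line
sample = (
    ">id1 Dengue virus 1 toy record\nATAG\nGCAA\n\n"
    ">id2 Dengue virus 2 toy record\nATGC"
)
heads, seqs = parse_fasta(sample)
print(heads)
print(seqs)
```

Running `parse_fasta(data)` on the real file and comparing the result with `data_heads` and `data_sequences` would be one way to confirm that both approaches agree.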

We must make sure that the descriptions and the sequences have the same number of entries.

In [160]:
len(data_sequences)

10125

In [161]:
len(data_heads)

10125
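
The comparison of the two counts can also be made explicit in code, so that a mismatch stops the notebook instead of going unnoticed. The sketch below is a hypothetical helper (`check_parallel` is not part of the original notebook). It additionally reports records containing characters outside `ACGT`, since the sequence regex above only matches those four bases.

```python
def check_parallel(heads, seqs, alphabet="ACGT"):
    """Raise if the lists differ in length; return indices of sequences
    that contain characters outside the given alphabet."""
    if len(heads) != len(seqs):
        raise ValueError(
            f"{len(heads)} descriptions but {len(seqs)} sequences"
        )
    allowed = set(alphabet)
    return [i for i, s in enumerate(seqs) if set(s) - allowed]


# Toy example (made-up records): the second sequence contains an 'N'
suspicious = check_parallel([">a", ">b"], ["ACGT", "ACNT"])
print(suspicious)
```

In the notebook this would be called as `check_parallel(data_heads, data_sequences)`.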

Now we are ready to load the data into a pandas `DataFrame`.

__Load data into DataFrame__

In [163]:
df = pd.DataFrame({'description': data_heads,
                   'sequence': data_sequences})

In [164]:
df.head()

|   | description | sequence |
|---|---|---|
| 0 | >KR919614.1 Dengue virus 1 isolate GZ2014014 e... | ATAGGCAACAGAGACTTCGTTGAAGGACTGTCAGGAGCAACATGGG... |
| 1 | >KR919613.1 Dengue virus 1 isolate GZ2014013 e... | ATAGGCAACAGAGACTTCGTGGAAGGACTGTCAGGAGCAACTTGGG... |
| 2 | >JN544408.1 Dengue virus 1 isolate SG(EHI)D1/2... | ATGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTTGAAGGACTGT... |
| 3 | >JN544406.1 Dengue virus 1 isolate SG(EHI)D1/1... | ATGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTTGAAGGACTGT... |
| 4 | >JN544405.1 Dengue virus 1 isolate SG(EHI)D1/2... | ATGCGGTGCGTGGGAATAGGCAACAGAGACTTCGTTGAAGGACTGT... |
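
The descriptions shown above contain the text `Dengue virus 1`, which suggests the serotype label could be read directly from the description. The following is a minimal sketch under the assumption that every description follows this `Dengue virus N` pattern, which has only been seen for the rows displayed here and should be verified on the full file. The helper name `serotype_from_description` is hypothetical.

```python
import re

# Assumed pattern: "Dengue virus" followed by a serotype number from 1 to 4
SEROTYPE_RE = re.compile(r'Dengue virus ([1-4])\b')


def serotype_from_description(description):
    """Return the serotype as an int, or None if the pattern is not found."""
    match = SEROTYPE_RE.search(description)
    return int(match.group(1)) if match else None


# Description taken from the first record printed earlier in this notebook
example = ">KR919614.1 Dengue virus 1 isolate GZ2014014 envelope gene, partial cds"
print(serotype_from_description(example))
```

Applied to the DataFrame, this could look like `df['description'].apply(serotype_from_description)`. Counting the `None` results would show how many descriptions do not match the assumed pattern.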
