# Covid-19 Genome Analysis Repository

## GOAL: Download and analyze the coronovirus genome.
https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/

#### Inspiration:
https://blog.floydhub.com/exploring-dna-with-deep-learning/

## Connect to NCBI Database and get a list of genome ids based on a particular search

In [50]:
from Bio import Entrez
from Bio import SeqIO
import os
import numpy as np
import re
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

In [51]:
######################################
# Retrieve NCBI Data Online
######################################
Entrez.email = "daniel.delvin.diaz+ncbi@gmail.com"  # Always tell NCBI who you are
search_term = "SARS-CoV2[orgn] AND complete genome[title]"
handle = Entrez.esearch(db="nucleotide", term=search_term)
search_results = Entrez.read(handle)
genome_ids = search_results['IdList']

## Print out one of the results so we can inspect it.

In [52]:
######################################
# Check genome data
######################################

print(f"Found {len(genome_ids)} genomes.")
# Ref Genome: NC_045512
for g in genome_ids:
    handle = Entrez.efetch(db="nucleotide", id=g, rettype="gb", retmode="text")
    text = handle.read()
    # print(text)
    break
    # Note page limit of 20, so there will only be 20 results here, which should be fine

Found 20 genomes.


## Download all the genomes from our search and store them as .gb files.

In [53]:
######################################
# Download Genomes
######################################

for genome_id in genome_ids:
    record = Entrez.efetch(db="nucleotide", id=genome_id, rettype="gb", retmode="text")
    filename = f'{os.path.abspath(".")}/generated/genBankRecord_{genome_id}.gb'
    print('Writing:{}'.format(filename))
    # Only download first genome for now
    with open(filename, 'w') as f:
        f.write(record.read())
    # TODO: Undo to download more genomes
    break

Writing:/Users/ddiaz/src/corona/generated/genBankRecord_1829138121.gb


## Download ref genome for sars-covid-2

In [54]:
######################################
# Download Ref Genome
######################################
Entrez.email = "daniel.delvin.diaz+ncbi@gmail.com"  # Always tell NCBI who you are
search_term = "NC_045512[locus] AND complete genome[title]"
handle = Entrez.esearch(db="nucleotide", term=search_term)
search_results = Entrez.read(handle)
ref_genome_id = search_results['IdList'][0]
record = Entrez.efetch(db="nucleotide", id=ref_genome_id, rettype="gb", retmode="text")
# print(record.read())
filename = f'{os.path.abspath(".")}/generated/genBankRecord_ref.gb'
with open(filename, 'w') as f:
    content = record.read()
    # print(content)
    f.write(content)
print('File Written:{}'.format(filename))
# Note I have noticed a weird behavior with pycharm + jupyter notebook where you wont see the
# file locally unless you click out of pycharm then back in.

File Written:/Users/ddiaz/src/corona/generated/genBankRecord_ref.gb


## Set up functions to transform the genome into its one-hot encoded form.

In [55]:
######################################
# Setup One Hot Encoding Function
######################################

# One hot encode a DNA sequence string
# non 'acgt' bases (n) are 0000
# returns a L x 4 numpy array

label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','c','g','t','z']))

def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

def one_hot_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    onehot_encoder = OneHotEncoder(sparse=False, dtype=int)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    onehot_encoded = np.delete(onehot_encoded, -1, 1)
    return onehot_encoded

test_sequence = 'AACGCGGTTNN'
test_sequence_hot = one_hot_encoder(string_to_array(test_sequence))
expected_sequence_hot =   [[1, 0, 0, 0],
                           [1, 0, 0, 0],
                           [0, 1, 0, 0],
                           [0, 0, 1, 0],
                           [0, 1, 0, 0],
                           [0, 0, 1, 0],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1],
                           [0, 0, 0, 1],
                           [0, 0, 0, 0],
                           [0, 0, 0, 0]]

# Lets check this function is working as expected
assert np.array_equal(test_sequence_hot, expected_sequence_hot)

## Load Ref Genome

In [56]:
ref_genome_path = f"{os.path.abspath('.')}/generated/genBankRecord_ref.gb"
ref_genome_seq = None
for seq_record in SeqIO.parse("./generated/genBankRecord_ref.gb", "genbank"):
    #print(seq_record.id)
    #print(repr(seq_record.seq))
    print(f"SARS-CoV-2 Genome: {seq_record.seq}")
    ref_genome_seq = seq_record.seq

SARS-CoV-2 Genome: ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGC

In [57]:
r = ref_genome_seq[0:20]
seq_hot = one_hot_encoder(string_to_array(str(ref_genome_seq)))
print(seq_hot)

[[1 0 0]
 [0 0 0]
 [0 0 0]
 ...
 [1 0 0]
 [1 0 0]
 [1 0 0]]


Next step: load multiple genomes and plot them against each other....

Links:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc132
https://www.kaggle.com/thomasnelson/working-with-dna-sequence-data-for-ml
