#### W4-HW13: Write sets of Python functions that will take a DNA sequence and generate 1) reverse, 2) complement, 3) reverse complement of that input sequence. Provide explanations for what complement and reverse complement sequences mean in molecular biology. Address all [corner cases](http://carpentries-incubator.github.io/python-testing/06-edges/index.html) such as degenerate bases, wrong data type for input, etc. DO NOT use Biopython. Make sure to test your functions for a DNA sequence you found and downloaded from Genbank (please do not use examples shown in the class).


Molecular biology uses several terms to describe DNA strands. DNA strands are built from four nucleotides, whose bases are adenine (A), guanine (G), thymine (T), and cytosine (C); in RNA, uracil (U) takes the place of thymine. DNA forms a double-stranded helix in which two strands are held together by hydrogen bonds. In this helix, adenine pairs with thymine and guanine pairs with cytosine across the two strands. Since the hydrogen-bonding pattern is predictable, you can work out the partner strand given one strand of DNA. This corresponding strand is called the complementary strand, and the complement of a sequence is obtained by replacing each base with its pairing partner. The two strands run in antiparallel directions: the 5' end of one strand lies opposite the 3' end of the other. Because sequences are conventionally written 5' to 3', the partner strand read in its own 5' to 3' direction is the reverse complement of the original, that is, the complement read backwards. This matters in PCR, where each primer has to be the reverse complement of the stretch of template strand it binds to in order to anneal properly.
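
As a small illustration of the three terms, the sketch below applies them to a made-up six-base sequence (not taken from any database). The helper names are hypothetical, and only the four standard bases are handled here.

```python
# Pairing rules for the four standard DNA bases.
PAIRS = str.maketrans("ACGT", "TGCA")

def reverse(seq):
    """Read the same strand backwards."""
    return seq[::-1]

def complement(seq):
    """Base-by-base partner strand, aligned under the input (so it reads 3'->5')."""
    return seq.translate(PAIRS)

def reverse_complement(seq):
    """Partner strand written in the conventional 5'->3' direction."""
    return complement(seq)[::-1]

top = "ATGCCG"  # made-up example, written 5'->3'
print("5'-" + top + "-3'")
print("3'-" + complement(top) + "-5'")
print("reverse complement (5'->3'):", reverse_complement(top))
```

Printing the strand and its complement on consecutive lines shows the antiparallel pairing, while the last line gives the partner strand as it would normally be written.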

### DNA Sequence from Genbank

I downloaded the following DNA sequence from GenBank. The record is the Haemophilus influenzae Rd KW20 chromosome, complete genome. I'll be using the first record from this FASTA file.
The FASTA file was saved to the same directory as the Jupyter Notebook.
The link to the GenBank record is provided below:
 https://www.ncbi.nlm.nih.gov/nuccore/L42023

In [3]:
from Bio import SeqIO

#Importing first record of fasta file using SeqIO.parse and next()
record = next(SeqIO.parse("sequence.fasta", "fasta"))

#Converting fasta sequence to string
Fasta_Sequence = str(record.seq)

print(Fasta_Sequence)

ATGGCAATTAAAATTGGTATCAATGGTTTTGGTCGTATCGGCCGTATCGTATTCCGTGCAGCACAACACCGTGATGACATTGAAGTTGTAGGTATTAACGACTTAATCGACGTTGAATACATGGCTTATATGTTGAAATATGATTCAACTCACGGTCGTTTCGACGGCACTGTTGAAGTGAAAGATGGTAACTTAGTGGTTAATGGTAAAACTATCCGTGTAACTGCAGAACGTGATCCAGCAAACTTAAACTGGGGTGCAATCGGTGTTGATATCGCTGTTGAAGCGACTGGTTTATTCTTAACTGATGAAACTGCTCGTAAACATATCACTGCAGGCGCAAAAAAAGTTGTATTAACTGGCCCATCTAAAGATGCAACCCCTATGTTCGTTCGTGGTGTAAACTTCAACGCATACGCAGGTCAAGATATCGTTTCTAACGCATCTTGTACAACAAACTGTTTAGCTCCTTTAGCACGTGTTGTTCATGAAACTTTCGGTATCAAAGATGGTTTAATGACCACTGTTCACGCAACGACTGCAACTCAAAAAACTGTGGATGGTCCATCAGCTAAAGACTGGCGCGGCGGCCGCGGTGCATCACAAAACATCATTCCATCTTCAACAGGTGCAGCGAAAGCAGTAGGTAAAGTATTACCTGCATTAAACGGTAAATTAACTGGTATGGCTTTCCGTGTTCCAACGCCAAACGTATCTGTTGTTGATTTAACAGTTAATCTTGAAAAACCAGCTTCTTATGATGCAATCAAACAAGCAATCAAAGATGCAGCGGAAGGTAAAACGTTCAATGGCGAATTAAAAGGCGTATTAGGTTACACTGAAGATGCTGTTGTTTCTACTGACTTCAACGGTTGTGCTTTAACTTCTGTATTTGATGCAGACGCTGGTATCGCATTAACTGATTCTTTCGTTAAATTGGTATCTTGGTACGATAACGAAACGGGTTACTCAAACAAAGTATTAGACTTAGTAGCTCATA
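
Since the assignment asks for a solution without Biopython, the first record could also be read with plain Python. The following is a minimal sketch, not the loader used above: the function name `read_first_fasta_record` is hypothetical, and it assumes a standard FASTA layout in which each record starts with a `>` header line.

```python
def read_first_fasta_record(lines):
    """Return (header, sequence) for the first record in an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                break  # a second header means the first record is complete
            header = line[1:]
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is None:
        raise ValueError("No FASTA header line (starting with '>') found")
    return header, "".join(chunks)

# Possible usage with the file from this notebook (left commented out here):
# with open("sequence.fasta") as handle:
#     header, Fasta_Sequence = read_first_fasta_record(handle)
```

Because an open file is itself an iterable of lines, the same function works on a file handle or on a list of strings.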

### DNA Reverse Function


In [113]:
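def DNA_Reverse(sequence):
    '''Takes a DNA sequence and returns its reverse'''
    
    #Edge case: input must be a string
    if not isinstance(sequence, str):
        raise TypeError("Wrong Data Type Input. DNA sequence must be a string")
    
    sequence = sequence.upper()
    #Return the string (instead of printing inside the function) so the result can be reused
    return sequence[::-1]

print(DNA_Reverse(Fasta_Sequence))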

AATCGGAAACATCAACATCTATACTCGATGATTCAGATTATGAAACAAACTCATTGGGCAAAGCAATAGCATGGTTCTATGGTTAAATTGCTTTCTTAGTCAATTACGCTATGGTCGCAGACGTAGTTTATGTCTTCAATTTCGTGTTGGCAACTTCAGTCATCTTTGTTGTCGTAGAAGTCACATTGGATTATGCGGAAAATTAAGCGGTAACTTGCAAAATGGAAGGCGACGTAGAAACTAACGAACAAACTAACGTAGTATTCTTCGACCAAAAAGTTCTAATTGACAATTTAGTTGTTGTCTATGCAAACCGCAACCTTGTGCCTTTCGGTATGGTCAATTAAATGGCAAATTACGTCCATTATGAAATGGATGACGAAAGCGACGTGGACAACTTCTACCTTACTACAAAACACTACGTGGCGCCGGCGGCGCGGTCAGAAATCGACTACCTGGTAGGTGTCAAAAAACTCAACGTCAGCAACGCACTTGTCACCAGTAATTTGGTAGAAACTATGGCTTTCAAAGTACTTGTTGTGCACGATTTCCTCGATTTGTCAAACAACATGTTCTACGCAATCTTTGCTATAGAACTGGACGCATACGCAACTTCAAATGTGGTGCTTGCTTGTATCCCCAACGTAGAAATCTACCCGGTCAATTATGTTGAAAAAAACGCGGACGTCACTATACAAATGCTCGTCAAAGTAGTCAATTCTTATTTGGTCAGCGAAGTTGTCGCTATAGTTGTGGCTAACGTGGGGTCAAATTCAAACGACCTAGTGCAAGACGTCAATGTGCCTATCAAAATGGTAATTGGTGATTCAATGGTAGAAAGTGAAGTTGTCACGGCAGCTTTGCTGGCACTCAACTTAGTATAAAGTTGTATATTCGGTACATAAGTTGCAGCTAATTCAGCAATTATGGATGTTGAAGTTACAGTAGTGCCACAACACGACGTGCCTTATGCTATGCCGGCTATGCTGGTTTTGGTAAC

### DNA Complement Function

In [4]:
def DNA_Complement(sequence):
    '''Takes a DNA sequence and returns its complement. Note: Works with Degenerate Bases'''

    if not isinstance(sequence, str):
        raise TypeError("Wrong Data Type Input. DNA sequence must be a string")

    pairs = {
        'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C',
        #Adding edge cases for degenerate bases (explanation is below in HW14)
        'R': 'Y', 'Y': 'R', 'M': 'K', 'K': 'M',
        'S': 'S', 'W': 'W', #S (C/G) and W (A/T) are their own complements
        'B': 'V', 'V': 'B', 'D': 'H', 'H': 'D',
        'N': 'N', #Complement for any nucleotide is any nucleotide
    }
    #Any other character is marked with 'U' (for Unknown base, not uracil)
    return ''.join(pairs.get(letter, 'U') for letter in sequence.upper())

print(DNA_Complement(Fasta_Sequence))

TACCGTTAATTTTAACCATAGTTACCAAAACCAGCATAGCCGGCATAGCATAAGGCACGTCGTGTTGTGGCACTACTGTAACTTCAACATCCATAATTGCTGAATTAGCTGCAACTTATGTACCGAATATACAACTTTATACTAAGTTGAGTGCCAGCAAAGCTGCCGTGACAACTTCACTTTCTACCATTGAATCACCAATTACCATTTTGATAGGCACATTGACGTCTTGCACTAGGTCGTTTGAATTTGACCCCACGTTAGCCACAACTATAGCGACAACTTCGCTGACCAAATAAGAATTGACTACTTTGACGAGCATTTGTATAGTGACGTCCGCGTTTTTTTCAACATAATTGACCGGGTAGATTTCTACGTTGGGGATACAAGCAAGCACCACATTTGAAGTTGCGTATGCGTCCAGTTCTATAGCAAAGATTGCGTAGAACATGTTGTTTGACAAATCGAGGAAATCGTGCACAACAAGTACTTTGAAAGCCATAGTTTCTACCAAATTACTGGTGACAAGTGCGTTGCTGACGTTGAGTTTTTTGACACCTACCAGGTAGTCGATTTCTGACCGCGCCGCCGGCGCCACGTAGTGTTTTGTAGTAAGGTAGAAGTTGTCCACGTCGCTTTCGTCATCCATTTCATAATGGACGTAATTTGCCATTTAATTGACCATACCGAAAGGCACAAGGTTGCGGTTTGCATAGACAACAACTAAATTGTCAATTAGAACTTTTTGGTCGAAGAATACTACGTTAGTTTGTTCGTTAGTTTCTACGTCGCCTTCCATTTTGCAAGTTACCGCTTAATTTTCCGCATAATCCAATGTGACTTCTACGACAACAAAGATGACTGAAGTTGCCAACACGAAATTGAAGACATAAACTACGTCTGCGACCATAGCGTAATTGACTAAGAAAGCAATTTAACCATAGAACCATGCTATTGCTTTGCCCAATGAGTTTGTTTCATAATCTGAATCATCGAGTAT

### DNA Reverse Complement

In [5]:
def DNA_Reverse_Complement(sequence):
    '''Takes a DNA sequence and returns its reverse complement. Note: Works with Degenerate Bases'''

    if not isinstance(sequence, str):
        raise TypeError("Wrong Data Type Input. DNA sequence must be a string")

    pairs = {
        'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C',
        #Adding edge cases for degenerate bases
        'R': 'Y', 'Y': 'R', 'M': 'K', 'K': 'M',
        'S': 'S', 'W': 'W', #S (C/G) and W (A/T) are their own complements
        'B': 'V', 'V': 'B', 'D': 'H', 'H': 'D',
        'N': 'N', #Complement for any nucleotide is any nucleotide
    }
    #Any other character is marked with 'U' (for Unknown base, not uracil)
    compl = ''.join(pairs.get(letter, 'U') for letter in sequence.upper())
    #Reverse the complemented string
    return compl[::-1]

print(DNA_Reverse_Complement(Fasta_Sequence))

TTAGCCTTTGTAGTTGTAGATATGAGCTACTAAGTCTAATACTTTGTTTGAGTAACCCGTTTCGTTATCGTACCAAGATACCAATTTAACGAAAGAATCAGTTAATGCGATACCAGCGTCTGCATCAAATACAGAAGTTAAAGCACAACCGTTGAAGTCAGTAGAAACAACAGCATCTTCAGTGTAACCTAATACGCCTTTTAATTCGCCATTGAACGTTTTACCTTCCGCTGCATCTTTGATTGCTTGTTTGATTGCATCATAAGAAGCTGGTTTTTCAAGATTAACTGTTAAATCAACAACAGATACGTTTGGCGTTGGAACACGGAAAGCCATACCAGTTAATTTACCGTTTAATGCAGGTAATACTTTACCTACTGCTTTCGCTGCACCTGTTGAAGATGGAATGATGTTTTGTGATGCACCGCGGCCGCCGCGCCAGTCTTTAGCTGATGGACCATCCACAGTTTTTTGAGTTGCAGTCGTTGCGTGAACAGTGGTCATTAAACCATCTTTGATACCGAAAGTTTCATGAACAACACGTGCTAAAGGAGCTAAACAGTTTGTTGTACAAGATGCGTTAGAAACGATATCTTGACCTGCGTATGCGTTGAAGTTTACACCACGAACGAACATAGGGGTTGCATCTTTAGATGGGCCAGTTAATACAACTTTTTTTGCGCCTGCAGTGATATGTTTACGAGCAGTTTCATCAGTTAAGAATAAACCAGTCGCTTCAACAGCGATATCAACACCGATTGCACCCCAGTTTAAGTTTGCTGGATCACGTTCTGCAGTTACACGGATAGTTTTACCATTAACCACTAAGTTACCATCTTTCACTTCAACAGTGCCGTCGAAACGACCGTGAGTTGAATCATATTTCAACATATAAGCCATGTATTCAACGTCGATTAAGTCGTTAATACCTACAACTTCAATGTCATCACGGTGTTGTGCTGCACGGAATACGATACGGCCGATACGACCAAAACCATTG
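
A few of the corner cases named in the assignment can be checked with plain `assert` statements. The sketch below is self-contained, so it defines its own small helper (`revcomp`, a hypothetical name) instead of reusing the functions above. As an assumption that differs from the notebook's functions, it raises a `ValueError` for symbols outside the IUPAC alphabet rather than substituting a placeholder.

```python
# IUPAC complement table: standard bases plus degenerate codes.
ALPHABET = "ACGTRYMKSWBDHVN"
IUPAC = str.maketrans(ALPHABET, "TGCAYRKMSWVHDBN")

def revcomp(sequence):
    """Reverse complement; raises on non-string input or unknown symbols."""
    if not isinstance(sequence, str):
        raise TypeError("DNA sequence must be a string")
    sequence = sequence.upper()
    unknown = set(sequence) - set(ALPHABET)
    if unknown:
        raise ValueError(f"Unknown base(s): {sorted(unknown)}")
    return sequence.translate(IUPAC)[::-1]

def raises(exc_type, func, *args):
    """Return True if func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False

assert revcomp("") == ""                           # empty input
assert revcomp("atgc") == "GCAT"                   # lower case is accepted
assert revcomp("CCCCCGATAGNR") == "YNCTATCGGGGG"   # degenerate bases
assert revcomp(revcomp(ALPHABET)) == ALPHABET      # applying it twice restores the input
assert raises(TypeError, revcomp, 12345)           # wrong data type
assert raises(ValueError, revcomp, "ATGX")         # symbol outside the IUPAC alphabet
```

The check that applying the function twice returns the original input is a quick way to catch an inconsistent pairing table, since every complement pair has to map back to itself.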

#### W4-HW14: Explain how the built-in reverse complement function works in Biopython. How does it handle degenerate bases?


### Reverse_Complement()

In [12]:
from Bio.Seq import Seq
my_dna = Seq("CCCCCGATAGNR")
my_dna.reverse_complement()

Seq('YNCTATCGGGGG')

In this example, reverse_complement() is called as a method of a Seq object. Seq is a special datatype provided by Biopython (in the Bio.Seq module). It behaves much like a string, but adds methods for biological operations. Seq objects can also contain degenerate bases. A degenerate base is a placeholder for a position where more than one nucleotide is possible.

The universal (IUPAC) codes for specifying degenerate bases are: R = A/G, Y = C/T, M = A/C, K = G/T, S = C/G, W = A/T, B = C/G/T, D = A/G/T, H = A/C/T, V = A/C/G, and N = A/C/G/T.

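So the complement pairs for degenerate bases are the following:
* R and Y
* M and K
* B and V
* D and H
* S and W are each their own complement (the complement of C/G is G/C, and the complement of A/T is T/A)
* N for any nucleotide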
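
The pairing for a degenerate code follows from complementing every base that the code can stand for. The sketch below derives the pairs from the set definitions given above instead of hard-coding them; the names `IUPAC_SETS`, `CODE_FOR_SET`, and `complement_code` are hypothetical and are not part of Biopython.

```python
# Each IUPAC code stands for a set of possible bases.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASE_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Reverse lookup: set of bases -> code letter.
CODE_FOR_SET = {frozenset(bases): code for code, bases in IUPAC_SETS.items()}

def complement_code(code):
    """Complement every base a code can stand for, then look up the matching code."""
    complemented = frozenset(BASE_PAIR[base] for base in IUPAC_SETS[code])
    return CODE_FOR_SET[complemented]

for code in "RYMKSWBDHVN":
    print(code, "->", complement_code(code))
```

For example, B stands for C/G/T, whose complements are G/C/A, and the code for A/C/G is V.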

The code for HW13 uses this key for the complement and reverse complement functions

Conceptually, reverse_complement() first replaces each letter in the sequence with its complement. If there is a degenerate base, it uses the complement of that degenerate base. In the above example, since R = G or A, its complement is Y (which denotes C or T). Finally, the complemented sequence is reversed and returned as a new Seq.

#### W4-HW15: What does the "format_type" keyword in qblast function do? Explain and show with examples. 

In [None]:
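from Bio.Blast import NCBIWWW

#Submitting the query to NCBI and requesting plain-text output
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116", format_type="Text")

#Reading the handle to print the results themselves rather than the handle object
print(result_handle.read())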


In [None]:
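from Bio.Blast import NCBIWWW

#Submitting the same query, this time requesting HTML output
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116", format_type="HTML")

#Printing the results line by line (each line already ends with a newline)
for line in result_handle:
    print(line, end="")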

The qblast function can return the BLAST results in various formats, which you can choose with the optional format_type keyword: "HTML", "Text", "ASN.1", or "XML". The default is "XML".

Unfortunately, qblast sends the query to the NCBI servers and waits for the search to finish, so the code takes a long time to run, and it did not output properly for me.

#### W4-HW16: Can you fix the following code, so that it can extract and return two random sequences from the given fasta file?

    # let's get some random sequences from our large fasta file: without using Biopython (you can use other modules)

    import random

    with open('../datasets/ls_orchid.fasta') as f:
        data = f.read().splitlines()
        print(data)
        for i in random.sample(range(0, len(data), 2), 2):
            print(data[i])
            print(data[i+1])

In [159]:
import random
with open('../../../trgn515_2023/datasets/ls_orchid.fasta') as f:
    data = f.read().splitlines()
    
#Removing empty elements in list
data = [x for x in data if x != ""]
    
#Removing all lines that start with a > to exclude id line from list 
fasta_sequences = [x for x in data if not x.startswith('>')]
    
#Picks 2 random sequence lines from the list without repetition
final_samples = random.sample(fasta_sequences, 2)
    
print(final_samples[0])
print(final_samples[1])

GCGCCCGAGGCCATCAGGCCAAGGGCACGCCTGCCTGGGCATTGCGAGTCAAATCTCTCCCTTAATGAGG
AGATAGAACCGGCAGAGGTCTTCGTCCTCCATGGAACCGGGGAGGCCCGGCATACCACCATACCCCCAAT
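
If the sequences in a FASTA file are wrapped across several lines, sampling individual lines returns fragments rather than whole sequences. The sketch below groups the lines into records first and then samples two records. The function name `sample_fasta_records` and the inline example data are hypothetical, and the `rng` parameter is only there so that a seeded generator can be passed in for reproducible runs.

```python
import random

def sample_fasta_records(lines, k=2, rng=random):
    """Group FASTA lines into (header, sequence) records and return k random ones."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            records.append([line[1:], ""])  # start a new record
        elif records:
            records[-1][1] += line  # append a wrapped sequence line
    # random.sample picks k distinct records (raises ValueError if k is too large)
    return rng.sample([tuple(record) for record in records], k)

# Made-up data standing in for a wrapped FASTA file.
example = [">seq1", "ATGC", "GGTA", "", ">seq2", "TTTT", ">seq3", "CCCC", "GGGG"]
for header, sequence in sample_fasta_records(example, k=2):
    print(header, sequence)
```

With a real file, the open file handle can be passed in directly, for example `sample_fasta_records(f)` inside the `with open(...) as f:` block.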
