## Analysis of Insulin: from DNA to Protein Project

### In this project I study and process the DNA sequence for human insulin, obtained from NCBI:NM_000207.3. Using Python and a toolkit module, I validate the DNA sequence and show an example of a non-DNA string that fails validation. Additionally, to represent real biological functions, the reverse complement and the transcription result are determined. Furthermore, statistical parameters such as nucleotide counts, GC content/ratio, and the usage ratio of codons for a given amino acid are calculated. Finally, open reading frames and the proteins found within the DNA sequence are elucidated.

In [1]:
from DNAtoolkit import *

In [2]:
# Insulin DNA NCBI Reference Sequence: NM_000207.3
DNA = 'AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC'

In [3]:
# check that the string is a valid DNA sequence using the validateSeq function:
print(f" 5'{validateSeq(DNA)}3'")

 5'AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC3'


In [4]:
# the same validateSeq check on a string that is not a DNA sequence (it returns False):
not_DNA = 'ATGCRBFE'
print(f" 5'{validateSeq(not_DNA)}3'")


 5'False3'
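
The source of `DNAtoolkit` is not shown here, so the following is only a minimal sketch of how a validator with the behaviour seen above (return the sequence when valid, `False` otherwise) could be written. The name `validate_seq_sketch` and the upper-casing step are assumptions for illustration, not the toolkit's actual code.

```python
# Hypothetical stand-in for validateSeq; not the DNAtoolkit implementation.
NUCLEOTIDES = {"A", "C", "G", "T"}


def validate_seq_sketch(seq):
    """Return the upper-cased sequence if it only contains A, C, G, T; else False."""
    candidate = seq.upper()  # assumption: lower-case input is accepted
    for base in candidate:
        if base not in NUCLEOTIDES:
            return False
    return candidate


print(validate_seq_sketch("ATGC"))      # ATGC
print(validate_seq_sketch("ATGCRBFE"))  # False
```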


### Counting Nucleotides

In [39]:
# Determine the total number of nucleotides in the DNA sequence:
print (f' Sequence Length:{len(DNA)}\n')

 Sequence Length:465



### Three ways to count individual nucleotides

In [6]:
# I  simple method for nucleotide count
A = DNA.count ('A')
C = DNA.count ('C')
G = DNA.count ('G')
T = DNA.count ('T')
print (A, C, G, T)
print ('GC ratio of DNA sequence =', (G+C)/(G+C+A+T))

91 156 141 77
GC ratio of DNA sequence = 0.6387096774193548


In [7]:
# II nucleotide count reported with identifying messages.
# Strip surrounding whitespace and quotes so a pasted, quoted sequence is counted correctly.
DNA_seq = input('please enter DNA sequence below:\n').strip().strip('\'"')
A = DNA_seq.count('A')
C = DNA_seq.count('C')
G = DNA_seq.count('G')
T = DNA_seq.count('T')
GC_ratio = (G + C) / (G + C + T + A)
print('the number of nucleotides of your DNA sequence is', len(DNA_seq))
print('the number of A in your DNA sequence is:', A)
print('the number of C in your DNA sequence is:', C)
print('the number of G in your DNA sequence is:', G)
print('the number of T in your DNA sequence is:', T)
print('the GC ratio of your DNA sequence is:', GC_ratio)


please enter DNA sequence below:
'AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC'
the number of nucleotides of your DNA sequence is 465
the number of A in your DNA sequence is: 91
the number of C in your DNA sequence is: 156
the number of G in your DNA sequence is: 141
the number of T in your DNA sequence is: 77
the GC ratio of your DNA sequence is: 0.6387096774193548


In [8]:
# III count nucleotides in the DNA sequence using the 'countnucFrequency' function from DNAtoolkit
# the function returns a dictionary with the nucleotides as keys and their counts as values
print(countnucFrequency(DNA))

{'A': 91, 'C': 156, 'G': 141, 'T': 77}
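
A dictionary of this shape can be produced in several ways. One option, sketched here with the standard library's `collections.Counter`, is shown below; the function name `count_nuc_frequency_sketch` is hypothetical and this is not necessarily how `countnucFrequency` is implemented.

```python
from collections import Counter


def count_nuc_frequency_sketch(seq):
    """Count each nucleotide, always reporting A, C, G, T in that order."""
    counts = Counter(seq)
    # Counter returns 0 for missing keys, so absent bases are still reported.
    return {base: counts[base] for base in "ACGT"}


# First 15 nucleotides of the insulin sequence used in this notebook.
print(count_nuc_frequency_sketch("AGCCCTCCAGGACAG"))
```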


### Transcribing DNA to RNA

In [10]:
# converting the DNA sequence to its reverse complement, reported 5'-3', with the 'rev_complement' function
complement_DNA = rev_complement(DNA)
print(f" 5'{complement_DNA}3'")

 5'GCTGGTTCAAGGGCTTTATTCCATCTCTCTCGGTGCAGGAGGCGGCGGGTGTGGGGCTGCCTGCGGGCTGCGTCTAGTTGCAGTAGTTCTCCAGCTGGTAGAGGGAGCAGATGCTGGTACAGCATTGTTCCACAATGCCACGCTTCTGCAGGGACCCCTCCAGGGCCAAGGGCTGCAGGCTGCCTGCACCAGGGCCCCCGCCCAGCTCCACCTGCCCCACCTGCAGGTCCTCTGCCTCCCGGCGGGTCTTGGGTGTGTAGAAGAAGCCTCGTTCCCCGCACACTAGGTAGAGAGCTTCCACCAGGTGTGAGCCGCACAGGTGTTGGTTCACAAAGGCTGCGGCTGGGTCAGGTCCCCAGAGGGCCAGCAGCGCCAGCAGGGGCAGGAGGCGCATCCACAGGGCCATGGCAGAAGGACAGTGATCTGCTTGATGGCCTCTTCTGATGCAGCCTGTCCTGGAGGGCT3'
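
As an illustration of what a reverse complement is, the sketch below swaps each base for its partner (A with T, C with G) and then reverses the string so the result reads 5' to 3'. The name `rev_complement_sketch` is an assumption; the toolkit's own `rev_complement` may be written differently.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rev_complement_sketch(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


# First six nucleotides of the insulin sequence; the result matches
# the last six nucleotides of the reverse complement printed above.
print(rev_complement_sketch("AGCCCT"))  # AGGGCT
```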


In [11]:
# converting the complementary DNA sequence to RNA with the 'transcription' function
RNA = transcription(complement_DNA)
print(f"5'{RNA}3'")

5'GCUGGUUCAAGGGCUUUAUUCCAUCUCUCUCGGUGCAGGAGGCGGCGGGUGUGGGGCUGCCUGCGGGCUGCGUCUAGUUGCAGUAGUUCUCCAGCUGGUAGAGGGAGCAGAUGCUGGUACAGCAUUGUUCCACAAUGCCACGCUUCUGCAGGGACCCCUCCAGGGCCAAGGGCUGCAGGCUGCCUGCACCAGGGCCCCCGCCCAGCUCCACCUGCCCCACCUGCAGGUCCUCUGCCUCCCGGCGGGUCUUGGGUGUGUAGAAGAAGCCUCGUUCCCCGCACACUAGGUAGAGAGCUUCCACCAGGUGUGAGCCGCACAGGUGUUGGUUCACAAAGGCUGCGGCUGGGUCAGGUCCCCAGAGGGCCAGCAGCGCCAGCAGGGGCAGGAGGCGCAUCCACAGGGCCAUGGCAGAAGGACAGUGAUCUGCUUGAUGGCCUCUUCUGAUGCAGCCUGUCCUGGAGGGCU3'
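
Comparing the two outputs above, the RNA string is the input string with every T replaced by U. A minimal sketch of that substitution follows; `transcription_sketch` is a hypothetical name, and the toolkit function may do additional checks.

```python
def transcription_sketch(seq):
    """Replace thymine (T) with uracil (U)."""
    return seq.replace("T", "U")


# First eight nucleotides of the reverse complement shown above.
print(transcription_sketch("GCTGGTTC"))  # GCUGGUUC
```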


### Calculating GC content in complete DNA/RNA sequence & DNA sequence subsections by size.

In [12]:
# calculate GC content in DNA sequence using 'GC_content' function
print(f' GC content in DNA sequence is: {GC_content(DNA)}%')

 GC content in DNA sequence is: 64%


In [13]:
# calculate GC content in RNA sequence using 'GC_content' function
print(f' GC content in RNA sequence is: {GC_content(RNA)}%')

 GC content in RNA sequence is: 64%


In [14]:
# calculate GC content in subsections of sequence using 'GC_content_subsec' function.
# default subsection k=20.
print (f' The GC content in DNA subsections k=75: {GC_content_subsec(DNA,k=75)}%')

 The GC content in DNA subsections k=75: [59, 68, 61, 75, 60, 61]%
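
The percentages above can be reproduced in principle with the sketch below. Two points are assumptions inferred from the output rather than from the toolkit source: values are rounded to whole percentages, and only complete windows of length `k` are scored (a 465-nucleotide sequence with `k=75` yields six values, so the trailing 15 nucleotides appear to be dropped). The function names are hypothetical.

```python
def gc_content_sketch(seq):
    """GC percentage of a DNA or RNA string, rounded to a whole number."""
    gc = seq.count("G") + seq.count("C")
    return round(gc / len(seq) * 100)


def gc_content_subsec_sketch(seq, k=20):
    """GC percentage of consecutive, non-overlapping windows of length k.

    Assumption: an incomplete trailing window is ignored.
    """
    return [gc_content_sketch(seq[i:i + k]) for i in range(0, len(seq) - k + 1, k)]


print(gc_content_sketch("GGCCAATT"))               # 50
print(gc_content_subsec_sketch("GGGGAAAACC", k=4))  # [100, 0]
```

Applied to the whole sequence, the first function gives round(297 / 465 × 100) = 64, in agreement with the nucleotide counts reported earlier (G = 141, C = 156).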


### Determine the frequency ratio of codons for a given AA in a DNA sequence

In [26]:
# calculating the frequency of codons for a given AA in a DNA sequence by applying the 'codon_usage' function.
# frequency = the number of times a codon appears in the sequence / the number of times
# the AA is coded for. The function produces a dictionary with the codon as the key.
L_codon_freq = codon_usage(DNA,'L')
M_codon_freq = codon_usage(DNA,'M')
print('L & M AA codon frequencies are respectively:')
print (L_codon_freq)
print ()
print (M_codon_freq)

L & M AA codon frequencies are respectively:
{'CTT': 0.14, 'CTG': 0.43, 'TTG': 0.29, 'CTC': 0.14}

{'ATG': 1.0}
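
The ratio described in the comments can be sketched as follows. The codon table is built from the standard genetic code with `_` as the stop symbol, matching the notation used in this notebook. Reading codons from the first nucleotide and rounding to two decimals are assumptions based on the output above, and `codon_usage_sketch` is a hypothetical name.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G; '_' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_usage_sketch(seq, amino_acid):
    """Fraction of each codon among all codons that encode `amino_acid`."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    hits = Counter(c for c in codons if CODON_TABLE.get(c) == amino_acid)
    total = sum(hits.values())
    return {codon: round(n / total, 2) for codon, n in hits.items()}


print(codon_usage_sketch("CTGCTGTTGATG", "L"))  # {'CTG': 0.67, 'TTG': 0.33}
print(codon_usage_sketch("CTGCTGTTGATG", "M"))  # {'ATG': 1.0}
```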


### Convert DNA sequence to AA sequence

In [28]:
# translating DNA to AA sequence using the 'translate_seq' function
# the output is a list of AA. 
DNA_to_AA= translate_seq(DNA)
print(DNA_to_AA)

['S', 'P', 'P', 'G', 'Q', 'A', 'A', 'S', 'E', 'E', 'A', 'I', 'K', 'Q', 'I', 'T', 'V', 'L', 'L', 'P', 'W', 'P', 'C', 'G', 'C', 'A', 'S', 'C', 'P', 'C', 'W', 'R', 'C', 'W', 'P', 'S', 'G', 'D', 'L', 'T', 'Q', 'P', 'Q', 'P', 'L', '_', 'T', 'N', 'T', 'C', 'A', 'A', 'H', 'T', 'W', 'W', 'K', 'L', 'S', 'T', '_', 'C', 'A', 'G', 'N', 'E', 'A', 'S', 'S', 'T', 'H', 'P', 'R', 'P', 'A', 'G', 'R', 'Q', 'R', 'T', 'C', 'R', 'W', 'G', 'R', 'W', 'S', 'W', 'A', 'G', 'A', 'L', 'V', 'Q', 'A', 'A', 'C', 'S', 'P', 'W', 'P', 'W', 'R', 'G', 'P', 'C', 'R', 'S', 'V', 'A', 'L', 'W', 'N', 'N', 'A', 'V', 'P', 'A', 'S', 'A', 'P', 'S', 'T', 'S', 'W', 'R', 'T', 'T', 'A', 'T', 'R', 'R', 'S', 'P', 'Q', 'A', 'A', 'P', 'H', 'P', 'P', 'P', 'P', 'A', 'P', 'R', 'E', 'M', 'E', '_', 'S', 'P', '_', 'T', 'S']
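
Translation reads the sequence three nucleotides at a time and looks each codon up in the genetic code. The sketch below shows the idea with the standard code and `_` for stop codons; the `init_pos` parameter and the name `translate_seq_sketch` are assumptions, since the toolkit's signature is not shown.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G; '_' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_seq_sketch(seq, init_pos=0):
    """Translate complete codons starting at init_pos into a list of amino acids."""
    return [CODON_TABLE[seq[pos:pos + 3]] for pos in range(init_pos, len(seq) - 2, 3)]


# First 18 nucleotides of the insulin sequence: matches the start of the list above.
print(translate_seq_sketch("AGCCCTCCAGGACAGGCT"))  # ['S', 'P', 'P', 'G', 'Q', 'A']
```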


In [29]:
# Generate reading frames from sequence using 'gen_reading_frames' function
print ('[9]  Reading frames:')
for frame in gen_reading_frames(DNA):
    print(frame)

[9]  Reading frames:
['S', 'P', 'P', 'G', 'Q', 'A', 'A', 'S', 'E', 'E', 'A', 'I', 'K', 'Q', 'I', 'T', 'V', 'L', 'L', 'P', 'W', 'P', 'C', 'G', 'C', 'A', 'S', 'C', 'P', 'C', 'W', 'R', 'C', 'W', 'P', 'S', 'G', 'D', 'L', 'T', 'Q', 'P', 'Q', 'P', 'L', '_', 'T', 'N', 'T', 'C', 'A', 'A', 'H', 'T', 'W', 'W', 'K', 'L', 'S', 'T', '_', 'C', 'A', 'G', 'N', 'E', 'A', 'S', 'S', 'T', 'H', 'P', 'R', 'P', 'A', 'G', 'R', 'Q', 'R', 'T', 'C', 'R', 'W', 'G', 'R', 'W', 'S', 'W', 'A', 'G', 'A', 'L', 'V', 'Q', 'A', 'A', 'C', 'S', 'P', 'W', 'P', 'W', 'R', 'G', 'P', 'C', 'R', 'S', 'V', 'A', 'L', 'W', 'N', 'N', 'A', 'V', 'P', 'A', 'S', 'A', 'P', 'S', 'T', 'S', 'W', 'R', 'T', 'T', 'A', 'T', 'R', 'R', 'S', 'P', 'Q', 'A', 'A', 'P', 'H', 'P', 'P', 'P', 'P', 'A', 'P', 'R', 'E', 'M', 'E', '_', 'S', 'P', '_', 'T', 'S']
['A', 'L', 'Q', 'D', 'R', 'L', 'H', 'Q', 'K', 'R', 'P', 'S', 'S', 'R', 'S', 'L', 'S', 'F', 'C', 'H', 'G', 'P', 'V', 'D', 'A', 'P', 'P', 'A', 'P', 'A', 'G', 'A', 'A', 'G', 'P', 'L', 'G', 'T', '_', 'P', 'S
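
The frames printed above are amino-acid lists. Leaving translation aside, the sketch below shows which nucleotide strings six reading frames correspond to: three offsets on the given strand and three on its reverse complement. This frame ordering is an assumption, although it is consistent with the first two frames shown; `six_frames_sketch` is a hypothetical helper, not part of DNAtoolkit.

```python
def six_frames_sketch(seq):
    """Return the six reading frames as nucleotide strings of whole codons."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse = seq.translate(complement)[::-1]
    frames = []
    for strand in (seq, reverse):
        for offset in range(3):
            # Keep only complete codons from this offset onward.
            usable = (len(strand) - offset) // 3 * 3
            frames.append(strand[offset:offset + usable])
    return frames


for frame in six_frames_sketch("ATGGCCTAA"):
    print(frame)
```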

#### Extracting protein sequences from open reading frames (AA sequence, list)

In [35]:
test_AA_sequence1 = ['S', 'P', 'P', 'G', 'Q', 'A', 'A', 'S', 'E', 'E', 'A', 'I', 'K', 'Q', 'I', 'T', 'V', 'L', 'L', 'P', 'W', 'P', 'C', 'G', 'C', 'A', 'S', 'C', 'P', 'C', 'W', 'R', 'C', 'W', 'P', 'S', 'G', 'D', 'L', 'T', 'Q', 'P', 'Q', 'P', 'L', '_', 'T', 'N', 'T', 'C', 'A', 'A', 'H', 'T', 'W', 'W', 'K', 'L', 'S', 'T', '_', 'C', 'A', 'G', 'N', 'E', 'A', 'S', 'S', 'T', 'H', 'P', 'R', 'P', 'A', 'G', 'R', 'Q', 'R', 'T', 'C', 'R', 'W', 'G', 'R', 'W', 'S', 'W', 'A', 'G', 'A', 'L', 'V', 'Q', 'A', 'A', 'C', 'S', 'P', 'W', 'P', 'W', 'R', 'G', 'P', 'C', 'R', 'S', 'V', 'A', 'L', 'W', 'N', 'N', 'A', 'V', 'P', 'A', 'S', 'A', 'P', 'S', 'T', 'S', 'W', 'R', 'T', 'T', 'A', 'T', 'R', 'R', 'S', 'P', 'Q', 'A', 'A', 'P', 'H', 'P', 'P', 'P', 'P', 'A', 'P', 'R', 'E', 'M', 'E', '_', 'S', 'P', '_', 'T', 'S']
test_AA_sequence2 = ['P', 'S', 'R', 'T', 'G', 'C', 'I', 'R', 'R', 'G', 'H', 'Q', 'A', 'D', 'H', 'C', 'P', 'S', 'A', 'M', 'A', 'L', 'W', 'M', 'R', 'L', 'L', 'P', 'L', 'L', 'A', 'L', 'L', 'A', 'L', 'W', 'G', 'P', 'D', 'P', 'A', 'A', 'A', 'F', 'V', 'N', 'Q', 'H', 'L', 'C', 'G', 'S', 'H', 'L', 'V', 'E', 'A', 'L', 'Y', 'L', 'V', 'C', 'G', 'E', 'R', 'G', 'F', 'F', 'Y', 'T', 'P', 'K', 'T', 'R', 'R', 'E', 'A', 'E', 'D', 'L', 'Q', 'V', 'G', 'Q', 'V', 'E', 'L', 'G', 'G', 'G', 'P', 'G', 'A', 'G', 'S', 'L', 'Q', 'P', 'L', 'A', 'L', 'E', 'G', 'S', 'L', 'Q', 'K', 'R', 'G', 'I', 'V', 'E', 'Q', 'C', 'C', 'T', 'S', 'I', 'C', 'S', 'L', 'Y', 'Q', 'L', 'E', 'N', 'Y', 'C', 'N', '_', 'T', 'Q', 'P', 'A', 'G', 'S', 'P', 'T', 'P', 'A', 'A', 'S', 'C', 'T', 'E', 'R', 'D', 'G', 'I', 'K', 'P', 'L', 'N', 'Q']

In [33]:
# compute all possible proteins from an AA sequence using the 'proteins_from_rf' function
print (proteins_from_rf(test_AA_sequence1))

['ME']


In [36]:
print (proteins_from_rf(test_AA_sequence2))

['MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN', 'MRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN']
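
The two results above suggest that every `M` opens a candidate protein and every stop (`_`) closes all candidates that are currently open, which is why a second `M` inside an open frame produces a shorter, nested protein. A sketch of that scan is given below; `proteins_from_rf_sketch` is a hypothetical name, and dropping candidates that never reach a stop is an assumption.

```python
def proteins_from_rf_sketch(aa_seq):
    """Collect every M-to-stop stretch from a list of amino acids ('_' = stop)."""
    open_proteins = []  # candidates started by an M and not yet closed
    proteins = []
    for aa in aa_seq:
        if aa == "_":
            # A stop codon closes every open candidate.
            proteins.extend(open_proteins)
            open_proteins = []
        else:
            if aa == "M":
                open_proteins.append("")  # a new candidate starts here
            open_proteins = [p + aa for p in open_proteins]
    return proteins


print(proteins_from_rf_sketch(list("SMAMK_ME_MQ")))  # ['MAMK', 'MK', 'ME']
```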


#### Extract proteins from all 6 reading frames in a DNA sequence

In [44]:
# compute all proteins from the 6 open reading frames in the DNA sequence and arrange them by length
print('\n All proteins in 6 open reading frames:\n')
for prot in all_proteins_from_orfs(DNA,0,0,True):
    print(f'{prot}\n')


 All proteins in 6 open reading frames:

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

MRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

MLVQHCSTMPRFCRDPSRAKGCRLPAPGPPPSSTCPTCRSSASRRVLGV

MPRFCRDPSRAKGCRLPAPGPPPSSTCPTCRSSASRRVLGV

MAEGQ

ME
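
The final step, gathering proteins from several frames and ordering them longest first, can be sketched as below. It takes already-translated frames (lists of amino acids with `_` as stop) and uses a regular-expression lookahead instead of an explicit scan, which is a swapped-in technique and not necessarily what `all_proteins_from_orfs` does. The extra arguments of the toolkit call (`0, 0, True`) are not modelled, and the function name is hypothetical.

```python
import re

# Lookahead so that overlapping (nested) M...stop stretches are all found.
ORF_PATTERN = re.compile(r"(?=(M[^_]*)_)")


def proteins_sorted_by_length_sketch(frames):
    """Collect M-to-stop proteins from translated frames, longest first."""
    proteins = []
    for frame in frames:
        text = "".join(frame)
        proteins.extend(match.group(1) for match in ORF_PATTERN.finditer(text))
    return sorted(proteins, key=len, reverse=True)


frames = [list("SMAMK_ME_MQ"), list("PPME_")]
for prot in proteins_sorted_by_length_sketch(frames):
    print(prot)
```

Because Python's sort is stable, proteins of equal length keep the order in which their frames were scanned.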

