# DNA exercises
This section shows how to use Python to perform common calculations on DNA sequences, such as:
* GC content in DNA

By definition, the GC content of a DNA string is the percentage of symbols in the string that are 'C' or 'G'. It is calculated as follows, where each letter stands for the number of times that base occurs in the string:
$\frac{G+C}{A+T+G+C} \times 100$
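
As a quick check of the formula, here is a minimal pure-Python sketch. The helper name `gc_content` and the example string are illustrative assumptions and do not come from the original notebook.

```python
def gc_content(sequence: str) -> float:
    """Return the GC percentage of a DNA string (hypothetical helper)."""
    sequence = sequence.upper()
    if not sequence:
        # Assumption: report 0 for an empty sequence instead of dividing by zero
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence) * 100


# 'AGCTATAG' has 2 G and 1 C out of 8 symbols: 3 / 8 * 100 = 37.5
example = "AGCTATAG"
print(gc_content(example))
```

Note that `len(sequence)` counts every symbol, so ambiguous bases such as 'N' would enter the denominator; the formula above assumes the string contains only A, T, G and C.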


In [7]:
#!wget https://raw.githubusercontent.com/lanacaldarevic/rosalind/main/gc_content/gc_content.fasta


In [1]:
%load_ext autoreload
%autoreload 2

In [4]:
#import sys
#sys.path.insert(0,"../bioinformatics_guide")


In [8]:
from bioinformatics_guide.utils.paths import data_dir
seq_file = data_dir('raw/gc_content.fasta')  # path to the sequences file

In [10]:
from Bio import SeqIO

# Track the record with the highest GC content seen so far
max_sequence_id = None
max_gc_content = 0

for seq_record in SeqIO.parse(seq_file, 'fasta'):
    sequence = str(seq_record.seq)  # convert the Seq object to a plain string
    sequence_id = seq_record.id
    # GC content as a percentage of the sequence length
    gc_content = (sequence.count('C') + sequence.count('G')) / len(sequence) * 100
    if gc_content > max_gc_content:
        max_sequence_id = sequence_id
        max_gc_content = gc_content

print(f'ID of the sequence with the highest GC content: {max_sequence_id}')
print(f'Highest GC content percentage: {max_gc_content}')

ID of the sequence with the highest GC content: Rosalind_0132
Highest GC content percentage: 53.96308360477742
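
For readers without Biopython installed, the same search can be sketched with the standard library only. The parser below is a simplified assumption of the FASTA format (header lines start with '>', sequence lines may wrap), and the two records are made up for illustration; they are not taken from gc_content.fasta.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Keep only the first word of the header as the identifier
            record_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def gc_percent(sequence):
    """GC content of a non-empty DNA string, as a percentage."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence) * 100


# Made-up records, used only for illustration
fasta_text = """>seq_a
ATATATGC
>seq_b
GGCC
ATGC
"""

records = dict(read_fasta(StringIO(fasta_text)))
best_id = max(records, key=lambda rid: gc_percent(records[rid]))
print(best_id, gc_percent(records[best_id]))
```

Using `max` with a key function replaces the manual bookkeeping of the running maximum; both approaches give the same result when there is a single best record.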


# Central dogma
The Central Dogma of Molecular Biology states that DNA makes RNA, and RNA makes protein. Biopython can carry out each step of the central dogma programmatically.
![](/home/alejandrodf1/Documents/bioinformatics/bioinformatics_guide/references/central_dogma.png)

### Biopython
Biopython is a powerful library that combines Python string handling with biological methods. This section introduces some of the methods it provides:
1. count method: a Seq object has a count() method that behaves like the string method. The result is a non-overlapping count.
2. GC function: calculates the GC content, as a percentage, of a given sequence.


In [13]:
from Bio.Seq import Seq
sequence = Seq('AGTACACTGGT')
print(sequence)

AGTACACTGGT


In [18]:
# count() method
sequence = Seq('AAAA')
sequence.count('AA') #non-overlapping count

2

In [19]:
#overlapping count
sequence.count_overlap('AA')

3
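
The difference between the two counts can be reproduced with plain Python strings. This is a sketch of the idea only, not Biopython's implementation, and `count_overlapping` is a hypothetical helper name.

```python
def count_overlapping(sequence: str, sub: str) -> int:
    """Count occurrences of sub in sequence, allowing overlaps (hypothetical helper)."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Resume one position after the last match so overlapping hits are found
        start = sequence.find(sub, start + 1)
    return count


dna = "AAAA"
print(dna.count("AA"))               # non-overlapping: matches at positions 0 and 2
print(count_overlapping(dna, "AA"))  # overlapping: matches at positions 0, 1 and 2
```

The built-in `str.count` skips past each match before searching again, which is why it reports 2 here while the overlapping search reports 3.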

In [21]:
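# GC content function (returns a percentage)
# Note: GC is deprecated in recent Biopython releases in favor of gc_fraction,
# which returns a fraction between 0 and 1
from Bio.SeqUtils import GC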
sequence = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
GC(sequence)

46.875

## Transcription: DNA -> RNA
The actual biological transcription process works from the template strand, doing a reverse complement (TCAG → CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T → U.

In [22]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
template_dna = coding_dna.reverse_complement()
template_dna  #template with direction 5' to 3'

CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT


In [23]:
# transcription: the coding strand with T replaced by U
mess_rna = coding_dna.transcribe()
mess_rna 

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
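
The relationship between the two strands can be checked with plain strings. The sketch below assumes unambiguous uppercase DNA (only A, C, G, T); the function names are illustrative and only mirror the Biopython methods used above.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_dna: str) -> str:
    """Bioinformatics shortcut: the mRNA is the coding strand with T replaced by U."""
    return coding_dna.replace("T", "U")


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
template = reverse_complement(coding)

# Route 1: shortcut from the coding strand
mrna_from_coding = transcribe(coding)
# Route 2: closer to the biology, reverse-complementing the template strand
mrna_from_template = transcribe(reverse_complement(template))

print(template)
print(mrna_from_coding == mrna_from_template)
```

Both routes give the same mRNA, which is why working directly with the coding strand is the usual convention.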

## Translation: RNA -> PROTEIN
The translate() method converts each codon into its amino acid and marks every stop codon in the result with the '*' symbol.

In [24]:
mess_rna.translate() # * = stop codon

Seq('MAIVMGR*KGAR*')

In [25]:
# translation can also be done directly from the coding DNA
coding_dna.translate()

Seq('MAIVMGR*KGAR*')
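
To see what translation does under the hood, here is a minimal sketch of a codon lookup using the standard genetic code. It is not Biopython's implementation: it assumes unambiguous bases, silently ignores trailing bases that do not form a full codon, and the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Translate DNA or RNA codon by codon; '*' marks a stop codon."""
    dna = sequence.upper().replace("U", "T")  # accept RNA as well as DNA
    usable = len(dna) - len(dna) % 3          # drop an incomplete final codon
    # Assumption: only A, C, G, T are present; any other symbol raises KeyError
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding))
print(translate(coding.replace("T", "U")))
```

Because the lookup normalizes U to T first, the coding DNA and its mRNA translate to the same protein string, matching the two cells above.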