# Overview of Genomic Data Types (DNA, RNA, Proteins)

## DNA (Deoxyribonucleic Acid)

* DNA is the molecule that carries genetic information in living organisms and many viruses.
* It consists of two strands that coil around each other to form a double helix.
* The strands are composed of simpler molecules called nucleotides, each containing a sugar, a phosphate group, and a nitrogenous base (adenine, thymine, cytosine, guanine).
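
The two strands are held together by base pairing: adenine pairs with thymine, and cytosine pairs with guanine. A minimal plain-Python sketch of that rule follows; it does not use Biopython, and the names `PAIRING` and `complement_strand` are made up for this illustration.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
PAIRING = str.maketrans("ATCG", "TAGC")


def complement_strand(strand: str) -> str:
    """Return the base-by-base complement of an uppercase DNA strand."""
    return strand.translate(PAIRING)


strand = "AGTCTGACG"
partner = complement_strand(strand)
print(strand)
print(partner)
```

Complementing twice returns the original strand, which is a quick sanity check on the pairing table.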


#### Basic Operations on DNA Sequence


Given the DNA sequence "AGTCTGACGTCGATCGAATTCGCGATCGTACG", perform the following tasks:

* Print the given DNA sequence.
* Calculate and print the complementary sequence.
* Calculate and print the reverse complementary sequence.
* Calculate and print the GC content of the DNA sequence.

In [13]:
from Bio.Seq import Seq
from Bio.SeqUtils import GC123

# Given DNA sequence
dna_seq = Seq("AGTCTGACGTCGATCGAATTCGCGATCGTACG")

# 1. Print the given DNA sequence
print("DNA Sequence:", dna_seq)

# 2. Calculate and print the complementary sequence
comp_seq = dna_seq.complement()
print("Complementary Sequence:", comp_seq)

# 3. Calculate and print the reverse complementary sequence
rev_comp_seq = dna_seq.reverse_complement()
print("Reverse Complementary Sequence:", rev_comp_seq)

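# 4. Calculate and print the GC content
# GC123 returns a tuple: overall GC%, then GC% at codon positions 1, 2 and 3
gc_total, gc_pos1, gc_pos2, gc_pos3 = GC123(dna_seq)
print("GC Content (%):", gc_total)
print("GC Content by Codon Position (%):", (gc_pos1, gc_pos2, gc_pos3))


DNA Sequence: AGTCTGACGTCGATCGAATTCGCGATCGTACG
Complementary Sequence: TCAGACTGCAGCTAGCTTAAGCGCTAGCATGC
Reverse Complementary Sequence: CGTACGATCGCGAATTCGATCGACGTCAGACT
GC Content (%): 53.125
GC Content by Codon Position (%): (45.45454545454545, 45.45454545454545, 70.0)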
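
The overall GC percentage reported above (53.125) can be reproduced by direct counting. This is a small cross-check in plain Python rather than a Biopython call; the variable names are chosen for this example only.

```python
seq = "AGTCTGACGTCGATCGAATTCGCGATCGTACG"

# GC content = (number of G + number of C) / sequence length, as a percentage.
gc_count = seq.count("G") + seq.count("C")
gc_percent = 100 * gc_count / len(seq)

print(gc_count, len(seq), gc_percent)  # 17 32 53.125
```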


#### Transcription and Translation
 

Given the DNA sequence "ATGGCCATTGTAATGGGCCGCTG" (the first 23 bases of "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", and the sequence used in the code and output below), perform the following tasks:

* Transcribe the DNA sequence into RNA.
* Translate the RNA sequence into a protein sequence.
* Identify and print the start and stop codons in the RNA sequence.
* Calculate the molecular weight of the translated protein.

In [18]:
from Bio.Seq import Seq
from Bio.SeqUtils import molecular_weight



# Given DNA sequence
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTG")
# 1. Transcribe the DNA sequence into RNA
rna_seq = dna_seq.transcribe()
print("RNA Sequence:", rna_seq)

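# 2. Translate the RNA sequence into a protein sequence
# The sequence is 23 nt long, which is not a multiple of three, so Biopython
# warns about a partial codon and leaves the trailing bases untranslated
protein_seq = rna_seq.translate()
print("Protein Sequence:", protein_seq)

# 3. Identify and print the start and stop codons in the RNA sequence
# find() returns the first match in any reading frame, so a reported
# position is not necessarily an in-frame codon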
start_codon = "AUG"
stop_codons = ["UAA", "UAG", "UGA"]

start_pos = rna_seq.find(start_codon)
print("Start Codon Position:", start_pos)

for stop_codon in stop_codons:
    stop_pos = rna_seq.find(stop_codon)
    if stop_pos != -1:
        print(f"Stop Codon ({stop_codon}) Position:", stop_pos)

# 4. Calculate the molecular weight of the translated protein
mw = molecular_weight(protein_seq, seq_type='protein')
print("Molecular Weight of the Protein:", mw)


RNA Sequence: AUGGCCAUUGUAAUGGGCCGCUG
Protein Sequence: MAIVMGR
Start Codon Position: 0
Stop Codon (UAA) Position: 10
Molecular Weight of the Protein: 777.0107999999999
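
The stop codon reported at position 10 is not in the reading frame that starts at position 0, since 10 is not a multiple of three. To show what an in-frame translation looks like, here is a minimal sketch in plain Python applied to the full 39-base sequence "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG". It builds the standard genetic code by hand instead of calling Biopython, and `translate_to_stop` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_to_stop(rna: str):
    """Translate frame 0 codon by codon.

    Returns (protein, index of the first in-frame stop codon or None).
    Trailing bases that do not fill a whole codon are ignored.
    """
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            return "".join(protein), i
        protein.append(aa)
    return "".join(protein), None


dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
protein, stop_index = translate_to_stop(dna.replace("T", "U"))
print(protein, stop_index)  # MAIVMGR 21
```

The first in-frame stop is the UGA at index 21, so the translated product is the same seven residues that the truncated 23-base sequence yields.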


#### Reading and Writing DNA Sequences
 
Perform the following tasks:

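* Read a DNA sequence from a FASTA file (example.fasta in the task statement; the code below uses ./data/rcsb_pdb_4CS4.fasta).
* Print the sequence ID and the DNA sequence.
* Write the reverse complementary sequence to a new FASTA file reverse_complement.fasta.

Note: as the output shows, rcsb_pdb_4CS4.fasta holds a protein chain rather than DNA, so the "DNA Sequence" label and the reverse complement are only meaningful once a nucleotide FASTA file is substituted.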

In [20]:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# 1. Read a DNA sequence from the provided FASTA file
record = SeqIO.read("./data/rcsb_pdb_4CS4.fasta", "fasta")

# 2. Print the sequence ID and the DNA sequence
print("Sequence ID:", record.id)
print("DNA Sequence:", record.seq)

# 3. Write the reverse complementary sequence to a new FASTA file
rev_comp_seq = record.seq.reverse_complement()
seq_record = SeqRecord(rev_comp_seq, id=record.id, description="Reverse Complementary Sequence")
# SeqIO.write returns the number of records written, which the notebook echoes as 1
SeqIO.write(seq_record, "reverse_complement.fasta", "fasta")


Sequence ID: 4CS4_1|Chain
DNA Sequence: MPALTKSQTDRLEVLLNPKDEISLNSGKPFRELESELLSRRKKDLQQIYAEERENYLGKLEREITRFFVDRGFLEIKSPILIPLEYIERMGIDNDTELSKQIFRVDKNFCLRPMLAPNLANYLRKLDRALPDPIKIFEIGPCYRKESDGKEHLEEFTMLNFCQMGSGCTRENLESIITDFLNHLGIDFKIVGDSCMVFGDTLDVMHGDLELSSAVVGPIPLDREWGIDKPWIGAGFGLERLLKVKHDFKNIKRAARSESYYNGISTNLHHHHHH


1
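
Because a FASTA file can hold either nucleotides or amino acids, it can help to check the alphabet before taking a reverse complement. The sketch below parses a single-record FASTA by hand, without `SeqIO`, and tests whether the sequence uses only A, C, G and T. The function names and the two in-memory records are hypothetical stand-ins for real files.

```python
import io


def read_single_fasta(handle):
    """Return (header, sequence) for a FASTA stream holding exactly one record."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("more than one record found")
            header = line[1:]
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def looks_like_dna(sequence: str) -> bool:
    """True if the sequence is non-empty and uses only unambiguous DNA bases."""
    return bool(sequence) and set(sequence.upper()) <= set("ACGT")


# Hypothetical in-memory records standing in for files on disk.
dna_header, dna_seq = read_single_fasta(io.StringIO(">demo_dna\nAGTCTGAC\nGTCG\n"))
prot_header, prot_seq = read_single_fasta(io.StringIO(">demo_protein\nMPALTKSQ\n"))

print(dna_header, looks_like_dna(dna_seq))
print(prot_header, looks_like_dna(prot_seq))
```

A check like this is deliberately strict: a real nucleotide file with ambiguity codes such as N would also fail it, so the accepted alphabet may need widening in practice.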

## Proteins


#### Basic Protein Sequence Manipulation
* Task: Create a protein sequence "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG" using Biopython. Print the sequence, its length, and the count of each amino acid in the sequence.

In [1]:
from Bio.Seq import Seq

# Create a protein sequence
protein_seq = Seq("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG")

# Print the sequence
print("Protein Sequence:", protein_seq)

# Print the length of the sequence
print("Length of the Sequence:", len(protein_seq))

# Print the count of each amino acid
amino_acid_counts = {aa: protein_seq.count(aa) for aa in set(protein_seq)}
print("Amino Acid Counts:", amino_acid_counts)


Protein Sequence: MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
Length of the Sequence: 60
Amino Acid Counts: {'S': 2, 'N': 1, 'K': 3, 'C': 1, 'F': 1, 'R': 1, 'L': 6, 'V': 7, 'T': 5, 'P': 1, 'Y': 3, 'E': 4, 'Q': 3, 'M': 1, 'G': 6, 'D': 6, 'I': 5, 'A': 3, 'H': 1}
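
Iterating over a `set` gives no guaranteed order, so the dictionary above can list residues differently from run to run. One alternative, sketched here with the standard library's `collections.Counter` on a plain string, is to count once and sort the result for a reproducible report.

```python
from collections import Counter

protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG"
counts = Counter(protein)

# Sort alphabetically so the report is the same on every run.
for residue, n in sorted(counts.items()):
    print(residue, n)
```

`Counter` also returns 0 for residues that never occur (tryptophan, W, in this sequence) instead of raising a `KeyError`.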


#### Molecular Weight Calculation
* Task: Using the protein sequence from Basic Protein Sequence Manipulation, calculate and print the molecular weight of the protein.

In [2]:
from Bio.SeqUtils import molecular_weight

# Calculate the molecular weight of the protein
mw = molecular_weight(protein_seq, seq_type='protein')
print("Molecular Weight of the Protein:", mw)


Molecular Weight of the Protein: 6543.3415


#### Hydrophobicity Calculation
* Task: Using the protein sequence from Basic Protein Sequence Manipulation, calculate and print the Grand Average of Hydropathy (GRAVY) score for the protein.

In [3]:
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# Analyze the protein sequence
analysed_protein = ProteinAnalysis(str(protein_seq))

# Calculate the hydrophobicity (GRAVY)
hydrophobicity = analysed_protein.gravy()
print("Protein Hydrophobicity (GRAVY):", hydrophobicity)


Protein Hydrophobicity (GRAVY): 0.09833333333333334
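
GRAVY is the mean hydropathy value over all residues. As a sketch of the arithmetic, the block below averages the widely used Kyte-Doolittle values by hand; for this sequence it reproduces the figure printed above. The `KYTE_DOOLITTLE` table and the `gravy` helper are written out for this illustration and are not taken from Biopython.

```python
# Kyte-Doolittle hydropathy value for each of the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy over all residues in the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG"
score = gravy(protein)
print(round(score, 4))
```

The hydropathy values sum to 5.9 over 60 residues, giving a slightly positive score of about 0.098.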


#### Secondary Structure Prediction
* Task: Write a function that assigns a secondary structure label (Helix, Sheet, or Coil) to each residue of a given protein sequence. For simplicity, use a fixed per-residue dictionary (a toy lookup, not a real predictor), treat unlisted residues as Coil, and print the predicted structure for each residue in the protein sequence from Basic Protein Sequence Manipulation.

In [4]:
# Dictionary for secondary structure prediction (example)
secondary_structure = {"A": "Helix", "V": "Sheet", "I": "Helix", "L": "Sheet", 
                       "M": "Helix", "F": "Sheet", "W": "Helix", "Y": "Sheet",
                       "S": "Coil", "T": "Coil", "N": "Coil", "Q": "Coil"}

# Function to predict secondary structure
def predict_secondary_structure(seq):
    return [secondary_structure.get(residue, 'Coil') for residue in seq]

# Predict and print the secondary structure
predicted_structure = predict_secondary_structure(protein_seq)
print("Predicted Secondary Structure:", predicted_structure)


Predicted Secondary Structure: ['Helix', 'Coil', 'Coil', 'Sheet', 'Coil', 'Sheet', 'Sheet', 'Sheet', 'Sheet', 'Coil', 'Helix', 'Coil', 'Coil', 'Sheet', 'Coil', 'Coil', 'Coil', 'Helix', 'Sheet', 'Coil', 'Helix', 'Coil', 'Sheet', 'Helix', 'Coil', 'Coil', 'Coil', 'Sheet', 'Sheet', 'Coil', 'Coil', 'Sheet', 'Coil', 'Coil', 'Coil', 'Helix', 'Coil', 'Coil', 'Coil', 'Sheet', 'Coil', 'Coil', 'Coil', 'Sheet', 'Sheet', 'Helix', 'Coil', 'Coil', 'Coil', 'Coil', 'Coil', 'Sheet', 'Sheet', 'Coil', 'Helix', 'Sheet', 'Coil', 'Coil', 'Helix', 'Coil']


#### Protein Sequence Alignment
* Task: Align the protein sequence from Basic Protein Sequence Manipulation with another protein sequence "MTEYKLVVVGAGDVGGKSAQTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG" using global alignment. Print the alignment and the alignment score.

In [5]:
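# pairwise2 is deprecated in recent Biopython releases in favor of Bio.Align.PairwiseAligner
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

# Another protein sequence (Seq was imported in an earlier cell)
protein_seq_2 = Seq("MTEYKLVVVGAGDVGGKSAQTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG")

# Perform global alignment
# globalxx scores 1 per match, 0 per mismatch, and applies no gap penalties
alignments = pairwise2.align.globalxx(protein_seq, protein_seq_2)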

# Print the alignment and the alignment score
for alignment in alignments:
    print(format_alignment(*alignment))


MTEYKLVVVGAGG-VG-KSAL-TIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
||||||||||| | || |||  |||||||||||||||||||||||||||||||||||||||||
MTEYKLVVVGA-GDVGGKSA-QTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
  Score=58

MTEYKLVVVGAGG-VG-KSAL-TIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
||||||||||||  || |||  |||||||||||||||||||||||||||||||||||||||||
MTEYKLVVVGAG-DVGGKSA-QTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
  Score=58

MTEYKLVVVGAGGVG-KSAL-TIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
||||||||||||.|| |||  |||||||||||||||||||||||||||||||||||||||||
MTEYKLVVVGAGDVGGKSA-QTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
  Score=58

MTEYKLVVVGAGG-V-GKSAL-TIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
||||||||||| | | ||||  |||||||||||||||||||||||||||||||||||||||||
MTEYKLVVVGA-GDVGGKSA-QTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
  Score=58

MTEYKLVVVGAGG-V-GKSAL-TIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG
||||||||||||  | ||||  |||||||||||||||||||||||||||||||||||||||||
MTEYKLVVVGAG-DVGGKSA-QTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDIL
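
With one point per match and nothing deducted for mismatches or gaps, the best global score equals the length of the longest common subsequence of the two sequences. The dynamic-programming sketch below computes only that score in plain Python, as a cross-check on the value of 58; `globalxx_score` is a hypothetical name and the function does not reconstruct the alignments themselves.

```python
def globalxx_score(a: str, b: str) -> int:
    """Best global alignment score with match = 1, mismatch = 0, gap = 0."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            # Either align ch_a with ch_b, or leave one of them against a gap.
            diagonal = previous[j - 1] + (1 if ch_a == ch_b else 0)
            current.append(max(diagonal, previous[j], current[j - 1]))
        previous = current
    return previous[-1]


seq_1 = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG"
seq_2 = "MTEYKLVVVGAGDVGGKSAQTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG"
print(globalxx_score(seq_1, seq_2))  # 58
```

Because gaps are free under this scheme, many different alignments reach the same score, which is why the output above lists several equally good ones.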



## RNA


#### Basic RNA Sequence Manipulation
Question 1: Create an RNA sequence and print its length. Convert it to a DNA sequence and print the DNA sequence.







In [6]:
from Bio.Seq import Seq

# Create an RNA sequence
rna_seq = Seq("AUGCUACGAUCG")
print("RNA Sequence:", rna_seq)

# Print the length of the RNA sequence
print("Length of RNA Sequence:", len(rna_seq))

# Convert RNA to DNA
dna_seq = rna_seq.back_transcribe()
print("Converted DNA Sequence:", dna_seq)


RNA Sequence: AUGCUACGAUCG
Length of RNA Sequence: 12
Converted DNA Sequence: ATGCTACGATCG


Question 2: Transcribe a given DNA sequence into RNA. Print both the original DNA sequence and the transcribed RNA sequence.


In [7]:
from Bio.Seq import Seq

# Given DNA sequence
dna_seq = Seq("ATGCTACGATCG")
print("Original DNA Sequence:", dna_seq)

# Transcribe DNA to RNA
rna_seq = dna_seq.transcribe()
print("Transcribed RNA Sequence:", rna_seq)


Original DNA Sequence: ATGCTACGATCG
Transcribed RNA Sequence: AUGCUACGAUCG


Question 3: Translate a given RNA sequence into a protein sequence. Print the RNA sequence and the corresponding protein sequence.


In [8]:
from Bio.Seq import Seq

# Given RNA sequence
rna_seq = Seq("AUGCUACGAUCG")
print("RNA Sequence:", rna_seq)

# Translate RNA to Protein
protein_seq = rna_seq.translate()
print("Protein Sequence:", protein_seq)


RNA Sequence: AUGCUACGAUCG
Protein Sequence: MLRS


Question 4: Find the positions of start and stop codons in an RNA sequence. Print the position of the first occurrence of the start codon (AUG) and of each stop codon (UAA, UAG, UGA) that is present.

In [9]:
from Bio.Seq import Seq

# Given RNA sequence
rna_seq = Seq("AUGCUACGAUCGAUGAAUUAG")

# Find start codon position
start_codon = "AUG"
start_pos = rna_seq.find(start_codon)
print("Start Codon (AUG) Position:", start_pos)

# Find stop codon positions (find() reports the first occurrence in any reading frame)
stop_codons = ["UAA", "UAG", "UGA"]
for stop_codon in stop_codons:
    stop_pos = rna_seq.find(stop_codon)
    if stop_pos != -1:
        print(f"Stop Codon ({stop_codon}) Position:", stop_pos)


Start Codon (AUG) Position: 0
Stop Codon (UAG) Position: 18
Stop Codon (UGA) Position: 13
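
A position alone does not say whether a codon lies in the reading frame of the start codon. The sketch below lists every occurrence together with its frame (position modulo three), in plain Python; `codon_hits` is a hypothetical helper name.

```python
def codon_hits(rna: str, codons):
    """List (position, codon, frame) for every occurrence, overlaps included."""
    hits = []
    for i in range(len(rna) - 2):
        triplet = rna[i:i + 3]
        if triplet in codons:
            hits.append((i, triplet, i % 3))
    return hits


rna = "AUGCUACGAUCGAUGAAUUAG"
starts = codon_hits(rna, {"AUG"})
stops = codon_hits(rna, {"UAA", "UAG", "UGA"})

print(starts)  # [(0, 'AUG', 0), (12, 'AUG', 0)]
print(stops)   # [(13, 'UGA', 1), (18, 'UAG', 0)]
```

Here the UGA at position 13 sits in frame 1, so it does not terminate the frame that begins at position 0; the in-frame stop is the UAG at position 18.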


Question 5: Calculate and print the GC content of a given RNA sequence. The GC content is the percentage of guanine (G) and cytosine (C) bases in the sequence.

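In [11]:
from Bio.Seq import Seq

# Given RNA sequence
rna_seq = Seq("AUGCUACGAUCG")

# Calculate GC content as the percentage of G and C bases.
# GC123 is avoided here: it tallies only A, T, G and C, so the U bases
# would drop out of the denominator and inflate the result.
gc_content = 100 * (rna_seq.count("G") + rna_seq.count("C")) / len(rna_seq)
print("GC Content (%):", gc_content)


GC Content (%): 50.0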
