**Practicing Biopython (SeqIO) for DNA and Protein FASTA Sequence Analysis**

---


Abu Reza

Biotechnology and Genetic Engineering

Islamic University, Kushtia-7003, Bangladesh

#Download FASTA using Biopython

In [6]:
!pip install biopython # The leading exclamation mark (!) runs the line as a shell command from the notebook

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84


In [7]:
pip show biopython

Name: biopython
Version: 1.84
Summary: Freely available tools for computational molecular biology.
Home-page: https://biopython.org/
Author: The Biopython Contributors
Author-email: biopython@biopython.org
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy
Required-by: 


In [None]:
# prompt: download fasta using python from NCBI, (just use accession number at last)

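from Bio import Entrez
from Bio import SeqIO

# NCBI asks Entrez users to identify themselves; this address is a placeholder, replace it with your own.
Entrez.email = "your.name@example.com"

def download_fasta(accession_number, file_name):
    """Fetch one nucleotide record from NCBI and save it as a FASTA file."""
    try:
        handle = Entrez.efetch(db="nucleotide", id=accession_number, rettype="fasta", retmode="text")
        try:
            record = SeqIO.read(handle, "fasta")
        finally:
            handle.close()  # close the connection even if parsing fails
        SeqIO.write(record, file_name, "fasta")
        print(f"FASTA file for {accession_number} downloaded successfully as {file_name}")
    except Exception as e:
        print(f"An error occurred: {e}")

download_fasta("NG_047551", "Reza.fasta")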

FASTA file for NG_047551 downloaded successfully as Reza.fasta


#Single FASTA Read (DNA) using SeqIO

In [8]:
filepath="/content/sequence.fasta"  #select path

In [9]:
from Bio import SeqIO #importing SeqIO from Biopython

In [10]:
seq_object = SeqIO.read(filepath, "fasta")

In [None]:
print(seq_object)

ID: NG_047557.1
Name: NG_047557.1
Description: NG_047557.1 Staphylococcus aureus N315 bleO gene for bleomycin binding protein, complete CDS
Number of features: 0
Seq('CGGGCCATTTTGCGTAATAAGAAAAAGGATTAATTATGAGCGAATTGAATTAAT...GTT')


In [11]:
print(type(seq_object))

<class 'Bio.SeqRecord.SeqRecord'>


In [12]:
seq_id=seq_object.id
print(seq_id)

NG_047557.1


In [None]:
seq_name=seq_object.name
print(seq_name)

NG_047557.1


In [13]:
seq_description=seq_object.description
print(seq_description)

NG_047557.1 Staphylococcus aureus N315 bleO gene for bleomycin binding protein, complete CDS


In [14]:
sequence=seq_object.seq
print(sequence)

CGGGCCATTTTGCGTAATAAGAAAAAGGATTAATTATGAGCGAATTGAATTAATAATAAGGTAATAGATTTACATTAGAAAATGAAAGGGGATTTTATGCGTGAGAATGTTACAGTCTATCCCGGCATTGCCAGTCGGGGATATTAAAAAGAGTATAGGTTTTTATTGCGATAAACTAGGTTTCACTTTGGTTCACCATGAAGATGGATTCGCAGTTCTAATGTGTAATGAGGTTCGGATTCATCTATGGGAGGCAAGTGATGAAGGCTGGCGCTCTCGTAGTAATGATTCACCGGTTTGTACAGGTGCGGAGTCGTTTATTGCTGGTACTGCTAGTTGCCGCATTGAAGTAGAGGGAATTGATGAATTATATCAACATATTAAGCCTTTGGGCATTTTGCACCCCAATACATCATTAAAAGATCAGTGGTGGGATGAACGAGACTTTGCAGTAATTGATCCCGACAACAATTTGATTAGCTTTTTTCAACAAATAAAAAGCTAAAATCTATTATTAATCTGTTCAGCAATCGGGCGCGATTGCTGAATAAAAGATACGAGAGACCTCTCTTGTATCTTTTTTATTTTGAGTGGTTTTGTCCGTT


In [16]:
length_seq=len(sequence)
print(length_seq)

605


Counting bases and their percentages

In [17]:
A = sequence.count("A")
T = sequence.count("T")
G = sequence.count("G")
C = sequence.count("C")
A_percentage = (A/length_seq)*100
print(f"Total A is {A} and Percentage of A in your FASTA is {round(A_percentage, 2)}%")
T_percentage = (T/length_seq)*100
print(f"Total T is {T} and Percentage of T in your FASTA is {round(T_percentage, 2)}%")
G_percentage = (G/length_seq)*100
print(f"Total G is {G} and Percentage of G in your FASTA is {round(G_percentage, 2)}%")
C_percentage = (C/length_seq)*100
print(f"Total C is {C} and Percentage of C in your FASTA is {round(C_percentage, 2)}%")



Total A is 181 and Percentage of A in your FASTA is 29.92%
Total T is 195 and Percentage of T in your FASTA is 32.23%
Total G is 141 and Percentage of G in your FASTA is 23.31%
Total C is 88 and Percentage of C in your FASTA is 14.55%
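
The four separate `count` calls above can also be written as a single pass over the sequence. The following is a minimal sketch, not part of the original notebook: the helper name `base_composition` and the toy sequence are assumptions made for illustration. It uses only the standard library, lowercases nothing away silently (input is uppercased first), and also reports unexpected symbols such as `N`, which the A/T/G/C counts would otherwise miss.

```python
from collections import Counter

def base_composition(sequence):
    """Return {symbol: (count, percent)} for every symbol in the sequence."""
    text = str(sequence).upper()   # str() also accepts a Biopython Seq object
    counts = Counter(text)         # counts every symbol in one pass
    total = len(text)
    return {
        symbol: (count, round(count / total * 100, 2))
        for symbol, count in sorted(counts.items())
    }

# Toy sequence (hypothetical, not taken from the FASTA file); "N" marks an unknown base.
composition = base_composition("AAGTCNaa")
for symbol, (count, percent) in composition.items():
    print(f"{symbol}: {count} ({percent}%)")
```

In the notebook, the same helper could be called as `base_composition(sequence)` on the `Seq` object read earlier.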


GC content

In [18]:
GC=sequence.count("C")+sequence.count("G")
print(f"Total G and C in your FASTA is: {GC}")
content = (GC/length_seq)*100
print(f"GC content is: {round(content, 2)}%")

Total G and C in your FASTA is: 229
GC content is: 37.85%
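
Because the same GC calculation is needed again for other sequences, it may help to wrap it in a function. This is a minimal sketch under stated assumptions: the name `gc_content` is hypothetical, and the synthetic string below only reuses the base counts reported above rather than the real sequence. Biopython 1.80 and later also ships `Bio.SeqUtils.gc_fraction`, which returns a fraction between 0 and 1 instead of a percentage.

```python
def gc_content(sequence):
    """Return GC content as a percentage (0-100); 0.0 for an empty sequence."""
    text = str(sequence).upper()   # accept lowercase input and Biopython Seq objects
    if not text:
        return 0.0                 # avoid dividing by zero
    gc = text.count("G") + text.count("C")
    return gc / len(text) * 100

# Synthetic string with the base counts reported above (A=181, T=195, G=141, C=88).
synthetic = "A" * 181 + "T" * 195 + "G" * 141 + "C" * 88
print(f"GC content is: {round(gc_content(synthetic), 2)}%")
```

The guard for an empty sequence is a design choice of this sketch; raising an error instead would be equally reasonable.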


#Single FASTA Read (Protein) using SeqIO

In [19]:
path_pdb="/content/7GXH.fasta"
from Bio import SeqIO
protein_main=SeqIO.read(path_pdb, "fasta")
print(protein_main)

ID: 7GXH_1|Chain
Name: 7GXH_1|Chain
Description: 7GXH_1|Chain A|B-cell lymphoma 6 protein|Homo sapiens (9606)
Number of features: 0
Seq('GPGADSCIQFTRHASDVLLNLNRLRSRDILTDVVIVVSREQFRAHKTVLMACSG...ASE')


In [20]:
fasta_id=protein_main.id
print(fasta_id)

7GXH_1|Chain


In [21]:
fasta_name=protein_main.name
print(fasta_name)

7GXH_1|Chain


In [22]:
fasta_description=protein_main.description
print(fasta_description)

7GXH_1|Chain A|B-cell lymphoma 6 protein|Homo sapiens (9606)


In [23]:
protein_seq=protein_main.seq
print(protein_seq)

GPGADSCIQFTRHASDVLLNLNRLRSRDILTDVVIVVSREQFRAHKTVLMACSGLFYSIFTDQLKCNLSVINLDPEINPEGFCILLDFMYTSRLNLREGNIMAVMATAMYLQMEHVVDTCRKFIKASE


In [24]:
length=len(protein_seq)
print(length)

128


In [25]:
P = protein_seq.count("P")
print(f"Total proline in your protein sequence is: {P}")

Total proline in your protein sequence is: 3
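
Counting one residue at a time extends naturally to ranking all residues by frequency. The sketch below is an illustration only: `residue_ranking` is a hypothetical helper, and the fragment is the first 20 residues of the chain A sequence printed above, used so the example stays short.

```python
from collections import Counter

def residue_ranking(protein_sequence, top=3):
    """Return the `top` most frequent residues as (letter, count) pairs."""
    return Counter(str(protein_sequence).upper()).most_common(top)

# First 20 residues of the 7GXH chain A sequence printed above.
fragment = "GPGADSCIQFTRHASDVLLN"
print(residue_ranking(fragment, top=5))
```

In the notebook, `residue_ranking(protein_seq)` would rank the residues of the full 128-residue sequence.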


#Multiple FASTA Read (DNA)

In [34]:
from Bio import SeqIO
import pandas as pd #previously installed packages
import os


In [37]:
path="/content/sequences.fasta"
seq_objects = SeqIO.parse(path, "fasta")
sequences = []
for seq in seq_objects:
  sequences.append(seq)
  print("\n")
  print(seq)



ID: AF210055.1
Name: AF210055.1
Description: AF210055.1 Staphylococcus aureus strain CMRSA-1 accessory gene regulator locus, partial sequence
Number of features: 0
Seq('ACCAGTTTGCCACGTATCTTCAAAAGAGAAATAACTTAGATCATATTCAATTTT...TTA')


ID: U89396.1
Name: U89396.1
Description: U89396.1 Staphylococcus aureus hemCDBL gene cluster: porphobilinogen deaminase (hemC), uroporphyrinogen III synthase (hemD), d-aminolevulinic acid dehydratase (hemB) and GSA-1-aminotransferase (hemL) genes, complete cds
Number of features: 0
Seq('CGTATATTCATTGACCCGAAGGTAATACTCTCTTCAATTATTACAGTATTATAT...ACT')


ID: U41072.1
Name: U41072.1
Description: U41072.1 Staphylococcus aureus isoleucyl-tRNA synthetase (ileS) gene, partial cds
Number of features: 0
Seq('GGTAGAGCACACATTAACTGACTTAGGTGGTAAAACTAAAGAAGATAAAACGCA...GGG')


ID: M37888.1
Name: M37888.1
Description: M37888.1 Staphylococcus aureus resistance protein (qacD) gene, complete cds
Number of features: 0
Seq('AGATCTGCGGTTCTTTTTATATAGAGCGTAAATACATTCAATGCCTTTGAGT

In [39]:
len(sequences)

10

In [40]:
first_seq=sequences[0]
print(first_seq)

ID: AF210055.1
Name: AF210055.1
Description: AF210055.1 Staphylococcus aureus strain CMRSA-1 accessory gene regulator locus, partial sequence
Number of features: 0
Seq('ACCAGTTTGCCACGTATCTTCAAAAGAGAAATAACTTAGATCATATTCAATTTT...TTA')


In [50]:
seq_id=first_seq.id
seq_des=first_seq.description
print(f"Here is the sequence id: {seq_id} and Here is the sequence description for accession number {seq_id}: {seq_des}")


Here is the sequence id: AF210055.1 and Here is the sequence description for accession number AF210055.1: AF210055.1 Staphylococcus aureus strain CMRSA-1 accessory gene regulator locus, partial sequence


In [54]:
first_len = len(first_seq)
print(f"The length of the first sequence is: {first_len}")

The length of the first sequence is: 1891


In [56]:
first_seq_seq=first_seq.seq
print(first_seq_seq)
GC = first_seq_seq.count("G") + first_seq_seq.count("C")
print(f"Total G and C count is: {GC}")
gc_content = (GC/first_len)*100
print(f"GC content is: {round(gc_content, 2)}%")

ACCAGTTTGCCACGTATCTTCAAAAGAGAAATAACTTAGATCATATTCAATTTTTGCAAGTACGATTAGGGATGCAGGTCTTAGCTAAAAATATAGGAAAACTGATTGTTATGTATACGATTGCCTATATTTTAAACATTTTTCTCTTTACGTTAATTACGAATTTAACATTTTATTTAATAAGAAGACATGCTCATGGTGCCCATGCACCTTCTTCCTTTTGGTGTTATGTAGAAAGTATTATATTATTTATACTTTTACCTTTAGTAATAGTAAATTTTCATATTAACTTTTTAATTATGACTATTTTGACGTTTATTGCTATTGGTTTGATAATTAAATATGCTCCTGCAGCAACTAAAAAGAAACCTATTCCCGTTCGACTTATTAAGCGAAAAAAATATTACGCAATTATTGTTAGTTTAATCTTTTTCATTATCACACTTATCATCAAAGAACCATTTGCCCAATTTATTCAATTAGGCATCATAATAGAAGCCATTACTTTACTACCTATTTTCTTTATTAAGGAGGACTTAAAATGAATACATTATTTAATTTATTTTTTGATTTTATTACTGGGATTTTAAAAAACATTGGTAACATCGCAGCTTATAGTACATGCGACTTCATAATGGATGAAGTTGAAGTACCTAAAGAATTAACTCAATTACACGAATAATAAAAATAGAAAGTGTGATAGTAGGTGGAATTATTAAATAGTTATAATTTTGTTTTATTCGTATTAACTCAAATGATATTAATGTTTACAATACCAGCTATAATTAGTGGTATTAAGTACAGTAAACTTGATTATTTTTTCATTATAGGAATTTCGACATTATCGTTATTTCTATTTAAAATGTTTGATAGCGCGTCCTTAATCATTTTAACTTCATTCATTATTATAATGTATTTTGTCAAAATCAAATGGTATTCTATTTTGTTGATTATGACTTCACAGATTATTCTGTACTGTGCTAACTACATGTATATAGTT
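
The steps applied to the first record (length, G+C count, GC percentage) can be repeated for every record in the file. The following is a rough sketch, assuming a hypothetical helper `summarize_records` that takes `(identifier, sequence)` pairs; the toy records and their identifiers are invented for the example and do not come from `sequences.fasta`.

```python
def summarize_records(records):
    """Build one summary row per (identifier, sequence) pair."""
    rows = []
    for identifier, sequence in records:
        text = str(sequence).upper()   # str() also accepts a Biopython Seq object
        gc = text.count("G") + text.count("C")
        gc_percent = round(gc / len(text) * 100, 2) if text else 0.0
        rows.append({"id": identifier, "length": len(text), "gc_percent": gc_percent})
    return rows

# Toy records with hypothetical identifiers; in the notebook the pairs could come
# from ((rec.id, rec.seq) for rec in sequences).
toy_records = [("seq1", "ATGC"), ("seq2", "AATT"), ("seq3", "GGCCA")]
for row in summarize_records(toy_records):
    print(row)
```

Since pandas is already imported in this section, the returned list of dictionaries could be passed to `pd.DataFrame(rows)` to view the summary as a table.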

#Random Python Code Practice

In [None]:
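Name = "Abu Reza"  # assumed value: the cell that defined Name is not shown; this one reproduces the output below
A = Name.count("A")
a = Name.count("a")
A = A + a
print(f"Total A in your name is: {A}")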

Total A in your name is: 2


In [None]:
name_input = input("Please enter your name: ")
print(f"Your name is: {name_input}")
# Count "A" and "a" together by lowercasing the name first
total_a = name_input.lower().count("a")
print(f"Total A in your name is: {total_a}")

Please enter your name: Abu Reza Bashar
Your name is: Abu Reza Bashar
Total A in your name is: 4
