<a href="https://colab.research.google.com/github/etemadism/Courses/blob/main/simple_alignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Note:**

This notebook is a simple example of pairwise alignment with the BLAST+ software package. It is intended for the pairwise alignment session of a Bioinformatics class.
It will be updated each session.

**A. Etemadi**

Tehran University of Medical Sciences

# Step 1: Install the BLAST+ software package

In [1]:
import os

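# Download the BLAST+ 2.15.0 package. The versioned directory is used here because
# LATEST only holds the newest release, so a pinned file name stops resolving there.
os.system('wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.15.0/ncbi-blast-2.15.0+-x64-linux.tar.gz')

# Extract the downloaded tar.gz file (Colab's working directory is /content)
os.system('tar xvzf /content/ncbi-blast-2.15.0+-x64-linux.tar.gz')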


0

In [2]:
# Test whether BLAST+ is installed by requesting the help message of the blastn executable.
# os.system returns the exit status, so a value of 0 means the command ran successfully.
os.system('/content/ncbi-blast-2.15.0+/bin/blastn -h')

0
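
The bare `0` shown after each cell above is only the exit status that `os.system` returns; the text printed by the command is not captured. As a minimal sketch of an alternative, the helper below uses `subprocess.run` to collect the exit status together with the output. The name `run_command` is hypothetical, and the demonstration calls the Python interpreter itself so that it runs without BLAST+ installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as a list of arguments.

    Returns (exit_code, stdout, stderr), with both streams decoded as text.
    """
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Demonstration with the Python interpreter, so the sketch runs anywhere.
code, out, err = run_command([sys.executable, "-c", "print('hello')"])
print(code, out.strip())

# Assumed use with the BLAST+ path from the cells above (not executed here):
# code, out, err = run_command(["/content/ncbi-blast-2.15.0+/bin/blastn", "-version"])
```

A non-zero exit code, or text in `stderr`, is usually the quickest hint that the download or extraction step did not complete.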

# Step 2: Install Entrez Direct (EDirect)

Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a Unix terminal window. Search terms are entered as command-line arguments. Individual operations are connected with Unix pipes to construct multi-step queries. Selected records can then be retrieved in a variety of formats.

In [3]:
# More information: https://www.ncbi.nlm.nih.gov/books/NBK179288/
import os

# Install Entrez Direct
os.system('sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"')

# Set PATH for the current session
os.environ['PATH'] = f"{os.environ['HOME']}/edirect:{os.environ['PATH']}"
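
Because the `PATH` change applies only to the current session, it can help to confirm that the EDirect tools are actually visible before using them. The following is a sketch under stated assumptions: the helper names `prepend_to_path` and `missing_tools` are hypothetical, and `/root/edirect` is only an example directory used with a throwaway dictionary rather than the real environment.

```python
import os
import shutil


def prepend_to_path(directory, env=os.environ):
    """Put `directory` at the front of PATH unless it is already listed."""
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        env["PATH"] = os.pathsep.join([directory] + parts)
    return env["PATH"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Demonstration on a plain dictionary, so the real environment is untouched.
demo_env = {"PATH": "/usr/bin"}
prepend_to_path("/root/edirect", demo_env)
print(demo_env["PATH"])

# After installing EDirect, an empty list here means both tools were found.
print(missing_tools(["efetch", "esearch"]))
```

Calling `prepend_to_path` twice with the same directory leaves `PATH` unchanged, which avoids the entry being duplicated each time the cell is re-run.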


# Pairwise alignment


## Perform protein pairwise alignment


In [50]:
!efetch -db protein -format fasta -id NP_000509

>NP_000509.1 hemoglobin subunit beta [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
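
The FASTA record above has a header line starting with `>` followed by the sequence wrapped over several lines. As a small illustration of that layout, the sketch below parses the same record in plain Python and reports its accession and length. The function name `parse_fasta` is hypothetical, and the parser is deliberately minimal rather than a replacement for a dedicated library.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


fasta_text = """>NP_000509.1 hemoglobin subunit beta [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
"""

header, sequence = parse_fasta(fasta_text)[0]
accession = header.split()[0]
print(accession, len(sequence))  # NP_000509.1 147
```

The length of 147 residues matches the "147 total letters" that BLAST reports for this sequence later in the notebook.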


In [47]:
# Fetch sequences from the NCBI protein database by their accession IDs
# and save each one as a FASTA file.
!efetch -db protein -format fasta -id NP_000509 > NP_000509.fasta
!efetch -db protein -format fasta -id NP_001188320 > NP_001188320.fasta
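
Shell redirection with `>` creates the output file even when the command before it fails, so a failed fetch can leave an empty FASTA file behind. A quick sanity check before running BLAST might look like the sketch below. The helper name `check_fasta_file` is hypothetical, and the demonstration writes a temporary file instead of relying on the files saved above.

```python
import os
import tempfile


def check_fasta_file(path):
    """Return the number of records in a FASTA file.

    Raises ValueError if the file is missing, empty, or does not start
    with a '>' header line.
    """
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        raise ValueError(f"{path} is missing or empty")
    with open(path) as handle:
        lines = [line for line in handle if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError(f"{path} does not look like a FASTA file")
    return sum(1 for line in lines if line.startswith(">"))


# Demonstration with a temporary file holding one short record.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w") as handle:
        handle.write(">demo_record\nMVHLTPEEK\n")
    record_count = check_fasta_file(demo_path)
    print(record_count)

# Assumed use with the files saved above (not executed here):
# check_fasta_file("NP_000509.fasta")
```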


In [49]:
# Show the usage, options, and parameters available for the blastp command.
!/content/ncbi-blast-2.15.0+/bin/blastp -h

USAGE
  blastp [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
    [-negative_taxidlist filename] [-no_taxid_expansion] [-ipglist filename]
    [-negative_ipglist filename] [-entrez_query entrez_query]
    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
    [-subject subject_input_file] [-subject_loc range] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty]
    [-qcov_hsp_perc float_value] [-max_hsps int_value]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-thre
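
Several options from the usage text above, such as `-query`, `-subject`, `-evalue`, and `-out`, can be combined in one call. The sketch below only assembles the argument list and prints it, so it runs without BLAST+ installed. The function name `build_blast_command`, the output file name `blastp_result.txt`, and the E-value threshold are illustrative assumptions, not settings used elsewhere in this notebook.

```python
import shlex


def build_blast_command(program, query, subject, evalue=None, out=None):
    """Assemble a BLAST+ command line as a list of arguments."""
    cmd = [program, "-query", query, "-subject", subject]
    if evalue is not None:
        cmd += ["-evalue", str(evalue)]
    if out is not None:
        cmd += ["-out", out]
    return cmd


cmd = build_blast_command(
    "/content/ncbi-blast-2.15.0+/bin/blastp",
    query="NP_001188320.fasta",
    subject="NP_000509.fasta",
    evalue=1e-5,               # illustrative threshold
    out="blastp_result.txt",   # hypothetical output file
)
print(shlex.join(cmd))

# The list can be passed directly to subprocess.run(cmd) once BLAST+ is installed.
```

Keeping the command as a list avoids quoting problems when file names contain spaces.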

In [48]:
# Run blastp in the Colab environment on the two sequences saved above:
# -subject is the sequence to align against and -query is the sequence being aligned.
# To align other sequences, replace "NP_000509.fasta" and "NP_001188320.fasta"
# with the filenames or paths of your own subject and query sequence files.



!/content/ncbi-blast-2.15.0+/bin/blastp -subject NP_000509.fasta -query NP_001188320.fasta

BLASTP 2.15.0+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.



Database: User specified sequence set (Input: NP_000509.fasta).
           1 sequences; 147 total letters



Query= NP_001188320.1 hemoglobin, beta adult s chain [Mus musculus]

Length=147
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

NP_000509.1 hemo
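
The `Score (Bits)` and `E Value` columns in the report are linked: in BLAST statistics the expected number of chance hits is approximately E = m · n · 2^(−S′), where S′ is the bit score and m and n are the effective lengths of the query and the subject. As a rough check, the sketch below plugs in the figures from the blastn report in the nucleotide section of this notebook (444 bits, query length 639, subject length 628, reported E-value 7e-129). It uses the raw lengths, which is an approximation, because BLAST itself uses slightly shorter effective lengths. The function name `approx_evalue` is hypothetical.

```python
def approx_evalue(bit_score, query_len, subject_len):
    """Approximate an E-value from a bit score and the two sequence lengths.

    Uses raw lengths in place of BLAST's effective lengths, so the result
    is expected to be somewhat larger than the value BLAST reports.
    """
    return query_len * subject_len * 2.0 ** (-bit_score)


estimate = approx_evalue(bit_score=444, query_len=639, subject_len=628)
print(f"{estimate:.1e}")
```

The estimate lands within a small factor of the reported 7e-129, which illustrates why a higher bit score or shorter sequences give a smaller E-value.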

## Perform nucleic acid (NA) pairwise alignment


In [7]:
!efetch -db nuccore -format fasta -id NM_001201391.1

>NM_001201391.1 Mus musculus hemoglobin, beta adult s chain (Hbb-bs), mRNA
CACATTTGCTTCTGACATAGTTGTGTTGACTCACAACCCCAGAAACAGACATCATGGTGCACCTGACTGA
TGCTGAGAAGGCTGCTGTCTCTGGCCTGTGGGGAAAGGTGAACGCCGATGAAGTTGGTGGTGAGGCCCTG
GGCAGGCTGCTGGTTGTCTACCCTTGGACCCAGCGGTACTTTGATAGCTTTGGAGACCTATCCTCTGCCT
CTGCTATCATGGGTAATGCCAAAGTGAAGGCCCATGGCAAGAAAGTGATAACTGCCTTTAACGATGGCCT
GAATCACTTGGACAGCCTCAAGGGCACCTTTGCCAGCCTCAGTGAGCTCCACTGTGACAAGCTGCATGTG
GATCCTGAGAACTTCAGGCTCCTGGGCAATATGATCGTGATTGTGCTGGGCCACCACCTGGGCAAGGATT
TCACCCCCGCTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCGCTGCCCTGGCTCACAAGTA
CCACTAAACCCCCTTTCCTGCTCTTGCCTGTGAACAATGGTTAATTGTTCCCAAGAGAGCATCTGTCAGT
TGTTGGCAAAATGATAAAGACATTTGAAAATCTGTCTTCTGACAAATAAAAAGCATTTATTTCACTGCAA
TGATGTTTT


In [6]:
# Fetch sequences from the NCBI nucleotide database (nuccore) by their accession IDs
# and save each one as a FASTA file.
!efetch -db nuccore -format fasta -id NM_000518.5 > NM_000518.fasta
!efetch -db nuccore -format fasta -id NM_001201391.1 > NM_001201391.fasta
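
The protein files above were aligned with `blastp`, while these nucleotide files call for `blastn`. As a simple illustration of that choice, the sketch below guesses the program from the letters in a sequence. The name `guess_blast_program` and the letter set are assumptions made for this example, and the heuristic can be fooled, for instance by a short peptide made up only of A, C, G, and T.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_blast_program(sequence):
    """Return 'blastn' if the sequence uses only nucleotide letters, else 'blastp'."""
    letters = set(sequence.upper())
    if not letters:
        raise ValueError("empty sequence")
    return "blastn" if letters <= NUCLEOTIDE_LETTERS else "blastp"


# Prefixes of the mouse mRNA and the human protein shown in this notebook.
print(guess_blast_program("CACATTTGCTTCTGACATAGTTGTGTTGAC"))  # blastn
print(guess_blast_program("MVHLTPEEKSAVTALWGKVNVDEVGGEALG"))  # blastp
```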


In [9]:
# Show the usage, options, and parameters available for the blastn command.
!/content/ncbi-blast-2.15.0+/bin/blastn -h

USAGE
  blastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-taxids taxids] [-negative_taxids taxids] [-taxidlist filename]
    [-negative_taxidlist filename] [-no_taxid_expansion]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-perc_identity float_value]
    [-qcov_hsp_perc float_value] [-max_hsps int_value]
    [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value] [-penalty penalty]
    [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value]
    [-template_ty

In [10]:
# This will execute the blastn command with the specified subject
# and query sequences in the Colab environment.

!/content/ncbi-blast-2.15.0+/bin/blastn -subject NM_000518.fasta -query NM_001201391.fasta

BLASTN 2.15.0+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: User specified sequence set (Input: NM_000518.fasta).
           1 sequences; 628 total letters



Query= NM_001201391.1 Mus musculus hemoglobin, beta adult s chain (Hbb-bs),
mRNA

Length=639
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA          444     7e-129


> NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA
Length=628

 Score = 444 bits (240),  Expect = 7e-129
 Identities = 422/512 (82%), Gaps = 4/512 (1%)
 Strand=Plus/Plus

Query  2    ACATTTGCTTCTGACATAGTTGTGTTGACTCACAACCCCAGAAACAGACATCATGGTGCA  61
            |||||||||||||||| |  |||||| |||  ||||| |  ||||||||| |||||||||
Sbjct  1    ACATTTGCTTCTGACACA