<a href="https://colab.research.google.com/github/erinijapranckeviciene/Bioinformatics_notebooks/blob/main/getseqs_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Precomputed lab guide for your hands-on practice

These exercises show how to access the NCBI databases from the command line and how to perform meaningful sequence analysis using BLAST.

# While performing this exercise, please take note of the following:
1. Note which programs from the BLAST suite are required to download a database and to obtain information about the databases
2. Consult the parameters of the blastn program to understand how to control, format and minimize the number of alignments in the BLAST output
3. Understand that the same read can match sequences of multiple species
4. Understand that the evalue parameter (E-value) is the parameter by which BLAST ranks the sequence matches
5. Understand that alignment length, percent identity and coverage are important parameters in interpreting the BLAST results
6. Think of how you would filter the contaminants from the results
7. Think of how you would quantify the species from the BLAST results

In [None]:
# Initialize a working directory
LD=~/lesson
cd $LD
ls

data  data_prev  ipynb  pages


In [None]:
# Download the BLAST program suite
# https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html#blast-executables

In [None]:
# Make folder for installs
mkdir -p $LD/installs && cd $LD/installs

# Get BLAST programs
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.1+-x64-linux.tar.gz

# Extract the archive
tar -xvzf ncbi-blast-2.14.1+-x64-linux.tar.gz
ls

--2023-11-01 13:18:38--  https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.14.1+-x64-linux.tar.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.7, 130.14.250.13, 2607:f220:41e:250::11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 251154151 (240M) [application/x-gzip]
Saving to: ‘ncbi-blast-2.14.1+-x64-linux.tar.gz’


2023-11-01 13:19:00 (11.2 MB/s) - ‘ncbi-blast-2.14.1+-x64-linux.tar.gz’ saved [251154151/251154151]

ncbi-blast-2.14.1+/
ncbi-blast-2.14.1+/README
ncbi-blast-2.14.1+/doc/
ncbi-blast-2.14.1+/doc/README.txt
ncbi-blast-2.14.1+/BLAST_PRIVACY
ncbi-blast-2.14.1+/ChangeLog
ncbi-blast-2.14.1+/bin/
ncbi-blast-2.14.1+/bin/tblastx
ncbi-blast-2.14.1+/bin/blast_formatter_vdb
ncbi-blast-2.14.1+/bin/blastp
ncbi-blast-2.14.1+/bin/blastdbcmd
ncbi-blast-2.14.1+/bin/makeprofiledb
ncbi-blast-2.14.1+/bin/makeblastdb
ncbi-blast-2.14.1+/bin/segmask

In [None]:
# What programs were downloaded
ls $LD/installs/ncbi-blast-2.14.1+/bin/

blastdb_aliastool    cleanup-blastdb-volumes.py  rpsblast
blastdbcheck         convert2blastmask           rpstblastn
blastdbcmd           deltablast                  segmasker
blast_formatter      dustmasker                  tblastn
blast_formatter_vdb  get_species_taxids.sh       tblastn_vdb
blastn               legacy_blast.pl             tblastx
blastn_vdb           makeblastdb                 update_blastdb.pl
blastp               makembindex                 windowmasker
blast_vdb_cmd        makeprofiledb
blastx               psiblast


In [None]:
# Add the BLAST programs to PATH so that they can be called by name
export PATH=$LD/installs/ncbi-blast-2.14.1+/bin/:$PATH

In [None]:
# The update_blastdb.pl program manages the download and update of BLAST databases
# What options are available?
update_blastdb.pl --help

In [None]:
# show all databases available for download
update_blastdb.pl --showall

ITS_RefSeq_Fungi
18S_fungal_sequences
ITS_eukaryote_sequences
16S_ribosomal_RNA
Betacoronavirus
28S_fungal_sequences
LSU_eukaryote_rRNA
LSU_prokaryote_rRNA
SSU_eukaryote_rRNA
env_nt
env_nr
nt_prok
nt_viruses
pataa
patnt
human_genome
landmark
mito
mouse_genome
nr
nt_euk
nt
nt_others
pdbaa
pdbnt
ref_euk_rep_genomes
ref_prok_rep_genomes
ref_viroids_rep_genomes
ref_viruses_rep_genomes
refseq_select_rna
refseq_select_prot
refseq_protein
refseq_rna
swissprot
tsa_nr
tsa_nt
taxdb


#### BLAST databases 16S_ribosomal_RNA and taxdb

In [None]:
# create a folder for BLAST database download and descend into it to initiate the download
mkdir -p $LD/installs/blastdb && cd $LD/installs/blastdb

In [None]:
# download the 16S_ribosomal_RNA and taxonomy databases. Without the taxonomy database there will be no scientific names of species
update_blastdb.pl --decompress --verbose 16S_ribosomal_RNA taxdb

Downloading https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz... [OK]
Downloading https://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz... [OK]
Decompressing 16S_ribosomal_RNA.tar.gz ... [OK]
Decompressing taxdb.tar.gz ... [OK]


In [None]:
# List downloaded database files
ls

16S_ribosomal_RNA.ndb  16S_ribosomal_RNA.nos         taxdb.btd
16S_ribosomal_RNA.nhr  16S_ribosomal_RNA.not         taxdb.bti
16S_ribosomal_RNA.nin  16S_ribosomal_RNA.nsq         taxdb.tar.gz.md5
16S_ribosomal_RNA.nnd  16S_ribosomal_RNA.ntf         taxonomy4blast.sqlite3
16S_ribosomal_RNA.nni  16S_ribosomal_RNA.nto
16S_ribosomal_RNA.nog  16S_ribosomal_RNA.tar.gz.md5


In [None]:
# export the BLASTDB path so that the BLAST programs can find the databases
# for background on BLAST, refer to the class materials
export BLASTDB=$LD/installs/blastdb

# print the variable to confirm that it is set
echo $BLASTDB

/home/erin/lesson/installs/blastdb


In [None]:
# another useful program is blastdbcmd
# Explore parameters of blastdbcmd. How can you obtain a list of program parameters and descriptions?
# blastdbcmd --help
# what is wrong in the above command? What is the correct command?

In [None]:
# here blastdbcmd is used to gather information about the database. What other information can be obtained using blastdbcmd?
blastdbcmd -metadata -db 16S_ribosomal_RNA

{
  "version": "1.2",
  "dbname": "16S_ribosomal_RNA",
  "dbtype": "Nucleotide",
  "db-version": 5,
  "description": "16S ribosomal RNA (Bacteria and Archaea type strains)",
  "number-of-letters": 39006255,
  "number-of-sequences": 26910,
  "last-updated": "2023-10-21T05:36:00",
  "number-of-volumes": 1,
  "number-of-taxids": 20929,
  "bytes-total": 17871139,
  "bytes-to-cache": 10202658,
  "files": [
    "16S_ribosomal_RNA.ndb",
    "16S_ribosomal_RNA.nhr",
    "16S_ribosomal_RNA.nin",
    "16S_ribosomal_RNA.nnd",
    "16S_ribosomal_RNA.nni",
    "16S_ribosomal_RNA.nog",
    "16S_ribosomal_RNA.nos",
    "16S_ribosomal_RNA.not",
    "16S_ribosomal_RNA.nsq",
    "16S_ribosomal_RNA.ntf",
    "16S_ribosomal_RNA.nto"
  ]
}
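
The metadata fields can also be read programmatically. The following is a minimal sketch, not part of the original lab, that parses a trimmed copy of the JSON shown above with Python's json module and derives two summary numbers. The variable names are assumptions chosen for illustration.

```python
import json

# Trimmed copy of the blastdbcmd -metadata output shown above
metadata_text = """
{
  "dbname": "16S_ribosomal_RNA",
  "number-of-letters": 39006255,
  "number-of-sequences": 26910,
  "number-of-taxids": 20929
}
"""

meta = json.loads(metadata_text)

# Mean length of a database sequence, in nucleotides
mean_length = meta["number-of-letters"] / meta["number-of-sequences"]

# Average number of sequences per taxonomic ID
seqs_per_taxid = meta["number-of-sequences"] / meta["number-of-taxids"]

print(f"mean sequence length: {mean_length:.1f}")
print(f"sequences per taxid: {seqs_per_taxid:.2f}")
```

A mean length of roughly 1,450 nucleotides is consistent with near full-length 16S rRNA genes, which are considerably longer than the reads queried later in this lab.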


#### Obtain SRA Toolkit

In [None]:
cd $LD/installs

# More information on obtaining the SRA Toolkit
# https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

# get SRA Toolkit
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-centos_linux64.tar.gz

# unpack
tar -xzvf sratoolkit.3.0.7-centos_linux64.tar.gz

--2023-11-01 13:22:57--  https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-centos_linux64.tar.gz
Resolving ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.11, 2607:f220:41e:250::10, ...
Connecting to ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)|130.14.250.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93172877 (89M) [application/x-gzip]
Saving to: ‘sratoolkit.3.0.7-centos_linux64.tar.gz’


2023-11-01 13:23:07 (10.4 MB/s) - ‘sratoolkit.3.0.7-centos_linux64.tar.gz’ saved [93172877/93172877]

sratoolkit.3.0.7-centos_linux64/
sratoolkit.3.0.7-centos_linux64/README.md
sratoolkit.3.0.7-centos_linux64/README-vdb-config
sratoolkit.3.0.7-centos_linux64/schema/
sratoolkit.3.0.7-centos_linux64/schema/vdb/
sratoolkit.3.0.7-centos_linux64/schema/vdb/vdb.vschema
sratoolkit.3.0.7-centos_linux64/schema/vdb/built-in.vschema
sratoolkit.3.0.7-centos_linux64/schema/insdc/
sratoolkit.3.0.7-centos_linux64/schema/in

In [None]:
# see the contents of the folder
ls $LD/installs/sratoolkit.3.0.7-centos_linux64/

bin  CHANGES  example  README-blastn  README.md  README-vdb-config  schema


In [None]:
# see the contents of the bin folder: which commands are available in the SRA Toolkit?
ls $LD/installs/sratoolkit.3.0.7-centos_linux64/bin/

abi-dump                 kar                    sra-sort
abi-dump.3               kar.3                  sra-sort.3
abi-dump.3.0.7           kar.3.0.7              sra-sort.3.0.7
abi-load                 kdbmeta                sra-sort-cg
abi-load.3               kdbmeta.3              sra-sort-cg.3
abi-load.3.0.7           kdbmeta.3.0.7          sra-sort-cg.3.0.7
align-info               latf-load              sra-stat
align-info.3             latf-load.3            sra-stat.3
align-info.3.0.7         latf-load.3.0.7        sra-stat.3.0.7
bam-load                 ncbi                   sratools.3.0.7
bam-load.3

In [None]:
# export the path to call the toolkit tools
export PATH=$LD/installs/sratoolkit.3.0.7-centos_linux64/bin/:$PATH


In [None]:
# test that SRA Toolkit commands work: check that fastq-dump can be invoked and is known to the system
# Note how to use fastq-dump to obtain sequences from the NCBI Sequence Read Archive (SRA)
fastq-dump -help

#### Download sequences from BIOPROJECT PRJNA89935

In [None]:
# Sample_25 from BIOPROJECT
# https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR22276225&display=metadata

In [None]:
mkdir -p $LD/data && cd $LD/data

In [None]:
# this command shows the weblink - URL - of the accession
srapath SRR22276225

https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR22276225/SRR22276225


In [None]:
# this command retrieves the sequences of the given accession in FASTA format. From the link above we can infer that NCBI hosts the sequences on AWS
fastq-dump --fasta SRR22276225

Read 41366 spots for SRR22276225
Written 41366 spots for SRR22276225


In [None]:
# inspect the folder to see the output of fastq-dump
ls -la

total 21928
drwxrwxr-x 2 erin erin     4096 Nov  1 11:47 .
drwxrwxr-x 7 erin erin     4096 Nov  1 11:45 ..
-rw-rw-r-- 1 erin erin 22445815 Nov  1 11:47 SRR22276225.fasta


In [None]:
# To test BLAST we need just one read, so we need to determine where its sequence ends
# Take note of the FASTA file format: a header line followed by sequence lines
head -n20 SRR22276225.fasta

>SRR22276225.1 1 length=339
CACTTGTACTTCGTTCGGTTACGTATTGCTAGGTTAACTACTTACGAAGCTGAGGGACTGCCAGCACCTG
TGCCAGCCGCCGCGGTAATACGTAGGGGCGAGCGTTGTCCCGGAATTACTGGGCGTAAGGGCACGCAGGC
TGTGCTTCAAGTCAGCTGTAAAGGATGCGGCCGCCGTGTTATGCGGCTGAGACTGAGGTGCTGGAGTACC
GGAGGCAGAGTGGAATTCCCAGTGTAGCGTTAAATGCGTAGATATTGGAAGAACATCGGTGTCGAAGGCG
ACTTGCGGACGGTAACTGACGCTGAGATTTGCGAAAGCCTGGTAGCAAACCGGATTAGA
>SRR22276225.2 2 length=357
AGCATCTTGTACTTCGTTCAGTTACGTATTGCTAAGGTTAACTACTTACGAAGCTGAGGGACTGCCAGCA
CCTGTGCCAGCAACCGCGGTATGCAATGTCACGACGTTATCCGGATTTATTGGGCGTAAAGCGCGTCTAG
GTGGTCATGTAAGTCTGATGTGAAAATGCAGGGCTCAACTCTGTATTGCGTTGGAAACTGTGTAACTAAG
AGTACTGGAGAGGTAAGCGGAACTACAAGTGTAGAGGTGAAATTCGTAGATATTTGTAGGAATGCCGATG
GGAAGCCAGCTTGCCTGGACAGATACTGACGCTGGCGCGAAGCGTGGGTAGCAAGCAGGATTGGAATACT
GTAGTCC
>SRR22276225.3 3 length=248
CCGTCCATGTGCTGCGTTCCGGTTACGTATTGCTAAGGTTAACTACTTACGAAGCTGAGGGACTGCCAGC
ACCTCCGTCAATTCCTTTGAGTTTCATACTTGCGTACGTACTCCCCAGGCGGATTACCTATCGCGTTGCT
TGGGCGCTGAGGTTCGACCCCCAACACCTAGTAATCATCGTTTACGGCGTGGACTACCCGGGTATCT

In [None]:
# The first read takes 6 lines
head -n6 SRR22276225.fasta

>SRR22276225.1 1 length=339
CACTTGTACTTCGTTCGGTTACGTATTGCTAGGTTAACTACTTACGAAGCTGAGGGACTGCCAGCACCTG
TGCCAGCCGCCGCGGTAATACGTAGGGGCGAGCGTTGTCCCGGAATTACTGGGCGTAAGGGCACGCAGGC
TGTGCTTCAAGTCAGCTGTAAAGGATGCGGCCGCCGTGTTATGCGGCTGAGACTGAGGTGCTGGAGTACC
GGAGGCAGAGTGGAATTCCCAGTGTAGCGTTAAATGCGTAGATATTGGAAGAACATCGGTGTCGAAGGCG
ACTTGCGGACGGTAACTGACGCTGAGATTTGCGAAAGCCTGGTAGCAAACCGGATTAGA


In [None]:
# save one read after determining the number of lines
head -n6 SRR22276225.fasta > read1.fasta
cat read1.fasta

>SRR22276225.1 1 length=339
CACTTGTACTTCGTTCGGTTACGTATTGCTAGGTTAACTACTTACGAAGCTGAGGGACTGCCAGCACCTG
TGCCAGCCGCCGCGGTAATACGTAGGGGCGAGCGTTGTCCCGGAATTACTGGGCGTAAGGGCACGCAGGC
TGTGCTTCAAGTCAGCTGTAAAGGATGCGGCCGCCGTGTTATGCGGCTGAGACTGAGGTGCTGGAGTACC
GGAGGCAGAGTGGAATTCCCAGTGTAGCGTTAAATGCGTAGATATTGGAAGAACATCGGTGTCGAAGGCG
ACTTGCGGACGGTAACTGACGCTGAGATTTGCGAAAGCCTGGTAGCAAACCGGATTAGA
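
Counting lines by eye works for a single read, but the record boundary can also be found from the FASTA format itself: a record runs from one header line (starting with ">") to the next. The sketch below illustrates this under the assumption of a plain-text FASTA stream; the function name and the toy input are hypothetical and stand in for SRR22276225.fasta.

```python
import io

def first_fasta_record(handle):
    """Return (header, sequence) for the first record in a FASTA stream."""
    header = None
    seq_lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                break  # a second header marks the end of the first record
            header = line[1:]
        elif header is not None:
            seq_lines.append(line.strip())
    return header, "".join(seq_lines)

# Toy input with shortened sequences (hypothetical data, for illustration only)
example = io.StringIO(
    ">read.1 1 length=12\nACGTAC\nGTACGT\n>read.2 2 length=4\nTTTT\n"
)
header, sequence = first_fasta_record(example)
print(header)
print(len(sequence))
```

For a real file, the same function could be called on an open file handle, for example with open("SRR22276225.fasta") as handle.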


In [None]:
# Inspect the working data directory
ls -la

total 21932
drwxrwxr-x 2 erin erin     4096 Nov  1 11:53 .
drwxrwxr-x 7 erin erin     4096 Nov  1 11:45 ..
-rw-rw-r-- 1 erin erin      372 Nov  1 11:54 read1.fasta
-rw-rw-r-- 1 erin erin 22445815 Nov  1 11:47 SRR22276225.fasta


In [None]:
# This is the path to the local BLAST database. The variable $BLASTDB must be initialized. See the BLAST manual.
echo $BLASTDB

/home/erin/lesson/installs/blastdb


In [None]:
# We will use the blastn program from the BLAST suite
# How can you find out the parameters of blastn?
# blastn -help

#### Blast one read

In [None]:
# BLAST - Basic Local Alignment Search Tool
# It searches the sequence database for organism sequences that match the unknown sequence in our read
# It returns many possible alignments ranked by E-value: the number of alignments with an equal or better score expected by chance in a database of this size (similar in spirit to a p-value)
blastn -db 16S_ribosomal_RNA -query read1.fasta

BLASTN 2.14.1+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: 16S ribosomal RNA (Bacteria and Archaea type strains)
           26,910 sequences; 39,006,255 total letters



Query= SRR22276225.1 1 length=339

Length=339
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

NR_043061.1 Aminiphilus circumscriptus strain ILE-2 16S ribosomal...  161     2e-39
NR_025106.1 Paenibacillus brasilensis strain PB1 72 16S ribosomal...  150     5e-36
NR_133849.1 Rubrobacter aplysinae strain RV113 16S ribosomal RNA,...  111     2e-24


>NR_043061.1 Aminiphilus circumscriptus strain ILE-2 16S ribosomal RNA, partial 
sequence
Length=1416

 Score = 161 bits (87),  Expect = 2e-39
 Identities = 216/277 (78%), Gaps = 14/277 (5%)
 Strand=Plus/Plus

Query  73   CCAGCCGCC
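
The E-value in the output above can be related to the bit score through the Karlin-Altschul relation E = m * n * 2^(-S'), where m is the query length, n is the database size in letters and S' is the bit score. The sketch below plugs in the raw lengths from the output. BLAST itself uses shorter effective lengths, so the result should be expected to agree with the reported 2e-39 only in order of magnitude. The function name is an assumption for illustration.

```python
def approx_evalue(bit_score, query_len, db_letters):
    """Approximate E-value from a bit score, using raw (not effective) lengths."""
    return query_len * db_letters * 2.0 ** (-bit_score)

# Values taken from the blastn output above:
# bit score 161, query length 339, database size 39,006,255 letters
e = approx_evalue(bit_score=161, query_len=339, db_letters=39_006_255)
print(f"approximate E-value: {e:.1e}")

# Doubling the database size doubles the expected number of chance hits
e_double = approx_evalue(161, 339, 2 * 39_006_255)
print(f"with a database twice as large: {e_double:.1e}")
```

The second call illustrates why the same alignment receives a different E-value when searched against a larger database.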

In [None]:
# A more compact, tab-separated BLAST output format (-outfmt 6) with a chosen set of columns
blastn -max_target_seqs 5 -db 16S_ribosomal_RNA -query read1.fasta -outfmt "6 qseqid ssciname qlen sacc slen evalue pident qcovs length"

SRR22276225.1	Aminiphilus circumscriptus	339	NR_043061	1416	2.41e-39	77.978	79	277
SRR22276225.1	Paenibacillus brasilensis	339	NR_025106	1384	5.22e-36	77.465	80	284
SRR22276225.1	Rubrobacter aplysinae	339	NR_133849	1512	2.46e-24	74.912	80	283
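
The tabular output has no header row, so the column names must be carried over from the -outfmt string. The following sketch, using only the Python standard library, attaches those names and converts numeric fields. The function name and the split into integer and float columns are assumptions made for illustration.

```python
import csv
import io

# Column names in the same order as the -outfmt string used above
COLUMNS = ["qseqid", "ssciname", "qlen", "sacc", "slen",
           "evalue", "pident", "qcovs", "length"]
INT_COLUMNS = {"qlen", "slen", "qcovs", "length"}
FLOAT_COLUMNS = {"evalue", "pident"}

def parse_tabular(handle, columns=COLUMNS):
    """Parse tab-separated BLAST output into a list of dictionaries."""
    hits = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # skip blank lines
        hit = dict(zip(columns, fields))
        for name, value in list(hit.items()):
            if name in INT_COLUMNS:
                hit[name] = int(value)
            elif name in FLOAT_COLUMNS:
                hit[name] = float(value)
        hits.append(hit)
    return hits

# Two lines copied from the output above, with tabs written explicitly
example = io.StringIO(
    "SRR22276225.1\tAminiphilus circumscriptus\t339\tNR_043061\t1416\t2.41e-39\t77.978\t79\t277\n"
    "SRR22276225.1\tPaenibacillus brasilensis\t339\tNR_025106\t1384\t5.22e-36\t77.465\t80\t284\n"
)
hits = parse_tabular(example)
for hit in hits:
    print(hit["ssciname"], hit["pident"], hit["length"])
```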


In [None]:
# Query all reads in the downloaded FASTA file
# This operation may take several minutes. You may stop the cell at any time
# The output of BLAST is written to a file named blast_output
# Run blastn -help to find out about the parameters
# Note: -perc_identity and -qcov_hsp_perc are percentages on a 0-100 scale, so the values 0.99 and 0.98 used here filter out almost nothing
blastn -max_hsps 1 -perc_identity 0.99  -qcov_hsp_perc 0.98  -max_target_seqs 3 -db 16S_ribosomal_RNA -query SRR22276225.fasta -outfmt "6 qseqid ssciname qlen sacc slen evalue pident qcovs" -out blast_output




In [None]:
# Examine the BLAST output in the file blast_output
head -n20 blast_output

SRR22276225.1	Aminiphilus circumscriptus	339	NR_043061	1416	2.41e-39	77.978	79
SRR22276225.1	Paenibacillus brasilensis	339	NR_025106	1384	5.22e-36	77.465	80
SRR22276225.1	Rubrobacter aplysinae	339	NR_133849	1512	2.46e-24	74.912	80
SRR22276225.2	Fusobacterium nucleatum subsp. nucleatum ATCC 25586	357	NR_114702	1479	3.81e-117	93.197	80
SRR22276225.2	Fusobacterium nucleatum subsp. nucleatum ATCC 25586	357	NR_074412	1502	3.81e-117	93.197	80
SRR22276225.2	Fusobacterium nucleatum subsp. nucleatum ATCC 25586	357	NR_117287	1442	3.81e-117	93.197	80
SRR22276225.3	Fusobacterium polymorphum ATCC 10953	248	NR_117842	1380	7.52e-83	98.286	70
SRR22276225.3	Fusobacterium vincentii ATCC 51190	248	NR_117841	1400	7.52e-83	98.286	70
SRR22276225.3	Fusobacterium nucleatum subsp. nucleatum ATCC 25586	248	NR_114702	1479	7.52e-83	98.286	70
SRR22276225.4	Fusobacterium nucleatum subsp. nucleatum ATCC 25586	529	NR_114702	1479	5.38e-167	92.683	77
SRR22276225.4	Fusobacterium nucleatum subsp. nucleatum ATCC 25586	529
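
Percent identity and query coverage vary widely between reads in this output, so weak hits can be separated from strong ones after the search. The sketch below keeps only hits above two thresholds. The threshold values (90 and 75) and the function name are hypothetical and chosen only to illustrate the idea; suitable cut-offs depend on the data.

```python
def keep_strong_hits(hits, min_pident, min_qcovs):
    """Keep hits whose percent identity and query coverage reach the thresholds."""
    return [
        hit for hit in hits
        if hit["pident"] >= min_pident and hit["qcovs"] >= min_qcovs
    ]

# First hit of reads 1 to 3, copied from the output above
hits = [
    {"qseqid": "SRR22276225.1", "ssciname": "Aminiphilus circumscriptus",
     "pident": 77.978, "qcovs": 79},
    {"qseqid": "SRR22276225.2",
     "ssciname": "Fusobacterium nucleatum subsp. nucleatum ATCC 25586",
     "pident": 93.197, "qcovs": 80},
    {"qseqid": "SRR22276225.3", "ssciname": "Fusobacterium polymorphum ATCC 10953",
     "pident": 98.286, "qcovs": 70},
]

# Hypothetical thresholds: at least 90% identity and 75% query coverage
strong = keep_strong_hits(hits, min_pident=90.0, min_qcovs=75)
for hit in strong:
    print(hit["qseqid"], hit["ssciname"])
```

With these thresholds, read 1 is dropped for low identity and read 3 for low coverage, which shows why both columns need to be considered together.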

In [None]:
# Examine the output that the blastn program created
# This output is a snapshot from a different sample

SRR22276203.1	Dermabacter vaginalis	526	NR_148832	1450	9.53e-120	87.129	72
SRR22276203.1	Dermabacter jinjuensis	526	NR_149775	1390	9.60e-115	86.386	72
SRR22276203.1	Dermabacter hominis	526	NR_026271	1511	9.60e-115	86.386	72
SRR22276203.2	Brachybacterium faecium DSM 4810	411	NR_074655	1520	1.22e-122	93.939	70
SRR22276203.2	Brachybacterium massiliense	411	NR_179438	1483	1.22e-122	93.939	70
SRR22276203.2	Brachybacterium faecium DSM 4810	411	NR_119205	1513	1.22e-122	93.939	70
SRR22276203.4	Brachybacterium aquaticum	281	NR_152653	1423	1.43e-85	93.897	74
SRR22276203.4	Brachybacterium paraconglomeratum	281	NR_025502	1472	1.43e-85	93.897	74
SRR22276203.4	Brachybacterium alimentarium	281	NR_026269	1512	1.43e-85	93.897	74
SRR22276203.5	Brachybacterium paraconglomeratum	465	NR_025502	1472	1.32e-162	92.611	85


#### Quantification of species
#### Think of an algorithm and Python code that you could write to quantify species in the BLAST results (YOU DON'T HAVE TO IMPLEMENT IT).
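
One possible approach, offered as a sketch and not as the expected answer, is to assign each read to its first-listed hit and then count how often each species occurs. The function name and the toy read and species names below are hypothetical. The sketch ignores ties between equally good hits, which a fuller solution would need to handle.

```python
from collections import Counter

def species_abundance(hits):
    """Count the first-listed species per read and return relative abundances."""
    first_hit = {}
    for read_id, species in hits:
        # setdefault keeps the first species seen for each read
        first_hit.setdefault(read_id, species)

    counts = Counter(first_hit.values())
    total = sum(counts.values())
    return {species: n / total for species, n in counts.most_common()}

# Toy input: (read ID, species) pairs in the order BLAST would report them
hits = [
    ("read.1", "Species A"),
    ("read.1", "Species B"),
    ("read.2", "Species A"),
    ("read.3", "Species C"),
    ("read.4", "Species A"),
]

abundance = species_abundance(hits)
for species, fraction in abundance.items():
    print(f"{species}: {fraction:.0%}")
```

Filtering weak hits before counting, or counting at the genus level, are design choices that change the resulting abundances and are worth considering.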

# Computational Laboratory Task:
### Make the necessary setup on your local machine
### Obtain sample_3 from the Sequence Read Archive and query all of its sequences against the BLAST 16S_ribosomal_RNA database as in cell [35]
### Display 30 results from the file
### Answer the question: Approximately what percent identity and alignment lengths do you see in these results?


In [None]:
# What steps would you take to count how many organisms matched the query and in what abundance?
# You can use Python to write a small script
import numpy as np
import pandas as pd
