# CB2-101: Sequence Similarity Search
Malay (malay@uab.edu)

## Note

NCBI has recently removed the C BLAST programs from their website. Please download the last C BLAST release from our own copy to run the examples in this handout.

In [None]:
wget "http://cmb.path.uab.edu/training/2018/files/blast-2.2.26.tar.gz"
tar -xvzf blast-2.2.26.tar.gz

Put BLAST in your path.

In [None]:
cd blast-2.2.26/bin
export PATH=$PATH:`pwd`

Check that `blastall` is in your path.

In [None]:
which blastall
cd ../..

Check that you are outside the blast directory.

In [None]:
pwd

# Simple BLAST

BLAST p53 human sequence against SwissProt database.

## Download sequence files

Download the SwissProt FASTA file:

In [None]:
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

Download the human p53 sequence (`P53_HUMAN`) using the UniProt API.

In [None]:
wget -q -O p53.fas http://www.uniprot.org/uniprot/P53_HUMAN.fasta


## Create the BLAST database and run BLASTP

Check the `formatdb` command help. 

In [None]:
formatdb -

In [None]:
zcat uniprot_sprot.fasta.gz | formatdb -i stdin -t "swissprot" -o T -n "swissprot"

Note how we used the compressed file in a pipe to create the BLAST database. Now we will run the actual BLAST search of the p53 protein against the SwissProt database.
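The pipe works because the decompressed FASTA is written to standard output and never has to be stored on disk. The following is a minimal sketch of the same streaming pattern on a toy file. The file name `toy.fasta` and its two records are made up for illustration, and `grep` stands in for `formatdb` so that the example runs without BLAST installed.

```shell
# Create a tiny FASTA file; the names and sequences are hypothetical.
printf '>seq1\nMEEPQSDPSV\n>seq2\nMTAMEESQSD\n' > toy.fasta

# Compress it, keeping only the .gz copy on disk.
gzip -f toy.fasta

# Stream the decompressed records into a downstream command.
# `gzip -dc` does the same job as `zcat`; here grep counts the header lines.
n_records=$(gzip -dc toy.fasta.gz | grep -c '^>')
echo "records: $n_records"
```

Any program that can read from standard input can take the place of `grep` here, which is what `formatdb -i stdin` does in the cell above.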

In [None]:
blastall -p blastp -i p53.fas -d swissprot -o output.bla

To get tabular output with comment lines, use the `-m 9` option.

In [None]:
blastall -p blastp -i p53.fas -d swissprot -o output_tabular.bla -m 9
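The tabular format is convenient for filtering with standard shell tools. Lines starting with `#` are comments, and each hit line has twelve tab-separated columns: query id, subject id, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, and bit score. Below is a rough sketch that keeps only hits with an e-value below `1e-5`. The file `toy_tabular.bla` and all of its hit rows are invented for illustration; with real data you would point `awk` at `output_tabular.bla` instead.

```shell
# Toy stand-in for output_tabular.bla; the hit rows are invented for illustration.
{
  printf '# BLASTP 2.2.26\n'
  printf '# Fields: Query id, Subject id, %% identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score\n'
  printf 'query1\thitA\t100.00\t393\t0\t0\t1\t393\t1\t393\t0.0\t800\n'
  printf 'query1\thitB\t45.20\t210\t100\t5\t90\t299\t80\t289\t2e-40\t150\n'
  printf 'query1\thitC\t28.00\t50\t36\t0\t10\t59\t5\t54\t3.1\t30\n'
} > toy_tabular.bla

# Skip comment lines, keep hits whose e-value (column 11) is below 1e-5,
# and print subject id, percent identity, and e-value.
good_hits=$(awk -F '\t' '!/^#/ && $11 + 0 < 1e-5 { print $2, $3, $11 }' toy_tabular.bla)
echo "$good_hits"
```

The `1e-5` cutoff is only an example threshold; choose one that suits your search.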

## Is p53 present in the yeast genome?

In this exercise you will download the yeast proteome and search the same p53 FASTA file against it. The idea is to see whether p53 is present in the yeast genome.



First, let's download the *Saccharomyces cerevisiae* (SC) proteome file.

In [None]:
# Download the SC RefSeq protein FASTA file from NCBI
wget --quiet -O sc.faa.gz ftp://ftp.ncbi.nih.gov/genomes/refseq/fungi/Saccharomyces_cerevisiae/reference/GCF_000146045.2_R64/GCF_000146045.2_R64_protein.faa.gz


Now format the database as shown above and run the p53.fas file against this database. What is your observation?




## HMMER

```HMMER``` is the most widely used software for profile hidden Markov model (HMM) searches in bioinformatics. You can download HMMER from 

http://hmmer.janelia.org/software

Several programs are bundled with the HMMER distribution. The 4 most common ones are:

1. ```hmmsearch``` - Searches one or more profile HMMs against a protein sequence database
2. ```hmmscan``` - Searches protein sequences against a profile HMM library
3. ```hmmbuild``` - Builds a profile HMM from a multiple sequence alignment
4. ```hmmpress``` - Converts a flat file HMM library to the binary format required by ```hmmscan```


We will search the SwissProt database with a profile of the p53 protein.

We will first get a set of p53 orthologs from the HomoloGene database.



In [None]:
wget --quiet ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data

Now we will grep the file to find p53.

In [None]:
cat homologene.data | grep -i tp53 | head -n 10

It looks like cluster 460 contains TP53. We will use a bit of shell scripting to get the protein accessions.

In [None]:
cat homologene.data | grep "^460" | grep -w -i TP53 | cut -f 6 >p53_homologene_ids.txt
cat p53_homologene_ids.txt
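Note that `grep "^460"` also matches any group id that merely starts with 460 (for example 4601); the second `grep` on the gene symbol is what removes such rows here. As an alternative sketch, `awk` can compare whole columns directly. This assumes the column layout implied by the command above (group id in column 1, gene symbol in column 4, protein accession in column 6). The file `toy_homologene.data` and every row in it are made up for illustration.

```shell
# Toy stand-in for homologene.data; all rows are made up for illustration.
# Columns: group id, taxonomy id, gene id, gene symbol, protein gi, protein accession
{
  printf '460\t9606\t1001\tTP53\t111\tNP_TOY001.1\n'
  printf '460\t10090\t1002\tTrp53\t222\tNP_TOY002.1\n'
  printf '460\t10116\t1003\tTp53\t333\tNP_TOY003.1\n'
  printf '4601\t9606\t1004\tABC1\t444\tNP_TOY004.1\n'
} > toy_homologene.data

# Keep rows whose group id is exactly 460 and whose gene symbol is TP53
# (case-insensitive), then print the accession column.
awk -F '\t' '$1 == 460 && toupper($4) == "TP53" { print $6 }' toy_homologene.data > toy_ids.txt
cat toy_ids.txt
```

As with `grep -w -i TP53`, a symbol such as `Trp53` is not selected by this filter, so check the cluster by eye if you expect orthologs with different gene names.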

We will now use NCBI eutils to fetch those sequences. Let's create a FASTA file from these ids:

In [None]:
for i in `cat p53_homologene_ids.txt`
do 
wget -q -O - "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?\
db=protein&id=$i&rettype=fasta&retmode=text"
done >p53_all.fas
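Because `wget -q` is silent, a failed request would simply leave a sequence out of the FASTA file. A simple sanity check is to compare the number of ids with the number of FASTA header lines. The sketch below uses toy files (`toy_p53_ids.txt` and `toy_all.fas`, both with hypothetical contents) so that it runs without network access; with real data, substitute `p53_homologene_ids.txt` and `p53_all.fas`.

```shell
# Toy stand-ins for p53_homologene_ids.txt and p53_all.fas (contents are hypothetical).
printf 'NP_TOY001.1\nNP_TOY002.1\nNP_TOY003.1\n' > toy_p53_ids.txt
printf '>NP_TOY001.1 toy protein\nMEEPQ\n>NP_TOY003.1 toy protein\nMTAME\n' > toy_all.fas

# Count non-empty id lines and FASTA header lines.
n_ids=$(grep -c . toy_p53_ids.txt)
n_seqs=$(grep -c '^>' toy_all.fas)

# Report whether every id produced a sequence.
if [ "$n_ids" -eq "$n_seqs" ]; then
  status="complete"
else
  status="missing $((n_ids - n_seqs)) sequence(s)"
fi
echo "$status"
```

In this toy case one of the three ids has no matching record, so the check reports a missing sequence.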

We will first use ```muscle``` to align those sequences and create the HMM, then search the SwissProt database with it. First, download ```muscle``` and put it in your path.

In [None]:
wget https://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz
tar -xvzf muscle3.8.31_i86linux64.tar.gz
mv muscle3.8.31_i86linux64 muscle
# Add the current directory to the path so that `which` can find muscle
export PATH=$PATH:`pwd`
which muscle

Download and prepare `hmmer`.

In [None]:
wget http://eddylab.org/software/hmmer/hmmer-3.3.tar.gz
tar -xvzf hmmer-3.3.tar.gz

Compile `hmmer`.

In [None]:
cd hmmer-3.3
./configure
make

Put `hmmer` in path.

In [None]:
current_dir=`pwd`
cd src
export PATH=$PATH:`pwd`
cd $current_dir
cd ..
pwd
which hmmbuild

Run muscle to create the alignment.

In [None]:
./muscle -in p53_all.fas -out p53.aln

Create the HMM from the alignment.

In [None]:
hmmbuild --informat afa p53.hmm p53.aln

Search the SwissProt database with the HMM.

In [None]:
hmmsearch -o hits.txt p53.hmm uniprot_sprot.fasta.gz

Check the result.

In [None]:
cat hits.txt | head