Using Binder to populate a terminal environment that contains the alignment tools HMMER.
We recently finished annotating the Bryozoan Bugula neritina, which is hundreds of millions of years old, and very difficult to relate to other known genes in invertebrates.
Here is a picture of what a Bryozoan looks like. They colonize rocks in the marine environment.
Use MyD88: innate immune signal transduction adaptor
Innate immunity is a pathogenic defense mechanism in invertebrates. The genes can be highly conserved. MyD88 is a known signal transduction adaptor for toll-like receptors that triggers a pathogen-associated molecular pattern of response in other organisms. Innate immunity genes will allow organisms to recognize their invaders and fend them off or allow them to cohabitate.
Bryozoans have a population of bacterial symbionts, much like the relationship between zooxanthellae and coral. These microbial organisms are important for fighting cancer, as they produce a unique set of compounds known as bryostatins.
Understanding how the bryozoan allows these symbionts to colonize instead of fending them off is an interesting puzzle to solve.
The starting point in this tutorial is a set of protein models of MyD88 from other organisms.
-
protein sequences collected from UniProt (we will cover this in an upcoming tutorial)
-
Here is a paper that describes more about innate immunity genes such as MyD88.
We'll compare the ability to detect this protein using BLAST and HMMER
cd example_data/
ls
We have a set of 16 MyD88 proteins MyD88.faa
and the CANU-filtered Bugula neritina proteome Bugula.pep
Make a BLAST database
makeblastdb -in Bugula.pep -dbtype prot
BLAST MyD88 proteins against the B. neritina genome to see if we get a result.
blastp -query MyD88.faa -db Bugula.pep -outfmt 6 -evalue 1e-5 -out blastp.MyD88.Bugula.pep.outfmt6
This should be a rather short blast operation, because we only have a few sequences.
Did we get any results at all?
less blastp.MyD88.Bugula.pep.outfmt6
To exit from the "less" command, push lower case "q" on your keyboard.
CLICK ME
#### The file is empty
Let's try a probabilistic approach with HMMER (HMMER: biosequence analysis using profile hidden Markov models)
HMMER is a program that uses hidden Markov profiles to predict the probability that two proteins are related.
The first step is to give it a set of data for calculating probabilities.
There are two common sources for reference data sets for HMMER:
1.) precomputed hmm profiles in public databases (e.g. Pfam)
2.) manually created multiple sequence alignments
We are going to do approach #2 first.
We have a set of data already that we can create our own hmm profile from, the MyD88 set of data.
First we have to align the sequences. It is important to eyeball them at some point to make sure that they are not composed of non-overlapping, too long or too short sequences. It is better to be more conservative in the number of sequences rather than trying to include everything. Some databases contain mislabeled sequences that can confound your search for proteins.
muscle -in MyD88.faa -out MyD88.msa
Take a look inside the file to see if the alignments make sense. Kick out any sequences that are making the sequences less cohesive.
less -S MyD88.msa
There are two key parts of the alignment that suggest it might not be great quality - the string a gaps in most sequences at the start and end of the alignment. We can removed those sequences and re-run muscle.
rm MyD88.msa
nano MyD88.faa
Then find the sequences tr|M4I212|M4I212_CRAGI
and tr|A0A2B4RVP2|A0A2B4RVP2_STYPI
and use Ctrl+k
to cut the lines (header + sequence) for both sequences. Save and close. Re-run the multiple sequence alignment.
muscle -in MyD88.faa -out MyD88.msa
Then, we are going to use hmmbuild to analyze the occurrence of each peptide in relation to the position in the aligned proteins.
hmmbuild MyD88.hmm MyD88.msa
The final step to create a mathematical matrix that shows the probability of each transition based on the data provided.
hmmpress MyD88.hmm
ls
The MyD88.htm is now a searchable hmm profile that we can use with our genome to find proteins which are probably related.
hmmscan --domtblout Bugula.MyD88.domtblout MyD88.hmm Bugula.pep
Look at the results in the Bugula.MyD88.domtblout
less Bugula.MyD88.domtblout
Compare hmmscan to hmmsearch
hmmsearch --tblout Bugula.MyD88.tblout MyD88.hmm Bugula.pep
less Bugula.MyD88.tblout
Are there any potential matches to our hmm model?
Building HMM models for specific genes of interest can be time consuming. In this example, the sequences were collected and screened prior to the start of the tutorial. There are numerous HMM databases to switch to which can be used to annotate multiple proteins at once. Specific HMM databases will be covered in future lessons, but one example is Pfam
(myBinder does not want to download the full Pfam database, so we'll skip this part of the example)
hmmsearch --domtblout Bugula.Pfam.domtblout Pfam-A.hmm Bugula.pep
Things to consider:
1.) How much does the reference database influence our ability to annotate a genome?
2.) Which approach was more sensitive - local alignments with BLAST or hmm profile search with HMMER?
3.) What roles do these approaches play in a robust annotation pipeline?