<a href="https://colab.research.google.com/github/daniilprigozhin/ProteinFamily/blob/main/Protein_Family_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Protein family analysis in Google Colab

This project shows how to construct a protein family phylogeny, using the NLR family as an example. The principal advantage of Colab is the ability to follow along and modify the code as needed. We have implemented similar pipelines using Snakemake for execution on local machines.


##Colab basics
To run a section of code
* Hit play button OR
* Hit Cmd/Ctrl + Enter

To edit a section of code/text
* Double click the code/text window

##Step 0: Install the software 
We'll be using

HMMER with easel tools: 
http://hmmer.org

Prank: 
http://wasabiapp.org/software/prank/

Belvu (not in Colab): 
https://www.sanger.ac.uk/resources/software/seqtools/

RAxML: 
https://cme.h-its.org/exelixis/web/software/raxml/


In [None]:
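## This block takes ~4 minutes
# Install conda inside Colab. condacolab restarts the kernel once after installing,
# so if the conda commands below did not run, re-run this cell.
!pip install -q condacolab
import condacolab
condacolab.install()
# Command-line tools from the bioconda channel
!conda install -c bioconda hmmer 
!conda install -c bioconda easel
#!conda install -c bioconda::snakemake
!conda install -c bioconda raxml-ng
#!conda update raxml-ng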


##Step 1: Load Proteome and Domain Model 

###Load proteome of your species of interest 
Here we will use protein models from Van de Weyer et al, 2019 Cell.


**Phytozome** is a good source for plant protein models. 
Example: rice
https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Osativa
Click Bulk data download and log in; you will find yourself at the FTP site.
The proteome is under /annotation/Osativa_323_v7.0.protein.fa.gz 

**UniProt** has convenient one-protein-per-gene proteomes available for download for "all things bright and beautiful".

To load proteomes into colab one can either place them in **GitHub** and use git clone, or use the Files -> Upload to Session Storage (click **folder** icon on the left and then click **paper with up arrow** icon).

In [1]:
%rm -rf ProteinFamily/
!git clone https://github.com/daniilprigozhin/ProteinFamily.git
!ls

Cloning into 'ProteinFamily'...
remote: Enumerating objects: 129, done.


###Load a statistical model for your domain of interest

Since all NLRs have a conserved NB-ARC domain, you can use this domain’s HMM to extract NB-ARC-containing proteins from a plant proteome and align them using the HMM as a template. Go to http://pfam.xfam.org/family/NB-ARC 
and download http://pfam.xfam.org/family/PF00931/hmm

To load this hmm file into colab:

In [None]:
!wget -O NB-ARC.hmm http://pfam.xfam.org/family/PF00931/hmm 


####What is HMM?
If you are curious, look inside the .hmm file to see how the domain is described as a statistical model of amino acid probabilities at each position of the domain.
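
As a rough illustration of what the file holds, the sketch below reads the match-state emission lines of a HMMER3-format profile and converts the stored scores back to probabilities. HMMER3 text files store probabilities as negative natural logarithms, so `exp(-score)` recovers them. The profile text `TOY_HMM` is a made-up two-position DNA example and the function name `match_emissions` is hypothetical; for a real model you would pass in the contents of the downloaded `.hmm` file instead.

```python
import math

# Toy HMMER3-style profile (DNA alphabet, 2 match states), made up for illustration.
TOY_HMM = """\
HMMER3/f [toy example]
NAME  toy
LENG  2
ALPH  DNA
HMM          A        C        G        T
            m->m     m->i     m->d     i->m     i->i     d->m     d->d
  COMPO   1.38629  1.38629  1.38629  1.38629
          1.38629  1.38629  1.38629  1.38629
          0.01005  5.29832  5.29832  0.69315  0.69315  0.00000        *
      1   0.10536  3.40120  3.40120  3.40120      1 a - - -
          1.38629  1.38629  1.38629  1.38629
          0.01005  5.29832  5.29832  0.69315  0.69315  0.69315  0.69315
      2   3.40120  3.40120  0.10536  3.40120      2 g - - -
          1.38629  1.38629  1.38629  1.38629
          0.01005  5.29832        *  0.69315  0.69315  0.00000        *
//
"""


def match_emissions(hmm_text):
    """Return (alphabet, list of per-position {symbol: probability} dicts)."""
    alphabet, positions, in_body = [], [], False
    for line in hmm_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "HMM":
            # The 'HMM' line lists the alphabet symbols in column order.
            alphabet, in_body = tokens[1:], True
            continue
        # Match-emission lines start with the integer node number.
        if in_body and tokens[0].isdigit():
            scores = tokens[1:1 + len(alphabet)]
            # '*' marks a probability of zero; other values are -ln(p).
            probs = [0.0 if s == "*" else math.exp(-float(s)) for s in scores]
            positions.append(dict(zip(alphabet, probs)))
    return alphabet, positions


alphabet, positions = match_emissions(TOY_HMM)
for i, probs in enumerate(positions, start=1):
    best = max(probs, key=probs.get)
    print(i, best, round(probs[best], 2))
```

For a protein model such as NB-ARC the alphabet line carries the 20 amino acids, and the same loop would report the most probable residue at each domain position.
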
####What if you need a new HMM?
What if your protein of interest does not have a pre-built HMM describing it? You can build an HMM yourself using the `hmmbuild` program in HMMER.

What if the HMM at Pfam does not adequately describe your protein family of interest? For example, the NB-ARC model above was built from diverse organisms, including not only plants but also animals, bacteria and archaea. The resulting HMM is therefore a best fit to the full diversity of this protein family rather than to plant NLRs specifically.

Most plant NLRs have a conserved NB-ARC domain that is larger than what the Pfam HMM describes. The NB-ARC domains of plant NLRs include additional motifs such as ARC2 and MHD. Therefore, we built a plant-specific HMM that you can download here: 

Bailey et al, Genome Biology 2018, Additional file 16:

In [3]:
!wget -O pbNB-ARC.hmm https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1392-6/MediaObjects/13059_2018_1392_MOESM16_ESM.hmm

--2021-10-01 06:44:42--  https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1392-6/MediaObjects/13059_2018_1392_MOESM16_ESM.hmm
Resolving static-content.springer.com (static-content.springer.com)... 151.101.0.95, 151.101.64.95, 151.101.128.95, ...
Connecting to static-content.springer.com (static-content.springer.com)|151.101.0.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 158213 (155K) [application/octet-stream]
Saving to: ‘pbNB-ARC.hmm’


2021-10-01 06:44:42 (6.66 MB/s) - ‘pbNB-ARC.hmm’ saved [158213/158213]



You can run the alignment steps below with both the Pfam HMM and our HMM and compare the results.
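
One simple way to compare the two runs is to look at which proteins each model recovers. The sketch below is only an illustration: the two FASTA strings are made-up stand-ins for the `.afa` alignments produced in Step 2 with each model, and the helper name `protein_ids` is hypothetical. It assumes sequence names of the form `protein/start-end`, as written by `hmmsearch -A`.

```python
def protein_ids(fasta_text):
    """Collect protein IDs from FASTA headers, dropping the '/start-end' suffix."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            seq_id = line[1:].split()[0]      # first word of the header
            ids.add(seq_id.split("/")[0])     # strip domain coordinates
    return ids


# Made-up alignments standing in for the Pfam-model and plant-model results.
pfam_afa = ">protA/10-250\nMKAL\n>protB/5-240\nMRAL\n"
custom_afa = ">protA/8-300\nMKAL\n>protC/1-290\nMKVL\n"

pfam_ids = protein_ids(pfam_afa)
custom_ids = protein_ids(custom_afa)

print("found by both models:", sorted(pfam_ids & custom_ids))
print("only with Pfam HMM:  ", sorted(pfam_ids - custom_ids))
print("only with custom HMM:", sorted(custom_ids - pfam_ids))
```

With real data you would read each alignment file with `open(...).read()` and pass the text to the same function.
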

As a bonus, you can download any of the curated, functionally annotated NLRs from http://prgdb.crg.eu/wiki/Category:Reference_R-Genes,_manually_curated and include them in the alignment and phylogeny. Place them in the project folder as well.


##Step 2: Align proteins to Model

We will use HMMER to align proteins to the model:

    hmmsearch -E 1e-5 -A <domain>.hmmalign.sto <domain>.hmm <proteome>.faa

In [None]:
!hmmsearch -E 1e-5 -A pbNB-ARC.hmmalign.sto --domtblout pbNB-ARC.hmmalign.tbl ProteinFamily/HMM_models/pbNB-ARC.hmm ProteinFamily/Proteomes/108.aa.fa
!cat pbNB-ARC.hmmalign.sto

`hmmsearch -A` writes the alignment of all hits in Stockholm format; however, for visualisation and tree building we need the alignment in fasta format. 

`esl-alimask --rf-is-mask` removes columns that do not match model

`esl-alimanip --lmin` removes rows that are shorter than a user-defined threshold. The cell below uses 100 aa; a stricter cutoff of 237 aa corresponds to 70% of the model length for our custom NB-ARC HMM.

`esl-reformat` reformats to fasta format. The same tool can also trim the alignment to remove insertions and short sequences.

`-` in easel signals that input to the command will come in from the pipe

Finally `cut` and `tr` remove extra fields in the protein names
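
To make the effect of the length filter and the name clean-up concrete, here is a small Python sketch of the same idea applied to an aligned FASTA string. It is not a replacement for the Easel tools: the function names `read_fasta` and `clean_and_filter`, the toy sequences and the cutoff of 5 residues are all made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def clean_and_filter(records, min_residues):
    """Keep rows with enough non-gap residues and shorten their names."""
    kept = []
    for header, seq in records:
        short_name = header.split(" ")[0]               # like cut -d ' ' -f 1
        n_residues = sum(1 for c in seq if c not in "-.")  # ignore gap symbols
        if n_residues >= min_residues:                  # like --lmin
            kept.append((short_name, seq))
    return kept


toy_alignment = """\
>seq1/12-300 some description
MKV--LIGE
>seq2/5-40
M----L---
"""

filtered = clean_and_filter(read_fasta(toy_alignment), min_residues=5)
for name, seq in filtered:
    print(name, seq)
```

Here `seq2` is dropped because only two of its aligned positions carry residues, while `seq1` keeps its coordinates but loses the free-text description.
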

In [None]:
!esl-alimask --rf-is-mask pbNB-ARC.hmmalign.sto | esl-alimanip --lmin 100 -|esl-reformat afa - |cut -d ' ' -f 1 |tr -d ' ' > pbNB-ARC.hmmalign.afa
!cat pbNB-ARC.hmmalign.afa

##Step 3: Phylogeny with RAxML-NG

We are now ready to build a tree of the protein domains to visualise how they may be related evolutionarily. For this we will use RAxML-NG to build a bootstrapped maximum likelihood tree. This will take >2 hours but will actually work! Skip ahead to load precomputed results.

In [None]:
!raxml-ng --all --bs-trees 100 --model JTT --prefix pbNB-ARC --msa pbNB-ARC.hmmalign.afa 


RAxML-NG v. 1.0.3 released on 21.07.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Xeon(R) CPU @ 2.20GHz, 1 cores, 12 GB RAM

RAxML-NG was called at 01-Oct-2021 06:48:07 as follows:

raxml-ng --all --bs-trees 100 --model JTT --prefix pbNB-ARC --msa pbNB-ARC.hmmalign.afa

Analysis options:
  run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
  start tree(s): random (10) + parsimony (10)
  bootstrap replicates: 100
  random seed: 1633070887
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: coarse-grained (auto), NONE/sequential

[00:00:00] Reading a

In [None]:
## If you need to get the precomputed tree, uncomment and run this line:
#!cp ProteinFamily/Colab_Results/pbNB-ARC.raxml.* .
!ls

##Step 4: Saving Results
You can connect to ***your own*** Google Drive and save any results you'd like to keep. 

In [None]:
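!ls
from google.colab import drive
drive.mount('/content/drive')
# Files in your own Drive live under MyDrive; create the target folder if it is missing
!mkdir -p /content/drive/MyDrive/Colab_Results/
!cp pbNB-ARC* /content/drive/MyDrive/Colab_Results/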

In [16]:
!ls /content/drive/MyDrive/Colab_Results/

pbNB-ARC.hmmalign.afa	       pbNB-ARC.raxml.lastTree.TMP
pbNB-ARC.hmmalign.sto	       pbNB-ARC.raxml.log
pbNB-ARC.hmmalign.tbl	       pbNB-ARC.raxml.mlTrees
pbNB-ARC.noins.sto	       pbNB-ARC.raxml.mlTrees.TMP
pbNB-ARC.raxml.bestModel       pbNB-ARC.raxml.rba
pbNB-ARC.raxml.bestTree        pbNB-ARC.raxml.reduced.phy
pbNB-ARC.raxml.bootstraps      pbNB-ARC.raxml.startTree
pbNB-ARC.raxml.bootstraps.TMP  pbNB-ARC.raxml.support
pbNB-ARC.raxml.ckp


##Step 5: The fun part - annotating your tree
In this section we will annotate Pfam domains in our proteins of interest and call a couple of R scripts to produce an annotation file for the iTOL tree viewer. The motivation is to check what other domains are present in our proteins of interest (which all share the NB-ARC domain). 
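
The per-domain tables written by `hmmsearch --domtblout` are whitespace-delimited, with 22 fixed columns followed by a free-text description, so a first look at domain content can be sketched in a few lines of Python. The example below is an illustration only and does not reproduce the R scripts: the table rows are made up, and the function name `domain_architectures` and the 1e-3 cutoff on the independent E-value are assumptions.

```python
from collections import defaultdict

# Made-up rows in domtblout layout (target = protein, query = domain model).
TOY_DOMTBL = """\
# toy per-domain hits table
protA - 900 NB-ARC PF00931 252 1e-60 200.1 0.1 1 1 1e-63 1e-60 199.8 0.0 2 250 180 440 179 441 0.95 toy description
protA - 900 TIR - 176 1e-30 100.0 0.0 1 1 1e-33 1e-30 99.5 0.0 1 170 10 160 9 165 0.97 toy description
protB - 850 NB-ARC PF00931 252 1e-55 190.0 0.1 1 1 1e-58 1e-55 189.0 0.0 2 250 150 410 149 412 0.95 toy description
protB - 850 LRR_8 - 61 0.5 5.0 0.0 1 1 0.4 0.5 4.8 0.0 1 60 500 560 499 561 0.80 toy description
"""


def domain_architectures(domtbl_text, max_i_evalue=1e-3):
    """Map each protein to its domains, ordered by position in the protein."""
    hits = defaultdict(list)
    for line in domtbl_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                           # skip blanks and comment lines
        fields = line.split(None, 22)          # keep the description as one field
        protein, domain = fields[0], fields[3]
        i_evalue = float(fields[12])           # independent E-value of this domain
        ali_from = int(fields[17])             # start of the hit on the protein
        if i_evalue <= max_i_evalue:
            hits[protein].append((ali_from, domain))
    return {p: [d for _, d in sorted(doms)] for p, doms in hits.items()}


architectures = domain_architectures(TOY_DOMTBL)
for protein, domains in sorted(architectures.items()):
    print(protein, " | ".join(domains))
```

In this toy table the weak LRR_8 hit on `protB` falls above the cutoff and is dropped, while `protA` is reported with its TIR domain ahead of NB-ARC.
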

Step 5.1: Load Pfam (supplement with your favorite domains using `cat`). Then prepare the local Pfam database for searching with `hmmpress`.

In [None]:
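!wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
!gunzip Pfam-A.hmm.gz
# Optional: append your own HMMs to Pfam-A.hmm with cat before pressing.
# The cat command is incomplete here, so it is left commented out.
#!cat P  
# Build the binary index files for the local Pfam database
!hmmpress Pfam-A.hmm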

Step 5.2: Use `hmmsearch` against Pfam to find domains in our proteins of interest

In [None]:
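# Index the proteome so sequences can be fetched by name
!esl-sfetch --index ProteinFamily/Proteomes/108.aa.fa
# Collect protein IDs from the alignment, dropping '>' and the '/start-end' suffix
!grep '>' pbNB-ARC.hmmalign.afa|cut -f 1 -d '/'|tr -d '>' >108.pbNB-ARC.list
# Fetch the full-length sequences for those IDs
!cat 108.pbNB-ARC.list|esl-sfetch -f ProteinFamily/Proteomes/108.aa.fa - > 108.pbNB-ARC.fulllength.fa
!head 108.pbNB-ARC.fulllength.fa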

In [None]:
!hmmsearch --domtblout 108.pbNB-ARC.Pfam.tbl Pfam-A.hmm 108.pbNB-ARC.fulllength.fa

In [None]:
!head pbNB-ARC.hmmalign.tbl
!head 108.pbNB-ARC.Pfam.tbl
!cat pbNB-ARC.hmmalign.tbl 108.pbNB-ARC.Pfam.tbl >108.pbNB-ARC.Pfam.joined.tbl

In [45]:
!cp pbNB-ARC* /content/drive/MyDrive/Colab_Results/
!cp 108* /content/drive/MyDrive/Colab_Results/

In [52]:
!Rscript scripts/reduce_Pfam.R -i 108.pbNB-ARC.Pfam.joined.tbl -o 108.pbNB-ARC.Pfam.reduced.tbl -e 1e-3 -f 0.3 -a 10


Fatal error: cannot open file 'scripts/reduce_Pfam.R': No such file or directory


In [49]:
!Rscript scripts/DomainDiagrams.R -i 108.pbNB-ARC.Pfam.reduced.tbl -f .fasta -a .afa


Fatal error: cannot open file 'scripts/DomainDiagrams.R': No such file or directory
