<a href="https://colab.research.google.com/github/daniilprigozhin/ProteinFamily/blob/main/Protein_Family_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Protein Family Analysis in Google Colab

This project shows how to construct a protein family phylogeny, using the plant NLR immune receptor family as an example. The principal advantage of Colab is the ability to follow along and modify the code as needed. We have implemented similar pipelines using Snakemake for execution on local machines. Try bringing in your own HMM and sequence database and see how far you can go. (As usual, sequence names in .fasta files are a pain.)


##Colab basics
To run a section of code
* Hit play button OR
* Hit Cmd/Ctrl + Enter

To edit a section of code/text
* Double click the code/text window

##Step 0: Install the software 
We'll be using

HMMER with easel tools: 
http://hmmer.org


RAxML-NG: 
https://github.com/amkozlov/raxml-ng

iTOL:
https://itol.embl.de


In [None]:
##This block takes 4-5 minutes
!pip install -q condacolab
import condacolab
condacolab.install()
!conda install -c bioconda hmmer 
!conda install -c bioconda easel
!conda install -c bioconda raxml-ng


##Step 1: Load Proteome and Domain Model 

###Load proteome of your species of interest 
Here we will use protein models from [Van de Weyer et al, 2019 Cell](https://doi.org/10.1016/j.cell.2019.07.038).
The ProteinFamily/Proteomes subdirectory of the GitHub repo below contains all proteomes from the **Arabidopsis pan-NLRome**, but we will only process proteins from one ecotype to speed things up.

**Phytozome** is a good source for plant protein models. 
Example: rice
https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Osativa
Click Bulk data download and log in; you will find yourself at the FTP site.
The proteome is under /annotation/Osativa_323_v7.0.protein.fa.gz 

**UniProt** has convenient one-protein-per-gene proteomes available for download for "all things bright and beautiful".

To load proteomes into Colab, either place them in **GitHub** and use git clone, or use Files -> Upload to Session Storage (click the **folder** icon on the left and then the **paper with up arrow** icon). See below for mounting your Google Drive in Colab.

In [None]:
%rm -rf ProteinFamily/
!git clone https://github.com/daniilprigozhin/ProteinFamily.git
!ls


###Load a statistical model for your domain of interest

Since all NLRs have a conserved NB-ARC domain, you can use this domain's HMM to extract the proteins that contain it from a plant proteome and align them using the HMM as a template. Go to http://pfam.xfam.org/family/NB-ARC 
and download http://pfam.xfam.org/family/PF00931/hmm

To load this hmm file into colab:

In [None]:
!wget -O NB-ARC.hmm http://pfam.xfam.org/family/PF00931/hmm 


####What is an HMM?
If you are curious, look inside the .hmm file to see how the domain is described as a statistical model of amino acid probabilities at each position of the domain.
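
If you would rather peek at the file from Python, here is a minimal sketch that collects the header fields of a HMMER3 text profile (everything before the per-position table). The function name `read_hmm_header` and the toy header values are made up for illustration; they are not taken from the real NB-ARC model.

```python
def read_hmm_header(lines):
    """Collect header fields (NAME, LENG, ALPH, ...) from a HMMER3 text profile."""
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        # The header ends where the per-position table starts (the "HMM" line).
        if line.startswith("HMM ") or line.strip() == "//":
            break
        parts = line.split(None, 1)
        if len(parts) == 2:
            header.setdefault(parts[0], parts[1].strip())
    return header

# Toy header with made-up values, only to show the layout.
example = """HMMER3/f [toy example]
NAME  toy_domain
LENG  120
ALPH  amino
HMM          A        C        D
""".splitlines()

info = read_hmm_header(example)
print(info["NAME"], info["LENG"])

# On the real file (assuming NB-ARC.hmm was downloaded above):
# with open("NB-ARC.hmm") as fh:
#     print(read_hmm_header(fh))
```

`LENG` is the number of match positions in the model, which is handy later when you choose a minimum length cutoff.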
####What if you need a new HMM?
What if your protein of interest does not have a pre-built HMM describing it? You can build an HMM yourself from a multiple sequence alignment using the `hmmbuild` program in HMMER.

What if the HMM at Pfam does not adequately describe your protein family of interest? For example, the NB-ARC model above was built from diverse organisms, including not only plants but also animals, bacteria and archaea. The resulting HMM is therefore a best fit to the full diversity of this protein family rather than to plant NLRs specifically.

Most plant NLRs have a conserved NB-ARC domain that is larger than what the Pfam HMM describes. The NB-ARC of plant NLRs includes additional motifs such as ARC2 and MHD. We therefore built a plant-specific HMM that you can download here: 

Bailey et al, Genome Biology 2018, Additional file 16:

In [None]:
!wget -O pbNB-ARC.hmm https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1392-6/MediaObjects/13059_2018_1392_MOESM16_ESM.hmm

You can do the alignment steps below with both the Pfam HMM and our HMM and compare the results.

As a bonus, you can download any of the curated, functionally annotated NLRs from http://prgdb.crg.eu/wiki/Category:Reference_R-Genes,_manually_curated and include them in the alignment and phylogeny. Place them in the project folder as well.


##Step 2: Align proteins to Model

We will use HMMER to align proteins to the model:  

    hmmsearch -E 1e-5 -A <domain>.hmmalign.sto <domain>.hmm <proteome>.faa
                                                  

In [None]:
!hmmsearch -E 1e-5 -A pbNB-ARC.hmmalign.sto --domtblout pbNB-ARC.hmmalign.tbl ProteinFamily/HMM_models/pbNB-ARC.hmm ProteinFamily/Proteomes/108.aa.fa
!cat pbNB-ARC.hmmalign.sto

`hmmsearch -A` saves an alignment of all significant hits in Stockholm format; for visualisation and tree building, however, we need the alignment in fasta format. 

`esl-alimask --rf-is-mask` removes columns that do not match model (insertions)

`esl-alimanip --lmin` removes rows that are shorter than a user-defined threshold (in this case 100aa = 30% of the model length for our custom NB-ARC HMM)

`esl-reformat` reformats to fasta format.

`-` in easel signals that input to the command will come in from the pipe

Finally `cut` and `tr` remove extra fields in the protein names
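
To make the length filter and the name clean-up concrete, here is a small Python sketch of the same idea applied to an aligned fasta. It is not a replacement for the easel tools: the helper names are hypothetical, and it assumes the length threshold counts residues only (gap characters excluded).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from fasta-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def filter_by_residue_count(records, min_residues):
    """Keep rows with at least min_residues letters; trim names to the first field."""
    kept = []
    for header, seq in records:
        residues = sum(1 for ch in seq if ch.isalpha())  # gaps are not letters
        if residues >= min_residues:
            kept.append((header.split()[0], seq))
    return kept

# Toy alignment: protA has 9 residues, protB only 2.
toy = """>protA/10-20 extra description
ACDE-FG
HIK
>protB/5-9
A-----
-C--
"""
kept = filter_by_residue_count(parse_fasta(toy), min_residues=5)
print(kept)
```

With the real alignment you would use `min_residues=100`, matching the `--lmin 100` cutoff in the cell below.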

In [None]:
!esl-alimask --rf-is-mask pbNB-ARC.hmmalign.sto | esl-alimanip --lmin 100 -|esl-reformat afa - |cut -d ' ' -f 1 |tr -d ' ' > pbNB-ARC.hmmalign.afa
!cat pbNB-ARC.hmmalign.afa

##Step 3: Phylogeny with RAxML-NG

We are now ready to build a tree of the protein domains to visualise how they may be related evolutionarily. For this we will use RAxML-NG to build a bootstrapped maximum likelihood tree. This will take more than 2 hours, but it will actually work! Skip ahead to the next code block to load precomputed results.

In [None]:
##Several hours depending on --lmin cutoff (i.e. number of sequences) in the previous step
!raxml-ng --all --bs-trees 100 --model JTT --prefix pbNB-ARC --msa pbNB-ARC.hmmalign.afa 

In [None]:
## To use the precomputed tree instead, run this cell:
!cp ProteinFamily/Colab_Results/pbNB-ARC.raxml.* .
!ls

##Step 4: Saving Results
You can connect to ***your own*** Google Drive and save any results you'd like to keep. 

In [None]:
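!ls
from google.colab import drive
drive.mount('/content/drive')
##Create the results folder if it does not exist yet
!mkdir -p /content/drive/MyDrive/Colab_Results/
!cp pbNB-ARC* /content/drive/MyDrive/Colab_Results/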

In [None]:
!ls /content/drive/MyDrive/Colab_Results/

##Step 5: The fun part - annotating your tree
In this section we will annotate Pfam domains in our proteins of interest and will call a couple of R scripts to produce an annotation file for iTOL tree viewer. The motivation is to check what other domains are present in our proteins of interest (that all share the NB-ARC domain). 
There are many potential annotation tracks that iTOL supports, allowing you to bring in just about any kind of information on your system, such as gene expression, epigenetic states, gene structure, etc. 

##Step 5.1: Load Pfam 
We will start by loading the current version of Pfam. Then we'll supplement it with two custom HMMs, one for NB-ARC and one for an N-terminal coiled-coil domain that is typical of NLRs but is not yet in Pfam. Finally, we prepare the local Pfam database for searching with `hmmpress`.

In [None]:
##less than 2 minutes
!wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
!gunzip Pfam-A.hmm.gz
!cat Pfam-A.hmm ProteinFamily/HMM_models/* > Pfam-A.plus.hmm
!hmmpress Pfam-A.plus.hmm

##Step 5.2: find all other Pfam domains in your proteins of interest
Before we begin, we need to select the proteins of interest out of the available proteome. `grep '>'` gets the names from our .afa alignment file. `cut` removes the NB-ARC coordinates, `tr` removes '>', and `sort` and `uniq` take care of duplicates.
There is a perl script in here that's been perfect since the dawn of time. I tried to do this simple task with easel tools and failed.
The two `wc` commands check that the number of expected sequences matches the number retrieved.
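
For readers who prefer Python to perl, the same selection can be sketched as below. This is an illustration with hypothetical helper names, not a port of `K-get_fasta_from_ids.pl`; it assumes alignment names look like `protein/start-end` and that the first word of each proteome header is the protein ID.

```python
def domain_hit_ids(alignment_headers):
    """Protein IDs from '>protein/start-end' headers, de-duplicated and sorted."""
    ids = set()
    for header in alignment_headers:
        name = header.lstrip(">").split()[0]  # drop '>' and any description
        ids.add(name.split("/")[0])           # drop the '/start-end' coordinates
    return sorted(ids)

def fetch_by_id(fasta_text, wanted):
    """Return the fasta records whose first header word is in wanted."""
    wanted = set(wanted)
    out, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            out.append(line)
    return "\n".join(out)

# Toy data: p1 has two NB-ARC hits, p3 has none.
headers = [">p1/10-300", ">p2/5-280", ">p1/400-650"]
proteome = ">p1 some description\nMKV\nLLA\n>p3\nMMM\n>p2\nMAA"

ids = domain_hit_ids(headers)
fetched = fetch_by_id(proteome, ids)
# Same sanity check as the two wc commands: expected vs retrieved.
print(len(ids), fetched.count(">"))
```

If the two numbers differ, the usual culprit is a mismatch between the names in the alignment and the names in the proteome file.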

In [None]:

!grep '>' pbNB-ARC.hmmalign.afa|cut -f 1 -d '/'| tr -d '>' |sort |uniq >108.pbNB-ARC.list
!wc 108.pbNB-ARC.list
!ProteinFamily/scripts/K-get_fasta_from_ids.pl -f ProteinFamily/Proteomes/108.aa.fa -i 108.pbNB-ARC.list > 108.pbNB-ARC.fulllength.fa
!grep '>' 108.pbNB-ARC.fulllength.fa|wc 

### Time to do the big domain search.
This runs very quickly for what it is. Domains in the database are alphabetical, which should give us a sense of progress.

In [None]:
##2-3 minutes
!hmmsearch --domtblout 108.pbNB-ARC.Pfam.tbl Pfam-A.plus.hmm 108.pbNB-ARC.fulllength.fa

###How many hits did we get?
Some of these overlap. We'll select the highest scoring hits in a later step.
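
If you want to look at the hits in Python rather than just count lines, here is a minimal sketch of a `--domtblout` reader. It relies on the standard HMMER3 column order (22 whitespace-separated columns followed by a free-text description); the function name and the toy lines are made up. With `hmmsearch`, the target is the protein and the query is the HMM.

```python
from collections import Counter

def parse_domtblout(lines):
    """Parse HMMER3 --domtblout lines into a list of per-domain hit dicts."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split(None, 22)  # 22 fixed columns, then the description
        hits.append({
            "protein": f[0],
            "protein_len": int(f[2]),
            "domain": f[3],
            "hmm_len": int(f[5]),
            "i_evalue": float(f[12]),
            "score": float(f[13]),
            "hmm_from": int(f[15]),
            "hmm_to": int(f[16]),
            "env_from": int(f[19]),
            "env_to": int(f[20]),
        })
    return hits

# Two toy lines with made-up numbers, only to show the layout.
toy = [
    "# target name  accession  tlen  query name ...",
    "p1 - 900 NB-ARC - 250 1e-50 170.0 0.1 1 1 1e-53 2e-50 169.5 0.1 3 248 160 410 158 412 0.95 toy protein",
    "p1 - 900 DomB - 100 1e-05 25.0 0.0 1 1 1e-08 3e-05 24.0 0.0 10 60 150 300 148 305 0.80 toy protein",
]
hits = parse_domtblout(toy)
per_protein = Counter(h["protein"] for h in hits)
print(len(hits), per_protein)
```

On the real table you could pass `open("108.pbNB-ARC.Pfam.tbl")` instead of the toy list.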

In [None]:
!cat 108.pbNB-ARC.Pfam.tbl|grep -v '#'|wc

### Saving results to Drive

In [None]:
!cp pbNB-ARC* /content/drive/MyDrive/Colab_Results/
!cp 108* /content/drive/MyDrive/Colab_Results/

## Step 5.3 Remove weak hits that overlap better ones
The first `tr` command collapses repeated spaces in the HMMER output. Usually read_delim() in R does this, but not in Colab for some reason.
There are many ways to run R code in Colab, including executing blocks in R; here we call the R code as a standalone script.

On first execution R will be installing a bunch of libraries, which will take some time. This is a drawback of Colab, as on a local machine you'd only run this once. The working part of the script runs very quickly.

The options:
* `-e` disregards poorly scoring HMM hits by e-value,
* `-f` disregards HMM hits by model coverage,
* `-a` sets the maximum allowed overlap, as HMMs can be greedy and bite into neighboring territory.
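
To show what these three options do in principle, here is a greedy filter sketched in Python. It is not the code in `reduce_pfam.R`, which may work differently; the function names, the dict keys and the assumption that the overlap limit is counted in residues are all mine.

```python
def overlap(a, b):
    """Number of residues shared by two hits on the same protein."""
    return max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]) + 1)

def reduce_hits(hits, max_evalue=1e-3, min_coverage=0.3, max_overlap=10):
    """Drop weak hits, then keep the best-scoring hits that barely overlap."""
    passing = [h for h in hits
               if h["evalue"] <= max_evalue and h["coverage"] >= min_coverage]
    kept = []
    # Visit hits from best to worst score; a hit survives only if it does not
    # overlap an already accepted hit on the same protein by more than the limit.
    for hit in sorted(passing, key=lambda h: h["score"], reverse=True):
        clash = any(k["protein"] == hit["protein"] and overlap(k, hit) > max_overlap
                    for k in kept)
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: (h["protein"], h["start"]))

# Toy hits with made-up numbers; coverage = fraction of the HMM that is matched.
toy = [
    {"protein": "p1", "domain": "NB-ARC", "start": 160, "end": 410, "score": 170, "evalue": 1e-50, "coverage": 0.98},
    {"protein": "p1", "domain": "DomB", "start": 150, "end": 300, "score": 25, "evalue": 1e-5, "coverage": 0.50},
    {"protein": "p1", "domain": "DomC", "start": 405, "end": 460, "score": 40, "evalue": 1e-8, "coverage": 0.90},
    {"protein": "p1", "domain": "DomD", "start": 10, "end": 150, "score": 90, "evalue": 1e-20, "coverage": 0.20},
    {"protein": "p1", "domain": "DomE", "start": 600, "end": 700, "score": 12, "evalue": 0.5, "coverage": 0.80},
]
kept_names = [h["domain"] for h in reduce_hits(toy)]
print(kept_names)
```

In the toy data DomD fails the coverage cutoff, DomE fails the e-value cutoff, and DomB loses to NB-ARC because they share far more than 10 residues. DomC survives because it bites only 6 residues into NB-ARC.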

In [None]:
!tr -s ' ' <108.pbNB-ARC.Pfam.tbl > 108.pbNB-ARC.Pfam.ws.tbl
!Rscript ProteinFamily/scripts/reduce_pfam.R -i 108.pbNB-ARC.Pfam.ws.tbl -o 108.pbNB-ARC.Pfam.reduced.tbl -e 1e-3 -f 0.3 -a 10

##Step 5.4 Produce the annotation track for iTOL
R has libraries for drawing and annotating trees that are very good and getting better. iTOL's main advantage is the ease of sharing.
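
For orientation, here is a rough Python sketch of what a protein domain annotation file for iTOL can look like. It assumes the `DATASET_DOMAINS` layout as I understand it from iTOL's template files (one line per leaf: ID, protein length, then `SHAPE|START|END|COLOR|LABEL` entries). Check it against the current template before relying on it. It is not the output of `DomainDiagrams_sm.R`, and the function name, colors and toy protein are made up.

```python
def itol_domain_dataset(proteins, label="Pfam domains"):
    """Build iTOL DATASET_DOMAINS text from (leaf_id, length, domains) tuples.

    Each domain is (shape, start, end, color, name). Leaf IDs must match the
    leaf names in the tree exactly.
    """
    lines = [
        "DATASET_DOMAINS",
        "SEPARATOR COMMA",
        f"DATASET_LABEL,{label}",
        "COLOR,#000000",
        "DATA",
    ]
    for leaf_id, length, domains in proteins:
        fields = [leaf_id, str(length)]
        for shape, start, end, color, name in domains:
            fields.append(f"{shape}|{start}|{end}|{color}|{name}")
        lines.append(",".join(fields))
    return "\n".join(lines) + "\n"

# Toy protein: a rectangle (RE) for NB-ARC and an ellipse (EL) for an LRR region.
toy = [("p1/160-410", 900, [("RE", 160, 410, "#1f77b4", "NB-ARC"),
                            ("EL", 500, 800, "#2ca02c", "LRR")])]
text = itol_domain_dataset(toy)
print(text)
```

Because the separator is a comma, leaf names and domain labels must not contain commas in this sketch.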

In [None]:
!Rscript ProteinFamily/scripts/DomainDiagrams_sm.R -o 108.iTOL.domains.txt -i 108.pbNB-ARC.Pfam.reduced.tbl -f 108.pbNB-ARC.fulllength.fa -a pbNB-ARC.hmmalign.afa

### Saving results to Drive

In [None]:
!cp pbNB-ARC* /content/drive/MyDrive/Colab_Results/
!cp 108* /content/drive/MyDrive/Colab_Results/

##Step 6 Display Tree with annotation
Use the Files menu on the left to download pbNB-ARC.raxml.support (this is our tree) and 108.iTOL.domains.txt (this is our annotation).


Go to [iTOL](https://itol.embl.de), create an account and upload your tree. Drag the annotation file onto the open tree view and look around.
Here is my copy: [Tree with annotation](https://itol.embl.de/tree/13124385225132441633160125).