<a href="https://colab.research.google.com/github/daniilprigozhin/ProteinFamily/blob/main/Protein_Family_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Protein family analysis in Google Colab

This project will show you how to construct a protein family phylogeny, using the NLR family as an example. The principal advantage of Colab is the ability to follow along and modify the code as needed. We have implemented similar pipelines using Snakemake for execution on local machines.


##Colab basics
To run a section of code
* Hit play button OR
* Hit Cmd/Ctrl + Enter

To edit a section of code/text
* Double click the code/text window

##Step 0: Install the software 
We'll be using

HMMER with easel tools: 
http://hmmer.org

Prank: 
http://wasabiapp.org/software/prank/

Belvu (not in Colab): 
https://www.sanger.ac.uk/resources/software/seqtools/

RAxML: 
https://cme.h-its.org/exelixis/web/software/raxml/


In [1]:
!pip install -q condacolab
import condacolab
condacolab.install()
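# condacolab.install() above restarts the kernel on first use; re-run this cell afterwards.
# -y answers the confirmation prompt so the install does not wait for input.
!conda install -y -c bioconda hmmer
!conda install -y -c bioconda easel
#!conda install -y -c bioconda snakemake
!conda install -y -c bioconda raxml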

✨🍰✨ Everything looks OK!
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.




##Step 1: Load Proteome and Domain Model 

###Load proteome of your species of interest 
Here we will use protein models from Van de Weyer et al, 2019 Cell.


**Phytozome** is a good source for plant protein models. 
Example: rice
https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Osativa
Click Bulk data download and log in to reach the FTP site.
You can find the proteome under /annotation/Osativa_323_v7.0.protein.fa.gz 

**UniProt** has convenient one-protein-per-gene proteomes available for download for "all things bright and beautiful".

To load proteomes into Colab, either place them in **GitHub** and use git clone, or use Files -> Upload to Session Storage (click the **folder** icon on the left and then the **paper with up arrow** icon).

In [2]:
%rm -rf ProteinFamily/
!git clone https://github.com/daniilprigozhin/ProteinFamily.git

Cloning into 'ProteinFamily'...
remote: Enumerating objects: 86, done.
remote: Counting objects: 100% (86/86), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 86 (delta 58), reused 80 (delta 57), pack-reused 0
Unpacking objects: 100% (86/86), done.
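
After cloning, it can be worth checking that each proteome file contains what you expect. The sketch below counts sequences per FASTA file by counting header lines. The helper names are illustrative, and the `*.fa` pattern and the `ProteinFamily/Proteomes` path are assumptions about the repository layout.

```python
from pathlib import Path

def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def summarize_proteomes(directory, pattern="*.fa"):
    """Map each FASTA file name in `directory` to its number of sequences."""
    root = Path(directory)
    if not root.is_dir():
        return {}  # nothing to report if the folder is absent
    return {p.name: count_fasta_records(p) for p in sorted(root.glob(pattern))}

# Path assumed from the git clone above.
for name, n_seqs in summarize_proteomes("ProteinFamily/Proteomes").items():
    print(name, n_seqs)
```

A file with an unexpectedly low count is usually a sign of an interrupted download or upload.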



###Load a statistical model for your domain of interest

Since all NLRs have a conserved NB-ARC domain, you can use this domain's HMM to extract the proteins that contain it from a plant proteome and to align them, with the HMM serving as a template. Go to http://pfam.xfam.org/family/NB-ARC 
and download http://pfam.xfam.org/family/PF00931/hmm

To load this hmm file into colab:

In [1]:
!wget http://pfam.xfam.org/family/PF00931/hmm 

--2021-09-24 06:05:30--  http://pfam.xfam.org/family/PF00931/hmm
Resolving pfam.xfam.org (pfam.xfam.org)... 193.62.193.83
Connecting to pfam.xfam.org (pfam.xfam.org)|193.62.193.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118379 (116K) [text/plain]
Saving to: ‘hmm’


2021-09-24 06:05:31 (316 KB/s) - ‘hmm’ saved [118379/118379]




####What is HMM?
If you are curious, look inside the .hmm file to see how the domain is described as a statistical model of amino acid probabilities at each position of the domain.
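
As a starting point for looking inside the file, here is a minimal sketch that reads the header fields of a HMMER3 profile (for example `NAME`, `LENG`, `ALPH`) and stops where the per-position score table begins. The function name and the toy header values are made up for illustration.

```python
def read_hmm_header(lines):
    """Collect header fields from a HMMER3 profile file.

    Header lines are 'TAG  value' pairs; the line whose first token is 'HMM'
    starts the per-position score table, so parsing stops there.
    Tags that occur more than once keep only their last value.
    """
    header = {}
    for line in lines:
        tokens = line.split(None, 1)
        if not tokens:
            continue
        tag = tokens[0]
        if tag == "HMM":
            break
        if tag.startswith("HMMER3"):
            header["FORMAT"] = line.strip()
        elif len(tokens) == 2:
            header[tag] = tokens[1].strip()
    return header

# Toy header with made-up values, only to show the layout.
toy = """HMMER3/f [3.3.2 | Nov 2020]
NAME  toy_domain
LENG  3
ALPH  amino
HMM          A        C        D
"""
print(read_hmm_header(toy.splitlines()))
```

For the file downloaded above, the same function can be called on an open file handle, e.g. `read_hmm_header(open("hmm"))`.
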
####What if you need a new HMM?
What if your protein of interest does not have a pre-built HMM describing it? You can build an HMM yourself from a multiple sequence alignment using the `hmmbuild` program in HMMER.

What if the HMM at Pfam does not adequately describe your protein family of interest? For example, the NB-ARC model above was built from diverse organisms, including not only plants but also animals, bacteria and archaea. The resulting HMM is therefore a best fit to the full diversity of this protein family rather than to plant NLRs specifically.

Most plant NLRs have a conserved NB-ARC domain that is larger than what the Pfam HMM describes. The NB-ARC domains of plant NLRs include additional motifs such as ARC2 and MHD. Therefore, we built a plant-specific HMM that you can download here: 

Bailey et al, Genome Biology 2018, Additional file 16:

In [2]:
!wget https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1392-6/MediaObjects/13059_2018_1392_MOESM16_ESM.hmm

--2021-09-24 06:25:13--  https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-018-1392-6/MediaObjects/13059_2018_1392_MOESM16_ESM.hmm
Resolving static-content.springer.com (static-content.springer.com)... 151.101.0.95, 151.101.64.95, 151.101.128.95, ...
Connecting to static-content.springer.com (static-content.springer.com)|151.101.0.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 158213 (155K) [application/octet-stream]
Saving to: ‘13059_2018_1392_MOESM16_ESM.hmm’


2021-09-24 06:25:13 (6.61 MB/s) - ‘13059_2018_1392_MOESM16_ESM.hmm’ saved [158213/158213]



You can do the alignment steps below with both the Pfam HMM and our HMM and compare the results.

As a bonus, you can download any of the curated, functionally annotated NLRs from http://prgdb.crg.eu/wiki/Category:Reference_R-Genes,_manually_curated and include them in the alignment and phylogeny. Place them in the project folder as well.


##Step 2: Align proteins to Model

We will use HMMER to search the proteome with the model and save an alignment of the significant hits:  

    hmmsearch -E 1e-5 -A <domain>.hmmalign.sto <domain>.hmm <proteome>.faa
                                                    
The `-A` option of hmmsearch saves the alignment of hits to the model in Stockholm format. For visualisation and tree building, however, we need the alignment in FASTA format, which `esl-reformat` produces. The same tool can also trim the alignment to remove insertions.

    esl-reformat -o <domain>.hmmalign.fa afa <domain>.hmmalign.sto
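
To make the format conversion less of a black box, here is a minimal pure-Python sketch of a Stockholm to aligned-FASTA conversion. It is not a replacement for `esl-reformat`: it only skips markup lines and joins interleaved blocks, copying alignment characters unchanged. The function name and the toy alignment are illustrative.

```python
def stockholm_to_afa(lines):
    """Convert Stockholm alignment lines to aligned-FASTA text.

    Markup lines ('#...') are skipped and interleaved blocks are
    concatenated per sequence name, in order of first appearance.
    """
    order, chunks = [], {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "//":  # end-of-alignment marker
            break
        parts = line.split()
        if len(parts) != 2:
            continue
        name, segment = parts
        if name not in chunks:
            chunks[name] = []
            order.append(name)
        chunks[name].append(segment)
    return "".join(f">{name}\n{''.join(chunks[name])}\n" for name in order)

# Toy two-block alignment, made up for illustration.
toy = """# STOCKHOLM 1.0
seq1  MKV-LA
seq2  MRV.LA
#=GC RF xxx.xx

seq1  GG
seq2  G-
//
"""
print(stockholm_to_afa(toy.splitlines()), end="")
```
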

Belvu alignment viewer 

The alignment can be downloaded and viewed in Belvu, then refined by trimming it to the aligned region and removing gappy columns.

Method

1. `belvu <*hmmalign.fasta> 2> <*hmmalign.fasta.log>`

2. Right click on the dash before the first region of high conservation; in the example I have marked the position with an arrow. 

3. Edit -> Remove columns left of selection

4. Right click on the residue flanking the last region of high conservation, then use Edit -> Remove columns right of selection

Now that we have our region of interest, we can clean up the alignment. 

5. Edit -> Remove gappy columns, with a cut-off of 90% gaps. I have chosen 90% as an arbitrary cut-off, and you can use a more stringent one. However, there is no easy undo in Belvu, so if you remove too much of the variation in your sequences you may have to start the refinement again. 

6. Repeat the above, this time removing gappy sequences.  

Once you are happy with your curation, save the alignment. 

7. File -> Save as `<input>.belvu.fa`
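
Since Belvu is not available in Colab, the gap-filtering steps can be approximated in Python. This is a sketch of the idea rather than a reimplementation of Belvu: it assumes that a column (or sequence) is removed when its gap fraction is strictly above the cut-off, and that both `-` and `.` count as gaps. Function names are illustrative.

```python
GAP_CHARS = set("-.")

def drop_gappy_columns(alignment, max_gap_fraction=0.9):
    """Remove columns whose fraction of gaps exceeds max_gap_fraction.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    if not seqs:
        return {}
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")
    keep = [i for i in range(len(seqs[0]))
            if sum(s[i] in GAP_CHARS for s in seqs) / len(seqs) <= max_gap_fraction]
    return {n: "".join(s[i] for i in keep) for n, s in zip(names, seqs)}

def drop_gappy_sequences(alignment, max_gap_fraction=0.9):
    """Remove sequences whose fraction of gaps exceeds max_gap_fraction."""
    return {n: s for n, s in alignment.items()
            if s and sum(c in GAP_CHARS for c in s) / len(s) <= max_gap_fraction}

toy = {"a": "A-C-", "b": "A-D-", "c": "AEC-"}
print(drop_gappy_columns(toy, max_gap_fraction=0.9))
```

Filtering columns first and sequences second mirrors the order of steps 5 and 6 above.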


2.3 Phylogeny with RAxML

We are now ready to build a tree of the protein domains to visualise how they may be related evolutionarily. For this we are going to use RAxML to build a bootstrapped maximum likelihood tree. 

Method

Prerequisites: raxml

    raxmlHPC-SSE3 -f a -x 1123 -p 2341 -# 100 -m PROTCATJTT -s <input alignment> -n <input alignment>.raxml

Example

    ~/biotools/standard-RAxML-master/raxmlHPC-SSE3 -f a -x 1123 -p 2341 -# 100 -m PROTCATJTT -s athaliana.0.001.NB-ARC_hmmalign.belvu.fa -n athaliana.0.001.NB-ARC_hmmalign.raxml

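| Option | Meaning |
|--------|---------|
| `-f a` | Rapid bootstrap analysis and search for the best-scoring ML tree in one program run |
| `-x` | Random seed for rapid bootstrapping |
| `-#` | Number of bootstrap replicates |
| `-p` | Random seed for the parsimony starting trees used in the maximum likelihood search |
| `-m` | Model used to explain the likelihood of amino acid change |
| `-s` | Multiple sequence alignment (input) |
| `-n` | Run name, appended to the name of every output file |
| `-U` | Memory-saving treatment of gap columns, intended for large alignments with many gaps (not used above) |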
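
When running RAxML from a notebook for several alignments, it can help to assemble the command in one place. The sketch below builds the argument list for the run shown above; the function and parameter names are illustrative, and the binary name `raxmlHPC-SSE3` is assumed to be on the PATH.

```python
import shlex

def raxml_command(alignment, run_name, bootstraps=100, model="PROTCATJTT",
                  x_seed=1123, p_seed=2341, binary="raxmlHPC-SSE3"):
    """Assemble the argument list for a rapid-bootstrap RAxML run (-f a)."""
    return [binary, "-f", "a", "-x", str(x_seed), "-p", str(p_seed),
            "-#", str(bootstraps), "-m", model, "-s", alignment, "-n", run_name]

cmd = raxml_command("athaliana.0.001.NB-ARC_hmmalign.belvu.fa",
                    "athaliana.0.001.NB-ARC_hmmalign.raxml")
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the seeds as explicit arguments makes a run reproducible and easy to repeat with different values.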

2.4 iTOL visualisation

As in the level 1 guide, but choose the file `RAxML_bipartitionsBranchLabels*.raxml`
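
Before uploading the tree, you may want a quick summary of how well supported it is. The sketch below assumes that the branch-labelled RAxML tree stores bootstrap support as `[value]` after each internal branch length; check this against your own file. The 70% threshold is an arbitrary illustration, and the toy tree is made up.

```python
import re

def bootstrap_supports(newick):
    """Return integer support values written as [value] branch labels."""
    return [int(v) for v in re.findall(r"\[(\d+)\]", newick)]

def fraction_supported(newick, threshold=70):
    """Fraction of labelled branches with support >= threshold (0.0 if none)."""
    values = bootstrap_supports(newick)
    return sum(v >= threshold for v in values) / len(values) if values else 0.0

# Toy tree with made-up support values.
toy_tree = "((A:0.1,B:0.2):0.05[95],(C:0.3,D:0.1):0.02[40],E:0.4);"
print(bootstrap_supports(toy_tree))
print(fraction_supported(toy_tree))
```
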

3. Extended functional analysis: 
3.1 Metabolic prediction through KEGG database

Identify your protein of interest in the KEGG database:

https://www.genome.jp/kegg/genes.html

You can enter your gene's NCBI ID at the bottom of the page. Alternatively, you can search with BLAST: https://www.kegg.jp/blastkoala/

3.2 Structure modelling

3.2.2. Homology-based

Phyre2 predicts structures based on the following algorithm (https://www.nature.com/articles/nprot.2015.053)

1)	Homologs are gathered for the entered query sequence.
2)	Homologs are aligned, and the secondary structure is predicted de novo.
3)	An HMM describing the sequence and its homologs, as well as the predicted secondary structure, is built from the alignment.
4)	This HMM is scanned against a database containing HMMs of proteins with known structures.
5)	Top-scoring hits are used to construct a backbone model; loops are modelled separately, and then the amino acid side chains are placed.

http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index

Phyre2 workshop with explanation of results: http://www.sbg.bio.ic.ac.uk/phyre2/workshops/2019/Cambridge/worked_examples.html

If you have high confidence models, you can download and visualize them on your computer using Chimera software: https://www.cgl.ucsf.edu/chimera/

3.2.3 Ab initio structure prediction

You can also try ab initio structure prediction and compare it to results from Phyre2.
https://robetta.bakerlab.org

3.3 Overlaying amino acid diversity on structural model 

Chimera software: https://www.cgl.ucsf.edu/chimera/

Mapping sequence conservation on protein structure tutorial: https://www.cgl.ucsf.edu/chimera/data/tutorials/systems/outline.html
 




 
Appendix - Key terminology 

domain (of a protein)
a distinct functional and structural unit of a protein. Often, corresponds to a region of sequence conservation that can function, fold and evolve independently of the rest of the protein chain.

family (of proteins)
a group of proteins of common evolutionary origin that share similarity in function, sequence and domain structure.

Profile Hidden Markov Model (HMM)
a statistical model that turns a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. See Eddy SR Bioinformatics 1998 https://doi.org/10.1093/bioinformatics/14.9.755

homologs
sequences that share a common evolutionary origin. Beware of confusing sequence similarity with homology: the former can be expressed as % identity, whereas the latter cannot be partial, because two sequences are either homologous (share a common ancestor) or not. 
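
To illustrate that similarity, unlike homology, is a measurable quantity, here is a minimal sketch computing % identity between two rows of an alignment. Several definitions of % identity exist; this one is an assumption that ignores any column where either sequence has a gap.

```python
def percent_identity(seq_a, seq_b, gap_chars="-."):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0

print(percent_identity("ACDE-G", "ACDF-G"))  # 4 matches in 5 compared columns -> 80.0
```
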

orthologs
sequences that share a common evolutionary origin and that were last separated by speciation

paralogs
sequences that share a common evolutionary origin and that were last separated by gene duplication

branch (of a phylogenetic tree) [includes “branch length”]
Lineages of taxonomic units that link nodes within a phylogenetic tree.
In a rooted tree, branches indicate direct transmission of genetic information from the taxonomic unit located at one end (parent) of the branch to the other (child). To determine the direction of transmission, consider removing the branch of interest from the tree to yield two unconnected subtrees, each of which contains only one of the nodes directly linked by the branch of interest. The subtree containing the root node contains the more ancestral of the two nodes linked by the branch of interest; hence, the direction of transfer of genetic information is from this more ancestral node to the other node linked by the branch of interest. For an unrooted tree, it is unknown which of the subtrees contains the root node, hence in unrooted trees the direction of transmission of genetic information is not specified.
If the branch is part of a scaled phylogenetic tree, then a value is associated with the branch that indicates some measure of the difference between the two taxonomic units directly linked by the branch; this value is often referred to as the "branch length". If the tree is "unscaled", no such value is associated with the branch, that is, no branch length is specified or defined.
A branch that links two internal nodes is known as an internal, inner, or interior branch. Branches linking an internal and an external node are referred to as external branches (also terminal branches).

branch length
See “branch”.

node (of a phylogenetic tree)
In phylogenetic trees, nodes represent taxonomic units. Nodes between which there is direct transfer of genetic information are linked to each other by branches. Nodes in a phylogenetic tree that are attached to only a single terminal branch are referred to as terminal nodes (also external nodes, leaves, or tips), and represent operational taxonomic units. Nodes attached to more than one branch are referred to as internal (also interior) nodes and represent hypothetical taxonomic units.

phylogenetic tree [includes “root”]
A description of a path of transmission of genetic information between a set of operational (and usually also hypothetical, if the tree contains any internal nodes) taxonomic units (see Operational Taxonomic Units (OTUs) and Hypothetical Taxonomic Units (HTUs)). Tree structures, as understood in graph theory, can be used to represent phylogenetic trees. In graph theory, trees are defined as undirected graphs for which exactly one path connects any two nodes (or "vertices"), i.e., a tree is any connected graph that does not contain any cycles. Phylogenetic trees consist of nodes (operational or hypothetical taxonomic units) that are connected via branches.
Phylogenetic trees are described as either "rooted" or "unrooted". In a rooted tree, there is one node (the "root node") that represents the most recent common ancestor of all other taxa in the tree. While it is assumed that such an ancestor also exists for an unrooted tree, i.e., that all the OTUs share a most recent common ancestor, in an unrooted tree no inference is made about where on the tree this HTU might be.
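
The graph-theory definition can be checked directly: a graph with n nodes is a tree exactly when it is connected and has n - 1 edges. The sketch below tests this for a small unrooted tree; the node names are made up, and every edge is assumed to join two nodes from the node list.

```python
from collections import deque

def is_tree(nodes, edges):
    """True if the undirected graph is connected and contains no cycles."""
    nodes = list(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Breadth-first search from an arbitrary node to test connectivity.
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        for nxt in neighbours[queue.popleft()] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(nodes)

# Four leaves (A-D) joined through two internal nodes (X, Y).
leaves_and_internal = ["A", "B", "C", "D", "X", "Y"]
branches = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]
print(is_tree(leaves_and_internal, branches))
```
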

root
See “phylogenetic tree”

sister group
In a rooted bifurcating tree, any internal node represents an ancestor of two subtrees. These two subtrees are sometimes referred to as "sister groups" of each other, i.e., subtree A is the sister group of subtree B (and accordingly subtree B is the sister group of subtree A). As this definition depends on knowing the direction of genetic transmission along the branches of the tree (i.e., which nodes/branches are ancestral and which are descendant), it is only possible to identify sister groups in the context of a rooted phylogenetic tree; in unrooted trees the direction of transmission of genetic information is not specified, so it is not possible to identify which of the subtrees linked to an internal node are ancestral and which are descendant. To overcome this problem, the concept of adjacent groups was developed for referring to subtrees linked to the same internal node in unrooted trees.

subtree
A tree obtained by detaching a branch from a larger phylogenetic tree.

taxonomic unit
Phylogenetic trees can be used to describe patterns of genetic transmission between different kinds of entities, for example: different species, different individuals within a population of the same species, different genes within a gene family. The term "taxonomic unit" is used to refer to the entities between which patterns and paths of genetic transfer are described. Thus, for some trees, the taxonomic units will be individuals within a population, in other trees they will be different species.

Phylogeny definitions taken from entries written by Aidan Budd and Alexandros Stamatakis in the 2nd edition of the “Dictionary of Bioinformatics and Computational Biology” (eds John M. Hancock and Marketa J. Zvelebil)

In [4]:
%ls ProteinFamily/Proteomes/
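# -o saves the human-readable report; -A saves the alignment of significant hits in Stockholm format
!hmmsearch -E 1e-5 -o domainsofinterest.hmmsearch.out -A domainsofinterest.hmmalign.sth ProteinFamily/HMM_models/pbNB-ARC.hmm ProteinFamily/Proteomes/108.aa.fa
!head domainsofinterest.hmmsearch.out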

10015.aa.fa  6981.aa.fa  7328.aa.fa  9533.aa.fa  9583.aa.fa  9837.aa.fa
108.aa.fa    7058.aa.fa  7373.aa.fa  9536.aa.fa  9597.aa.fa  9869.aa.fa
1925.aa.fa   7067.aa.fa  7396.aa.fa  9537.aa.fa  9600.aa.fa  9871.aa.fa
5784.aa.fa   7111.aa.fa  7413.aa.fa  9542.aa.fa  9610.aa.fa  9879.aa.fa
5993.aa.fa   7167.aa.fa  7415.aa.fa  9543.aa.fa  9654.aa.fa  9887.aa.fa
6899.aa.fa   7186.aa.fa  7416.aa.fa  9545.aa.fa  9669.aa.fa  9944.aa.fa
6906.aa.fa   7213.aa.fa  7417.aa.fa  9549.aa.fa  9721.aa.fa  9947.aa.fa
6909.aa.fa   7273.aa.fa  9100.aa.fa  9550.aa.fa  9762.aa.fa  Athaliana.NLR.afa
6924.aa.fa   7288.aa.fa  9134.aa.fa  9554.aa.fa  9764.aa.fa
6939.aa.fa   7308.aa.fa  9332.aa.fa  9557.aa.fa  9784.aa.fa
6974.aa.fa   7322.aa.fa  9518.aa.fa  9580.aa.fa  9792.aa.fa
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - 
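
The report above is meant for reading. For scripting, hmmsearch can also write a per-sequence table with its `--tblout <file>` option. The sketch below reads such a table and keeps hits at or below an E-value cut-off. The function name and the toy rows are made up, and the column positions are stated as assumptions in the code.

```python
def read_tblout(lines, max_evalue=1e-5):
    """Return (target name, E-value, score) for hits at or below max_evalue.

    Assumes the hmmsearch --tblout layout: whitespace-separated columns with
    the target name first and the full-sequence E-value and score in
    columns 5 and 6. Comment lines start with '#'.
    """
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) < 6:
            continue
        evalue, score = float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append((fields[0], evalue, score))
    return hits

# Toy table with made-up hits.
toy_table = """# target name  accession  query name  accession  E-value  score  bias
protA  -  toy_domain  -  1.2e-30  105.3  0.1
protB  -  toy_domain  -  0.02     12.4   0.0
"""
print(read_tblout(toy_table.splitlines()))
```

The returned names can then be used to pull the matching sequences out of the proteome file.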