# Introduction to PROVEAN and other mutation assessors

### Several tools can assess the effect of a mutation; the table below compares their performance on a human dataset

<table bgcolor="white" border="1" cellspacing="0" style="text-align:right;" class="padded-table">

 <tbody><tr height="30" style="text-align:center">
  <th rowspan="2" width="80">Prediction tools</th>
  <th rowspan="2" width="90">Score thresholds</th>
  <th colspan="5">Human dataset</th>
  <th rowspan="2">References</th>
 </tr>

 <tr height="40">

  <td width="70">Sensitivity</td>
  <td width="70">Specificity</td>
  <td width="70"><b>Accuracy*</b></td>
    <td width="70"><b>Balanced accuracy**</b></td>
  <td width="70">No prediction</td>


 </tr>

 <tr height="40">
  <td style="text-align:center"><b>PROVEAN</b></td>
  <td>-4.100<br><span style="font-size: 7pt">(default)</span> -2.500<br>-1.300</td>

  <td>56.17<br>79.76<br>90.28</td>
  <td>90.34<br>78.63<br>61.65</td>
  <td>78.00<br>79.04<br>71.99</td>
  <td>73.25<br><b>79.19</b><br>75.96</td>
  <td>0</td>
  
  <td><a href="http://dx.plos.org/10.1371/journal.pone.0046688" target="_blank">Choi et al., 2012</a>; <a target="_blank" href="http://provean.jcvi.org">web</a></td>
 </tr>

 <tr height="40">
  <td style="text-align:center"><b>Mutation Assessor</b></td>
  <td>0.800<br>1.90</td>

  <td>96.54<br>85.29</td>
  <td>40.59<br>71.02</td>
  <td>60.90<br>76.20</td>
    <td>68.57<br><b>78.15</b></td>
  <td>317<br>(0.55%)</td>

  <td><a target="_blank" href="http://www.ncbi.nlm.nih.gov/pubmed/21727090">Reva et al., 2011</a>; <a target="_blank" href="http://mutationassessor.org">web</a></td>
 </tr>

 <tr height="40">
  <td style="text-align:center"><b>SIFT</b></td>
  <td>0.050</td>

  <td>85.03</td>
  <td>68.95</td>
  <td>74.77</td>
    <td><b>76.99</b></td>
  <td>1147<br>(1.99%)</td>

  <td><a target="_blank" href="http://www.ncbi.nlm.nih.gov/pubmed/19561590">Kumar et al., 2009</a>; <a target="_blank" href="http://sift.jcvi.org">web</a></td>
 </tr>

 <tr height="40">
  <td style="text-align:center"><b>PolyPhen-2</b></td>
  <td>0.432</td>

  <td>88.68</td>
  <td>62.45</td>
  <td>72.00</td>
    <td><b>75.56</b></td>
  <td>2279<br>(3.95%)</td>

  <td><a target="_blank" href="http://www.ncbi.nlm.nih.gov/pubmed/20354512">Adzhubei et al., 2010</a>; <a target="_blank" href="http://genetics.bwh.harvard.edu/pph2">web</a></td>
 </tr>

<tr height="40">
  <td style="text-align:center"><b>Condel web server</b></td>
  <td>0.469</td>

  <td>93.84</td>
  <td>46.23</td>
  <td>64.67</td>
    <td><b>70.04</b></td>
  <td>7194<br>(12.48%)</td>
  
  <td><a target="_blank" href="http://www.ncbi.nlm.nih.gov/pubmed/21457909">González-Pérez and López-Bigas, 2011</a>; <a target="_blank" href="http://bg.upf.edu/condel">web</a></td>
 </tr>

</tbody></table>
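The ** footnote on the balanced accuracy column is not expanded in this notebook. The sketch below assumes the usual definition (the mean of sensitivity and specificity) and checks it against a few rows copied from the table. Under that assumption the computed values agree with the reported ones to within 0.01.

```python
# Balanced accuracy under the usual definition (an assumption here, since the
# table's ** footnote is not spelled out): mean of sensitivity and specificity.
def balanced_accuracy(sensitivity, specificity):
    return (sensitivity + specificity) / 2.0

# (sensitivity, specificity, reported balanced accuracy) copied from the table
rows = {
    "PROVEAN (-2.500)": (79.76, 78.63, 79.19),
    "Mutation Assessor (1.90)": (85.29, 71.02, 78.15),
    "SIFT (0.050)": (85.03, 68.95, 76.99),
    "PolyPhen-2 (0.432)": (88.68, 62.45, 75.56),
    "Condel (0.469)": (93.84, 46.23, 70.04),
}

for tool, (sens, spec, reported) in sorted(rows.items()):
    computed = balanced_accuracy(sens, spec)
    print("%-26s computed %.3f, reported %.2f" % (tool, computed, reported))
```

The bolded value in each row of the table is the best balanced accuracy that tool reaches across its listed thresholds.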

## Let's start with PROVEAN http://provean.jcvi.org/seq_submit.php

#### Click "upload example 1"; the amino acid variant to test is ***P72R***. PROVEAN takes the protein query sequence in FASTA format, i.e. a header line (>p53 isoform a Homo sapiens) followed by the sequence (MEEPQSDPSVEPPLSQETFSDLWKLLPENN...)

#### The table further down shows how to write the amino acid variant for PROVEAN

Steps
* Go to PROVEAN and select PROVEAN Protein
* Enter the amino acid sequence in FASTA format (to practice, click "upload example 1")
* Enter the variant in one of the formats described in the table below (for this example, keep only P72R)
* Submit the query and wait for the results

In [None]:
import pandas as pd

In [None]:
GP = pd.read_csv('DF_GEMPRO.csv', index_col=0)
# forcing gene IDs to be read as strings
GP['m_gene_original'] = GP['m_gene_original'].astype(str)
GP['m_gene_entrez'] = GP['m_gene_entrez'].astype(str)
GP['m_gene_isoform'] = GP['m_gene_isoform'].astype(str)
# look up the row(s) that match a given UniProt accession
# (the same pattern works for the other ID columns: original, Entrez, isoform, etc.)
gene_id = raw_input("What is the UniProt ID?   ")

# strip stray whitespace and upper-case the answer before comparing
GP[GP.u_uniprot_acc == gene_id.strip().upper()]

In [None]:
gene_original_id = raw_input("What is the gene original ID?   ")

In [None]:
seq = GP[GP.m_gene_original == gene_original_id].u_seq.values[0]

In [None]:
# prints the sequence in FASTA format: a '>' header line, then the sequence
def fasta(seq):
    fas_name = raw_input("What would you like to name this?   ")
    # the header starts with '>' immediately followed by the name
    fas_out = '>%s\n%s' % (fas_name, seq)
    print("Copy and paste this sequence into: Step 1. Enter a protein sequence")
    print(fas_out)

In [None]:
fasta(seq)
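For use outside an interactive session, the same formatting can be written as a plain function that returns the record instead of prompting for a name. This is a minimal sketch: `to_fasta` and its `width` parameter are illustrative names, and wrapping the sequence at 60 characters is a common convention rather than something PROVEAN is known to require.

```python
import textwrap

def to_fasta(name, seq, width=60):
    """Return a FASTA record: a '>' header line, then the sequence wrapped to `width`."""
    header = ">" + name.strip()
    # textwrap.wrap splits the unbroken residue string into fixed-width lines
    body = "\n".join(textwrap.wrap(seq.strip(), width))
    return header + "\n" + body

# first ten residues of the p53 example sequence
print(to_fasta("p53_fragment", "MEEPQSDPSV"))
```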

<table border="1" cellspacing="0">
<tbody><tr class="tableHeader">
    <td rowspan="2"><p>Type</p></td>
    <td colspan="2"><p>Format</p></td>
    <td rowspan="2"><p>Meaning</p></td>
    <td rowspan="2"><p>Variant Sequence</p></td>
  </tr>
  <tr class="tableHeader">
    <td><p>Comma-separated values</p></td>
    <td><p>HGVS notation</p></td>
  </tr>
  <tr class="tableRowOdd">
    <td><p>Single Amino Acid Substitution</p></td>
    <td><p>5,Q,A</p></td>
    <td><p>Q5A</p></td>
    <td><p>Q at position 5 is changed to A</p></td>
    <td><p>MEEPASDPSV</p></td>
  </tr>
  <tr class="tableRowEven">
    <td rowspan="2"><p>Deletion</p></td>
    <td><p>4,P,.</p></td>
    <td><p>P4<i>del</i></p></td>
    <td><p>P at position 4 is deleted</p></td>
    <td><p>MEEQSDPSV</p></td>
  </tr>
  <tr class="tableRowEven">
    <td><p>4,PQS,.</p></td>
    <td><p>P4_S6<i>del</i></p></td>
    <td><p>A deletion of three amino acids, from P at position 4 to S at position 6</p></td>
    <td><p>MEEDPSV</p></td>
  </tr>
  <tr class="tableRowOdd">
    <td rowspan="2"><p>Insertion</p></td>
    <td><p>7,D,DVA</p></td>
    <td><p>D7_P8<i>ins</i>VA</p></td>
    <td><p>VA is inserted between positions 7 and 8</p></td>
    <td><p>MEEPQSDVAPSV</p></td>
  </tr>
  <tr class="tableRowOdd">
    <td><p>6,S,SPQS<br>3,E,EPQS</p></td>
    <td><p>P4_S6<i>dup</i></p></td>
    <td><p>PQS is duplicated</p></td>
    <td><p>MEEPQSPQSDPSV</p></td>
  </tr>
  <tr class="tableRowEven">
    <td rowspan="2"><p>Insertion-deletion (Indel)</p></td>
    <td><p>7,DP,VA</p></td>
    <td><p>D7_P8<i>delins</i>VA</p></td>
    <td><p>DP is replaced by VA</p></td>
    <td><p>MEEPQSVASV</p></td>
  </tr>
  <tr class="tableRowEven">
    <td><p>3,E,QS</p></td>
    <td><p>E3<i>delins</i>QS</p></td>
    <td><p>E is replaced by QS</p></td>
    <td><p>MEQSPQSDPSV</p></td>
  </tr>
</tbody>
</table>
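The comma-separated format in the table can be read as "position, reference residues, replacement residues", with `.` standing for an empty replacement. The sketch below applies that reading to the ten-residue example sequence and compares each result with the Variant Sequence column. `apply_variant` is an illustrative helper, not part of PROVEAN.

```python
def apply_variant(seq, csv_variant):
    """Apply one 'position,reference,variant' entry to seq (position is 1-based)."""
    pos_text, ref, alt = [field.strip() for field in csv_variant.split(",")]
    start = int(pos_text) - 1
    # the reference residues must actually be present at the stated position
    if seq[start:start + len(ref)] != ref:
        raise ValueError("reference %r not found at position %s" % (ref, pos_text))
    if alt == ".":  # '.' marks a pure deletion
        alt = ""
    return seq[:start] + alt + seq[start + len(ref):]

wild_type = "MEEPQSDPSV"

# (comma-separated variant, expected variant sequence) copied from the table
table_rows = [
    ("5,Q,A", "MEEPASDPSV"),
    ("4,P,.", "MEEQSDPSV"),
    ("4,PQS,.", "MEEDPSV"),
    ("7,D,DVA", "MEEPQSDVAPSV"),
    ("6,S,SPQS", "MEEPQSPQSDPSV"),
    ("3,E,EPQS", "MEEPQSPQSDPSV"),
    ("7,DP,VA", "MEEPQSVASV"),
    ("3,E,QS", "MEQSPQSDPSV"),
]

for variant, expected in table_rows:
    result = apply_variant(wild_type, variant)
    print("%-10s -> %-14s matches table: %s" % (variant, result, result == expected))
```

Checking the reference residue against the sequence before editing catches off-by-one positions, which are easy to introduce when the numbering comes from a different isoform.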

In [1]:
mut_type = raw_input("What type of mutation is it (Single Amino Acid [SAA], Del, In, or Indel)?   ").upper()
if mut_type == 'SAA':
    # substitution, e.g. P72R
    mut_loc_SAA = raw_input("Which amino acid position is it located in? ")
    mut_init = raw_input("Which amino acid was it initially? ")
    mut_out = raw_input("Which amino acid was it changed to? ")
    print("this is your input for Step 2: %s%s%s" % (mut_init, mut_loc_SAA, mut_out))
elif mut_type == 'DEL':
    # deletion of a single residue, e.g. P4del
    mut_loc = raw_input("Which amino acid position was deleted? ")
    mut_init = raw_input("Which amino acid was it initially? ")
    print("this is your input for Step 2: %s%sdel" % (mut_init, mut_loc))
elif mut_type == 'IN':
    # insertion between two adjacent residues, e.g. D7_P8insVA
    mut_insert = raw_input("Which amino acid(s) does the insertion consist of? ")
    mut_aa1 = raw_input("Which amino acid comes immediately before the insertion? ")
    mut_aa1_pos = raw_input("What position is that amino acid in? ")
    mut_aa2 = raw_input("Which amino acid comes immediately after the insertion? ")
    mut_aa2_pos = raw_input("What position is that amino acid in? ")
    print("this is your input for Step 2: %s%s_%s%sins%s" % (mut_aa1, mut_aa1_pos, mut_aa2, mut_aa2_pos, mut_insert))
elif mut_type == 'INDEL':
    # deletion-insertion, e.g. E3delinsQS or D7_P8delinsVA
    mut_num_t = raw_input("Is there only one deletion or a range of deletions (answer with 'one' or 'range')   ").upper()
    if mut_num_t == 'ONE':
        mut_del1 = raw_input("What is the AA that is being deleted?   ")
        mut_del1_pos = raw_input("What position is that AA in?   ")
        mut_indel = raw_input("What AAs are being inserted (can insert more than one, ie MKSS)   ")
        print("this is your input for Step 2: %s%sdelins%s" % (mut_del1, mut_del1_pos, mut_indel))
    elif mut_num_t == 'RANGE':
        mut_del1 = raw_input("What is the first AA that is being deleted?   ")
        mut_del1_pos = raw_input("What position is that AA in?   ")
        mut_del2 = raw_input("What is the last AA that is being deleted?   ")
        mut_del2_pos = raw_input("What position is that AA in?   ")
        mut_indel = raw_input("What AAs are being inserted (can insert more than one, ie MKSS)   ")
        print("this is your input for Step 2: %s%s_%s%sdelins%s" % (mut_del1, mut_del1_pos, mut_del2, mut_del2_pos, mut_indel))
    else:
        print("please answer with 'one' or 'range'")
else:
    print("please input SAA, Del, In, or Indel")

What type of mutation is it (Single Amino Acid [SAA], Del, In, or Indel)?   indel
Is there only one deletion or a range of deletions (answer with 'one' or 'range')   range
What is the first AA that is being deleted?   M
What position is that AA in?   1
What is the last AA that is being deleted?   S
What position is that AA in?   6
What AAs are being inserted (can insert more than one, ie MKSS)   VSS
this is your input for Step 2: M1_S6delinsVSS
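The interactive cell above builds HGVS-style strings. Going the other way, a single substitution such as P72R can be checked and converted to the comma-separated form from the format table. This is a sketch that covers substitutions only; `substitution_to_csv` is an illustrative name, and the pattern assumes one-letter amino acid codes.

```python
import re

# reference residue, 1-based position, new residue (one-letter codes assumed)
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def substitution_to_csv(hgvs):
    """Convert e.g. 'P72R' to the comma-separated form '72,P,R'."""
    match = SUBSTITUTION.match(hgvs.strip())
    if match is None:
        raise ValueError("not a single amino acid substitution: %r" % hgvs)
    ref, pos, alt = match.groups()
    return "%s,%s,%s" % (pos, ref, alt)

print(substitution_to_csv("P72R"))
print(substitution_to_csv("Q5A"))
```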


### After submitting the P72R variant, PROVEAN reports a score and predicts whether the mutation is neutral or deleterious
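PROVEAN makes this call by comparing the score with a cutoff: a score at or below the threshold is predicted deleterious, and anything above it neutral. The sketch below uses the default threshold of -2.5 from the comparison table. The example scores are made up to show the boundary behaviour and are not PROVEAN output for P72R.

```python
DEFAULT_THRESHOLD = -2.5  # default cutoff listed in the comparison table

def classify_provean(score, threshold=DEFAULT_THRESHOLD):
    """Label a PROVEAN score: at or below the threshold counts as deleterious."""
    return "deleterious" if score <= threshold else "neutral"

# hypothetical scores, chosen only to illustrate the cutoff
for score in (-5.0, -2.5, -1.0):
    print("%5.1f -> %s" % (score, classify_provean(score)))
```

Passing `threshold=-4.1` or `threshold=-1.3` trades sensitivity against specificity, as in the other two PROVEAN rows of the table.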

## SIFT http://sift.bii.a-star.edu.sg/www/SIFT_seq_submit2.html 

## Let's take a look at Mutation Assessor http://mutationassessor.org/r3/

#### Let's continue with the same protein example we used in PROVEAN (link describes the protein http://www.uniprot.org/uniprot/P04637). Note that Mutation Assessor only analyzes amino acid substitutions (no indels)

The server accepts list of variants, one variant per line, plus optional text describing your variants,
in genomic coordinates, "+" strand assumed :
  genome build,chromosome,position,reference allele,substituted allele
Genome build is optional (build 19 assumed), accepted values: 'hg19' and 'hg38'
Examples:
  hg38,13,32338418,G,T   BRCA2
  hg19,7,55211080,G,A   EGFR
  7,55211080,G,A   EGFR
 
or in protein space:
  protein ID variant text,
where protein ID can be :
- Uniprot protein accession (e.g. EGFR_HUMAN)
- NCBI Refseq protein ID (e.g. NP_005219)
Examples:
  EGFR_HUMAN R521K
  EGFR_HUMAN R98Q Polymorphism
  EGFR_HUMAN G719D disease
  NP_000537 G356A
  NP_000537 G360A dbSNP:rs35993958
  NP_000537 S46A Abolishes phosphorylation

The input is: P53_HUMAN P72R (P53_HUMAN is the UniProt entry name for P04637)
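When many variants need to be submitted, the input lines can be generated rather than typed. The sketch below follows the two layouts quoted from the server help text above. The function names are illustrative, and the only validation is the list of accepted genome builds given in that help text.

```python
ACCEPTED_BUILDS = ("hg19", "hg38")  # values listed in the server help text

def genomic_variant(chromosome, position, ref, alt, build=None, note=""):
    """Format '[build,]chromosome,position,ref,alt' plus an optional free-text note."""
    fields = [str(chromosome), str(position), ref, alt]
    if build is not None:
        if build not in ACCEPTED_BUILDS:
            raise ValueError("unsupported genome build: %r" % build)
        fields.insert(0, build)
    line = ",".join(fields)
    return line + ("   " + note if note else "")

def protein_variant(protein_id, variant, note=""):
    """Format 'protein_id variant [note]' for protein-space input."""
    return " ".join(part for part in (protein_id, variant, note) if part)

print(protein_variant("P53_HUMAN", "P72R"))
print(genomic_variant(7, 55211080, "G", "A", build="hg19", note="EGFR"))
```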

Submitting the query with the pre-selected options gives this output, which again indicates how deleterious the mutation is predicted to be <img src="Mut_Ass.png">

Selecting all of the options before submitting adds this information to the output <img src="Mut_Ass2.png">

## Let's take a look at PolyPhen-2 http://genetics.bwh.harvard.edu/pph2/

This is the input interface  <img src="PP2_1.png">

#### Let's continue with the same protein example we used in PROVEAN (link describes the protein http://www.uniprot.org/uniprot/P04637)

Input either the Protein identifier or the FASTA sequence <img src="PP2_2.png">

# Start of PDB Structural Visualization (PV JavaScript viewer)

In [None]:
# load Biopython PDB packages

# PDBList to download PDBs
from Bio.PDB.PDBList import PDBList
pdbl = PDBList()

# PDBParser to load and work with files
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser()

# uuid gives each viewer <div> a unique id
import uuid

In [None]:
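# download the structure in legacy PDB format so that PDBParser can read it
# (recent Biopython releases download mmCIF by default)
pdb_file_path = pdbl.retrieve_pdb_file('3BWM', file_format='pdb')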

In [None]:
# open the downloaded file
# COMT is just the name of this structure - can be arbitrary
structure = parser.get_structure('COMT', pdb_file_path)

In [None]:
# get the ligands within this file for display
# from: http://stackoverflow.com/questions/25718201/remove-heteroatoms-from-pdb
ligands = []

for residue in structure.get_residues():
    tags = residue.get_full_id()
    # tags is a tuple: (structure ID, model ID, chain ID, residue ID)
    # residue ID is itself a tuple: (hetero field, residue number, insertion code)

    # The hetero field is blank for standard residues, "W" for waters,
    # and "H_<name>" for other hetero residues, so keep only the named ones
    if tags[3][0] != " " and tags[3][0] != "W":
        ligands.append(tags[3][0].split('_')[1].strip())
        
print(ligands)
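The filtering rule in the cell above can be tried without downloading a structure by running it on plain residue ID tuples. In this sketch the IDs are made-up examples, not read from 3BWM, and `ligand_names` is an illustrative helper. Unlike the loop above, it also drops repeated names, so each ligand appears once.

```python
def ligand_names(residue_ids):
    """Collect unique hetero residue names from (hetero field, number, insertion code) tuples."""
    names = []
    for hetero_field, _number, _insertion_code in residue_ids:
        # "H_<name>" marks a non-water hetero residue; " " and "W" are skipped
        if hetero_field.startswith("H_"):
            name = hetero_field[2:].strip()
            if name not in names:
                names.append(name)
    return names

# hypothetical residue IDs: a standard residue, a water, and three hetero residues
example_ids = [
    (" ", 10, " "),
    ("W", 501, " "),
    ("H_LIG", 301, " "),
    ("H_ MG", 302, " "),
    ("H_LIG", 303, " "),
]

print(ligand_names(example_ids))
```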

In [None]:
class PDBViewer(object):
    '''
    Contributed by: Ali Ebrahim
    '''
    
    def __init__(self, f):
        # read the whole PDB file; the context manager closes the file handle
        with open(f) as handle:
            self.pdb = handle.read()

    def _repr_html_(self):
        div_id = str(uuid.uuid4())
        
        return """<div id="%s" style="width: 800px; height: 600px"></div>
        <!--script src="//biasmv.github.io/pv/js/pv.min.js"></script-->
        <script>
        require.config({paths: {"pv": "//biasmv.github.io/pv/js/pv.min"}});
        require(["pv"], function (pv) {
            pdb = "%s";
            structure = pv.io.pdb(pdb);
            viewer = pv.Viewer(document.getElementById('%s'),
                               {quality : 'medium', width: 'auto', height : 'auto',
                                antialias : true, outline : true});
            viewer.fitParent();
            var ligand = structure.select({rnames : %s});
            viewer.ballsAndSticks('ligand', ligand);
            viewer.cartoon('molecule', structure);
            viewer.centerOn(structure);
            
        });
        </script>
        """ % (div_id, self.pdb.replace("\n", "\\n"), div_id, ligands)

In [None]:
PDBViewer(pdb_file_path)
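The viewer class pastes the PDB text into a double-quoted JavaScript string and escapes only the newlines, so a double quote or backslash in the file would break the script. One possible alternative, sketched here and not what the class above does, is to let `json.dumps` produce the JavaScript literal for both the PDB text and the ligand list. `js_literal` is an illustrative helper, and it does not guard against a literal `</script>` in the text.

```python
import json

def js_literal(value):
    """Serialize a Python string or list as a JavaScript literal via JSON."""
    return json.dumps(value)

# made-up PDB-like text containing a double quote and newlines
pdb_text = 'HEADER    EXAMPLE\nREMARK  a "quoted" word\nEND\n'

print(js_literal(pdb_text))
print(js_literal(["LIG", "MG"]))
```

With this approach the template would use `pdb = %s;` with no surrounding quotes, because `json.dumps` already emits them.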

In [None]:
from pvviewer import PDBViewer

In [None]:
import sys
print (sys.version)