<h1><span style="color:red">Please read this very carefully! </span></h1>

To set up your own experiments, you need to download remote files to your Linux disk image in the collaboratory environment. Because the data in your user account is NOT reset when you close or reload the HBP, you have to be careful about how you organize and structure your data. To help you with that, we create a unique working directory for each molecular use case you run.

Please also be aware that this use case changes the current working directory. To go back to your starting directory, you have to restart the notebook and clear all output.

## Compare the electrostatic potentials surrounding a set of protein isoforms with multipipsa


**Aim:** This use case shows how to use the multipipsa tool to calculate the electrostatic potentials surrounding a set of protein isoforms in aqueous solution, then cluster the isoforms by their electrostatic similarity.

**Version:** 1.1 (January 2020)

**Contributors:**  Neil Bruce, Lukas Adam, Stefan Richter, Rebecca Wade (HITS, Heidelberg, Germany)

**Contact:** [mcmsoft@h-its.org](mailto:mcmsoft@h-its.org)

**Note:** This notebook has graphical output using nglview. If you use the "RunAll" function of the notebook, this graphical output might not appear on your screen. The cell defined to show the output must be visible in the browser during execution.

## Setting up your environment

### Check that all required python packages are installed and working

In [None]:
! pip install --upgrade pip
! pip uninstall --yes numpy
! pip uninstall --yes pandas
! pip install "pandas>=1.0.1"
! pip install "numpy>=1.16"

In [None]:
# Check that required packages are installed
! pip install --upgrade "hbp-service-client" 
! pip install wget python-magic
! pip install rpy2==2.9.1
! pip install setuptools
! pip install --extra-index-url https://projects.h-its.org/pypi multipipsa==4.0.10
! pip install nglview
! mkdir -p ~/.R/lib
! grep -qxF 'R_LIBS_USER=~/.R/lib' ~/.Renviron || echo 'R_LIBS_USER=~/.R/lib' >> ~/.Renviron
! wget -c https://cran.r-project.org/src/contrib/fastcluster_1.1.25.tar.gz
! wget -c https://cran.r-project.org/src/contrib/heatmap3_1.1.7.tar.gz
! R CMD INSTALL -l ~/.R/lib fastcluster_1.1.25.tar.gz
! R CMD INSTALL -l ~/.R/lib heatmap3_1.1.7.tar.gz


In [None]:
# Import python packages/classes used in this notebook
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy
import rpy2
import os, wget, datetime, magic, inspect
from multipipsa.multipipsa import PipsaRun, ApbsRun
from multipipsa.clusterpipsa import ClusterPipsa
from multipipsa.pipsatypes import DistanceType
from PIL import Image
from hbp_service_client.storage_service.client import Client
import nglview

### Set up local directory structure

In [None]:
# Create a local working directory
try:
    homeDir = os.environ['HOME']
except:
    print("Error in environment")

else:
    workDir = os.path.join(homeDir, 'work')
    if not os.path.isdir(workDir):
        try:
            os.mkdir(workDir)
        except:
            print("unable to make working directory")
    
    # Make a new directory to run the use case in. 
    # If directory already exists, add a number to make a unique name
    baseDir = 'wholePIPSA'
    dirIter = 0
    useCaseDir = os.path.join(workDir, baseDir)
    print(useCaseDir)
    
    if os.path.exists(useCaseDir):
        while os.path.exists(useCaseDir):
            dirIter += 1
            useCaseDir = os.path.join(workDir, baseDir + '.' + str(dirIter))            
    
    try:
        os.mkdir(useCaseDir)
    except:
        print("Failed to make use case working directory")
    else:
        print("Working directory for current use case: %s" % useCaseDir)


### Set up collab storage for saving data at end of calculation

In [None]:
#Find your own collab storage path
collab_path = get_collab_storage_path()
print(collab_path)
storage_client = Client.new(oauth.get_token())

# Compare the electrostatic potentials surrounding a set of protein isoforms with multipipsa

This use case describes the use of the [multipipsa](https://collab.humanbrainproject.eu/#/collab/19/nav/2108?state=software,multipipsa) software tool to compare the electrostatic potentials of a set of similar proteins. In addition to its own functions, multipipsa makes use of the following open source tools:

* [PDB2PQR](https://apbs-pdb2pqr.readthedocs.io/en/latest/pdb2pqr/index.html): A tool that takes a protein structure in [PDB format](http://www.wwpdb.org/documentation/file-format), adds missing hydrogen atoms, and creates a structure file in PQR format. The PQR file format is derived from the PDB format for describing atomic data, but with the occupancy and temperature factor fields replaced with atomic partial charges and radii.

* [APBS](https://apbs-pdb2pqr.readthedocs.io/en/latest/apbs/index.html): A tool that calculates electrostatic potentials through solution of the Poisson-Boltzmann equation, one of the most common continuum models for describing electrostatic interactions between molecular solutes in salty, aqueous media. 
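
To make the PQR layout described above concrete, here is a minimal sketch of reading the charge and radius columns. It assumes the whitespace-separated PQR variant, in which the last five fields of each atom record are x, y, z, charge and radius. The function name `parse_pqr_line` and the two example records (with made-up coordinates, charges and radii) are purely illustrative and are not part of PDB2PQR or multipipsa.

```python
def parse_pqr_line(line):
    """Parse one whitespace-separated PQR atom record into a dictionary.

    Returns None for lines that are not ATOM/HETATM records.
    """
    fields = line.split()
    if not fields or fields[0] not in ("ATOM", "HETATM"):
        return None
    # In a PQR file the last five fields are x, y, z, charge and radius
    x, y, z, charge, radius = (float(value) for value in fields[-5:])
    return {
        "atom_name": fields[2],
        "residue_name": fields[3],
        "xyz": (x, y, z),
        "charge": charge,
        "radius": radius,
    }


# Hypothetical records, for illustration only (not taken from the AC structures)
example_lines = [
    "ATOM      1  N   MET A   1      10.000  20.000  30.000 -0.3000 1.8240",
    "ATOM      2  CA  MET A   1      11.400  20.500  30.100  0.1000 1.9080",
    "END",
]

atoms = [atom for atom in map(parse_pqr_line, example_lines) if atom is not None]
total_charge = sum(atom["charge"] for atom in atoms)
print(len(atoms), "atoms read, total charge:", round(total_charge, 4))
```

Summing the charge column in this way is a quick sanity check that a PQR file carries the expected net charge.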


This use case also makes use of the [NGL View](https://github.com/arose/nglview) python package for displaying molecular data. NGL View provides IPython widgets for displaying molecular data inside notebooks, using the [NGL Viewer](http://nglviewer.org/) WebGL molecular viewer.


The steps taken in this use case are:

* Calculate the electrostatic potentials surrounding a set of similar proteins.
* Use PIPSA analysis to calculate the pairwise similarities of the proteins' electrostatic potentials.
* Cluster the proteins according to this similarity.


For more detailed information on the first step, see the other molecular use case [Calculate the electrostatic potential of a protein from its atomic structure](https://collab.humanbrainproject.eu/#/collab/1655/nav/362934). Instead of comparing the whole electrostatic potentials of the proteins, the potential in a specific region can also be compared. For an example of this see the molecular use case [Compare a specific region of the electrostatic potentials surrounding a set of protein isoforms with multipipsa](https://collab.humanbrainproject.eu/#/collab/1655/nav/362934).

## PIPSA analysis

PIPSA provides a method for quantitatively comparing the three-dimensional interaction property fields of a set of structurally similar proteins. The structures of the proteins must be suitably aligned for the comparison to be meaningful. The method was first described in [Blomberg et al (1999)](https://doi.org/10.1002/&#40;SICI&#41;1097-0134&#40;19991115&#41;37:3%3C379::AID-PROT6%3E3.0.CO;2-K). In this use case, the interaction property we consider is the electrostatic potential.

In the PIPSA method, interaction potential fields are discretised on three-dimensional grids and compared at grid points lying within a skin surrounding the proteins. The skin of the protein is defined as the region beginning at a **distance $\sigma$** from the van der Waals surface of the protein's atoms, with a **thickness $\delta$**. Commonly used parameters for this skin are $\sigma = 3$ Angstrom and $\delta = 4$ Angstrom. The region to be compared can be further restricted to the intersection of this skin with a sphere or a cone (as shown in the figure below).

![PIPSA](https://projects.h-its.org/mcmsoft/pipsa/3.2/f4skin.gif)
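
The skin definition can be sketched in a few lines of NumPy. This is only an illustration of the geometric idea, not the PIPSA implementation: it assumes that the distance from a point outside the protein to the van der Waals surface is the smallest value of (distance to atom centre minus atom radius) over all atoms. The function name `skin_mask` and the single-atom test system are hypothetical.

```python
import numpy as np


def skin_mask(grid_points, atom_centers, atom_radii, sigma=3.0, delta=4.0):
    """Return a boolean array marking the grid points that lie in the skin.

    grid_points:  array of shape (N, 3), point coordinates in Angstrom
    atom_centers: array of shape (M, 3), atom coordinates in Angstrom
    atom_radii:   array of shape (M,), van der Waals radii in Angstrom
    """
    grid_points = np.asarray(grid_points, dtype=float)
    atom_centers = np.asarray(atom_centers, dtype=float)
    atom_radii = np.asarray(atom_radii, dtype=float)

    # Distance from every grid point to every atom centre, shape (N, M)
    diffs = grid_points[:, None, :] - atom_centers[None, :, :]
    center_dist = np.linalg.norm(diffs, axis=2)

    # Distance from each grid point to the nearest van der Waals sphere
    surface_dist = np.min(center_dist - atom_radii[None, :], axis=1)

    # The skin starts at sigma from the surface and has thickness delta
    return (surface_dist >= sigma) & (surface_dist <= sigma + delta)


# Hypothetical test system: one atom of radius 2 Angstrom at the origin
points = [[4.0, 0.0, 0.0], [6.0, 0.0, 0.0], [8.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
mask = skin_mask(points, atom_centers=[[0.0, 0.0, 0.0]], atom_radii=[2.0])
print(mask)
```

With the default $\sigma = 3$ and $\delta = 4$, only the points whose distance to the sphere surface lies between 3 and 7 Angstrom are selected. For a real protein the all-pairs distance array can become large, so this direct approach is meant for illustration only.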


### Similarity measure

The interaction fields of two structures can be compared quantitatively by calculating pairwise similarity indices. One such measure is the Hodgkin index, which is given by 
$$ SI_{12} = \frac{2(p_{1},p_{2})}{(p_{1},p_{1})+(p_{2},p_{2})} $$

where $ (p_{1},p_{2}) $ is the scalar product of the fields surrounding proteins $1$ and $2$ in the analysis region. It can be calculated as: 

$$ (p_{1},p_{2}) = \sum_{i,j,k}\phi_{1} \left ( i,j,k \right )\phi_{2} \left ( i,j,k \right ) $$

where $i$, $j$ and $k$ index the grid points along the three spatial dimensions, and $\phi_{1} \left ( i,j,k \right )$ and $\phi_{2} \left ( i,j,k \right )$ are the potentials of proteins $1$ and $2$ at grid point $\left ( i,j,k \right )$. The similarity index runs from $-1$, for completely anti-correlated potentials, to $1$ for identical potentials. At $0$, there is no correlation between the two potentials. The pairwise similarity can be converted to a distance measure, for use in clustering, using

$$ D_{12} = \sqrt{2 - 2SI_{12}} $$
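
As a minimal sketch of the two formulas above, assuming the potentials are available as NumPy arrays on identical grids, the similarity index and distance can be computed as follows. The function names `hodgkin_index` and `pipsa_distance`, the optional `mask` argument and the random test field are illustrative and are not part of multipipsa.

```python
import numpy as np


def hodgkin_index(phi1, phi2, mask=None):
    """Hodgkin similarity index of two potential grids of identical shape.

    mask is an optional boolean array selecting the analysis region.
    """
    phi1 = np.asarray(phi1, dtype=float)
    phi2 = np.asarray(phi2, dtype=float)
    if mask is not None:
        # Keep only the grid points inside the analysis region
        mask = np.asarray(mask, dtype=bool)
        phi1 = phi1[mask]
        phi2 = phi2[mask]
    p12 = np.sum(phi1 * phi2)  # scalar product (p1, p2)
    p11 = np.sum(phi1 * phi1)  # scalar product (p1, p1)
    p22 = np.sum(phi2 * phi2)  # scalar product (p2, p2)
    return 2.0 * p12 / (p11 + p22)


def pipsa_distance(similarity):
    """Convert a similarity index into the distance D = sqrt(2 - 2 SI)."""
    # max() guards against tiny negative values caused by rounding
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * similarity)))


# Synthetic field used only to illustrate the limiting cases
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 4, 4))

si_same = hodgkin_index(phi, phi)       # identical potentials
si_opposite = hodgkin_index(phi, -phi)  # completely anti-correlated potentials
print(si_same, pipsa_distance(si_same))
print(si_opposite, pipsa_distance(si_opposite))
```

The two limiting cases reproduce the range stated above: identical fields give a similarity of $1$ and a distance of $0$, while a field and its negative give a similarity of $-1$ and a distance of $2$.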






## Downloading the protein structures

In this use case, we use as our input structures the catalytic domain of nine isoforms of the enzyme adenylyl cyclase (AC1 to AC9), modelled during the work described in [Tong et al (2016)](https://doi.org/10.1002/prot.25167). The following cell downloads these structures from the CSCS storage area.

In [None]:
# Download isoform structure files from CSCS storage for calculation

# Loop to download AC1 - 9 structures
for iso in range(1, 10):
    # Build the URL of the structure file for the current isoform
    fileUrl = 'https://object.cscs.ch/v1/AUTH_c0a333ecf7c045809321ce9d9ecdfdea/SGA2_molecular_models/data/Modelled_adenylyl_cyclase_AC_isoform_structures/refined/AC' + str(iso) + '.pdb'

    print("Downloading AC%d structure file from CSCS storage area" % iso)
    try:
        # Save the file into the use case working directory
        wget.download(fileUrl, useCaseDir)
    except Exception:
        print("Error downloading structure file AC%d from CSCS storage" % iso)
        print(fileUrl)
    else:
        print("Successfully downloaded the structure file AC%d from CSCS storage" % iso)

### Viewing the protein structures
The following cell creates a molecular viewer to visualise the structures of the AC isoforms. The catalytic domain of AC is a dimer consisting of two protein chains. In the full structure of AC these two chains are connected by a series of transmembrane helices that anchor the protein in the post-synaptic membrane.

In [None]:
# View the downloaded structure
# Create a NGL widget object
viewPDB = nglview.NGLWidget()
# Set the display size
viewPDB._remote_call('setSize', target='Widget', args=['600px','400px'])

# Lists of colours used to colour chains A and B of each isoform differently
colorsA=[0xfff5eb,0xfee6ce,0xfdd0a2,0xfdae6b,0xfd8d3c,0xf16913,0xd94801,0xa63603,0x7f2704]
colorsB=[0xf7fcf5,0xe5f5e0,0xc7e9c0,0xa1d99b,0x74c476,0x41ab5d,0x238b45,0x006d2c,0x00441b]

         
# Create list for storing structure components for each isoform. 
# First item set to None so list index matches isoform number 
viewPDB_struct = [None]
# Loop over isoforms
for iso in range(1, 10):
    # Define files to load
    fname = 'AC' + str(iso) + '.pdb'
    AC_struct_file = nglview.FileStructure(os.path.join(useCaseDir, fname))

    # Create a component object for displaying the structure of the current isoform
    viewPDB_struct.append(viewPDB.add_component(AC_struct_file))
    #Clear default representation from the component and add cartoon representations for both chains
    viewPDB_struct[iso].clear_representations()
    viewPDB_struct[iso].add_representation('cartoon', sele=':A', color=colorsA[9-iso])
    viewPDB_struct[iso].add_representation('cartoon', sele=':B', color=colorsB[9-iso])

# Display the widget
viewPDB

The structures of the AC isoforms were created via homology modelling using the same template, so they are already aligned, as can be seen in the viewer above. The region where there are significant structural differences between the isoforms is in a flexible loop region that was not defined in the template structure. There are also variations in sequence length across AC isoforms in this region.

## Calculating the electrostatic potentials of the proteins

In the following cells, we calculate the electrostatic potentials of each AC isoform using the multipipsa software tool. For more information on this step, see the other molecular use case [Calculate the electrostatic potential of a protein from its atomic structure](https://collab.humanbrainproject.eu/#/collab/1655/nav/362934).

In [None]:
# Create a list of isoform structures
structures = ["AC1","AC2", "AC3", "AC4", "AC5", "AC6", "AC7", "AC8", "AC9"]
# Define the location of the PIPSA software executables
pipsaDir = os.path.join(os.path.dirname(inspect.getfile(PipsaRun)), 'data', 'pipsa')

In [None]:
# Create an ApbsRun instance for the current calculation
epCalc = ApbsRun(
                    dataDir=useCaseDir,    # Pass the use case work directory as the directory for running the calculation
                    pipsaRoot=pipsaDir,    # Pass the location of the PIPSA executables defined above
                    temp='298.15',         # Define the temperature in Kelvin
                    ios='0.100',           # Define the solvent ionic strength in Molar concentration
                    pH='7.4',              # Define the solvent pH
                    structures=structures  # Pass the list of structures defined above
                ) 

epCalc.runPdb2Pqr()
epCalc.runApbs()

## Performing PIPSA analysis

In this use case, we compare the electrostatic potentials of the AC isoforms within the whole skin surrounding the isoform structures. In the following cell we define a filename for our output results files, and initialise the calculation by creating an instance of the PipsaRun class in multipipsa.

In [None]:
# Define a string for the output image files
imageFilename="wholeStructure"

pipsaCalc = PipsaRun(pipsaRoot=pipsaDir,
                     dataDir=useCaseDir,
                     pointsTemplate='AC5')

The following cell calculates all pairwise similarities for the electrostatic potentials of the AC isoforms. These are then used to cluster the isoforms.

In [None]:
cluster = ClusterPipsa(structures=structures,
                       pipsaRoot=pipsaDir, 
                       dataDir=useCaseDir, 
                       distanceType=DistanceType.PIPSA,
                       graphicsFileRoot=imageFilename)

pipsaCalc.runClusterPipsa(structures=structures,
                          points=[],
                          cluster=cluster)

## Visualizing the electrostatic similarity across isoforms 

The PIPSA analysis above creates as output an image file showing the pairwise distances between AC isoforms as a 2D heatmap. The results are also clustered using a single linkage hierarchical method. The resulting dendrograms are shown along the edges of the heatmap. The following cell displays this image.

In [None]:
Image.open(os.path.join(useCaseDir, 'wholeStructure_heat3_1.png'))
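
To illustrate what single linkage clustering does with a matrix of pairwise distances, here is a naive pure-Python sketch. It is not the routine used by multipipsa, which relies on R packages for the clustering and heatmap. The function name `single_linkage`, the labels and the distance values are made up for illustration.

```python
def single_linkage(labels, dist):
    """Naive single linkage agglomerative clustering.

    labels: list of item names
    dist:   symmetric distance matrix as a list of lists, ordered like labels
    Returns the merges as (cluster_a, cluster_b, distance) tuples, in order.
    """
    index = {name: i for i, name in enumerate(labels)}
    clusters = [(name,) for name in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members
                d = min(dist[index[m]][index[n]]
                        for m in clusters[a] for n in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with their union
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical distances between four proteins (not results from this use case)
labels = ["P1", "P2", "P3", "P4"]
dist = [
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 1.1],
    [0.9, 0.8, 0.0, 0.3],
    [1.0, 1.1, 0.3, 0.0],
]

merges = single_linkage(labels, dist)
for cluster_a, cluster_b, d in merges:
    print(cluster_a, "+", cluster_b, "at distance", d)
```

Each merge corresponds to one branch point in a dendrogram, and the merge distance gives the height of that branch point. Because single linkage uses the closest pair of members, two clusters are joined as soon as any one protein in each is similar to the other.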

The clustering obtained here can be compared to that in Figure 2A of [Tong et al (2016)](https://doi.org/10.1002/prot.25167) (although an unrooted dendrogram is shown there). The electrostatic similarity also reproduces some of the known regulation groupings of AC isoforms (see Table 1 in [Tong et al (2016)](https://doi.org/10.1002/prot.25167)). For example, AC2, AC4 and AC7 cluster together, as do AC5 and AC6.

## Saving your data to the collab storage area 
In the final cell, your data will be moved to the storage area for your collab, from where you can download your files, and the local working directory will be cleaned.

In [None]:
# Set up a timestamped directory name for saving results to the storage area
baseStorageDir = 'multipipsaWholePIPSA_'
timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
storageDir = os.path.join(collab_path, baseStorageDir + timestamp)
try:
    print('Creating storage directory: %s' % storageDir)
    storage_client.mkdir(storageDir)
except:
    print('There was an error creating the storage directory')
else:
    # Copy files to the storage area and remove the local files
    cleanDir = True
    for fName in os.listdir(useCaseDir):
        localFile = os.path.join(useCaseDir, fName)
        storageFile = os.path.join(storageDir, fName)
        fType = magic.Magic(mime=True).from_file(localFile)
        try:
            storage_client.upload_file(localFile, storageFile, fType)
        except:
            print('Error copying %s to storage' % fName)
            cleanDir = False
        else: 
            os.remove(localFile)
            
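    os.chdir(homeDir)
    if cleanDir:
        # Every file was uploaded, so the now empty working directory can be removed
        os.rmdir(useCaseDir)
        print('All files in the working directory have been moved to the storage area directory:')
        print(storageDir)
    else:
        # At least one upload failed, so keep the local directory and its remaining files
        print('Some files could not be copied and remain in: %s' % useCaseDir)
        print('Files that were copied are in the storage area directory: %s' % storageDir)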