# Preparation, protonation, building
## Toni Giorgino
Institute of Neurosciences  
National Research Council of Italy


# Abstract

This session will cover the steps **preliminary** to a simulation -- from a raw PDB file to a set of files constituting a **runnable** system. The **system preparation** phase, based on the PDB2PQR and propKa software, addresses problems such as assigning titration states at the user-chosen pH; flipping the side chains of HIS, ASN, and GLN residues; and optimizing the overall hydrogen bonding network. The **build** phase takes a prepared system and applies the chosen forcefield to obtain simulation-ready input files. This session provides an overview of the available options and of the feedback obtained during the preparation and building phases.

# Overview

This session will cover the steps **preliminary** to 
a simulation -- from a raw PDB file, to a set of
files constituting a **runnable** system.

Currently supported output formats:
*CHARMM* and *AMBER*.

We shall also deal with transmembrane domains.

<img src="img/overview.svg" style="width: 70%"/>

# Let's start

In [None]:
# %qtconsole
from htmd import *

# Part 1. Protein preparation

The system preparation phase is based on the PDB2PQR software. It 
includes the following steps (from the
[PDB2PQR algorithm
description](http://www.poissonboltzmann.org/docs/pdb2pqr-algorithm-description/)):

 * Computing empirical pKa values for the residues' local environment (propKa);
 * Assigning titration states at the user-chosen pH;
 * Flipping the side chains of HIS (including user-defined HIS states), ASN, and GLN residues;
 * Rotating the sidechain hydrogen on SER, THR, TYR, and CYS (if available);
 * Determining the best placement for the sidechain hydrogen on neutral HIS, protonated GLU, and protonated ASP;
 * Optimizing all water hydrogens.
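
As a rough illustration of the titration step, the Henderson-Hasselbalch relation gives the fraction of a group that is protonated at a given pH once its pKa is known. The sketch below is not PDB2PQR or propKa code; the function name `protonated_fraction` and the pKa values are made up for the example.

```python
def protonated_fraction(pka, ph=7.0):
    """Fraction of a titratable group that carries its proton at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Illustrative pKa values (hypothetical, not computed by propKa).
for label, pka in [("ASP", 3.8), ("HIS", 6.5), ("LYS", 10.5)]:
    fraction = protonated_fraction(pka)
    print(f"{label}: pKa {pka:4.1f} -> {fraction:.3f} protonated at pH 7")
```

A group whose pKa is close to the chosen pH sits near 50% protonation, which is why such residues deserve a manual check.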

The hydrogen bonding network calculations are performed by the
[PDB2PQR](http://www.poissonboltzmann.org/) software package. The pKa
calculations are performed by the [PROPKA
3.1](https://github.com/jensengroup/propka-3.1) software package.
Please see the copyright, license, and citation terms distributed with each.

Note that the PDB2PQR version used here was modified to use an
externally supplied propKa **3.1** (installed automatically as a dependency), whereas
the original had propKa 3.0 *embedded*.

The results of the function should be roughly equivalent to the
preprocessing and optimization steps of the system preparation wizard
in Schrödinger's Maestro software.

<img src="img/naming.svg" style="width: 70%"/>

Modified residue names
----------------------

The preparation step renames residues in the returned molecule
according to their protonation state.
Later system-building functions assume these residue naming conventions.
**Note**: support for alternative charge states varies between forcefields.

Charge +1    |  Neutral   | Charge -1
-------------|------------|----------
 -           |  ASH       | ASP
 -           |  CYS       | CYM
 -           |  GLH       | GLU
HIP          |  HID/HIE   |  -
LYS          |  LYN       |  -
 -           |  TYR       | TYM
ARG          |  AR0       |  -
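
The table above can be read as a lookup: for each titratable residue, one name for the protonated form and one for the deprotonated form. The following sketch encodes that lookup and picks the dominant form by comparing pKa and pH. It is only an illustration with a hypothetical helper `state_name`; the actual preparation also uses the hydrogen bonding network, for example to choose between HID and HIE, which this sketch leaves undecided.

```python
# (protonated name, deprotonated name) per titratable residue,
# transcribed from the naming table above.
STATE_NAMES = {
    "ASP": ("ASH", "ASP"),
    "GLU": ("GLH", "GLU"),
    "CYS": ("CYS", "CYM"),
    "TYR": ("TYR", "TYM"),
    "LYS": ("LYS", "LYN"),
    "ARG": ("ARG", "AR0"),
    "HIS": ("HIP", "HID/HIE"),  # tautomer choice is not resolved here
}


def state_name(resname, pka, ph=7.0):
    """Return the residue name of the dominant protonation state at this pH."""
    protonated, deprotonated = STATE_NAMES[resname]
    # The protonated form dominates when the pH is below the pKa.
    return protonated if pka > ph else deprotonated


# Hypothetical pKa values, for illustration only.
print(state_name("ASP", 3.8))   # ordinary aspartate
print(state_name("ASP", 8.2))   # strongly shifted pKa
print(state_name("HIS", 6.5))
```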



The `proteinPrepare` function requires a `Molecule` object, the protein to be prepared, as an argument, and returns the prepared system, also as a `Molecule`. Logging messages will provide information and warnings about the process.

```python
def proteinPrepare(mol_in,
                   pH=7.0,
                   verbose=0,
                   returnDetails=False,
                   hydrophobicThickness=None,
                   holdSelection=None):
```

Returns a `Molecule` object in which residues have been renamed to follow
the internal conventions on protonation (see the table above). Coordinates are
changed to optimize the H-bonding network. This should be roughly comparable to
the preparation wizard of Schrödinger's Maestro.

## Parameters

    mol_in : htmd.Molecule
        the object to be optimized
    pH : float
        pH to decide titration
    verbose : int
        verbosity
    returnDetails : bool
        whether to return just the prepared Molecule (False, default) or a molecule *and* a ResidueData
        object including computed properties
    hydrophobicThickness : float
        the thickness of the membrane in which the protein is embedded, or None if globular protein.
        Used to provide a warning about membrane-exposed residues.
    holdSelection : str
        Atom selection to be excluded from optimization.
        Only the carbon-alpha atom will be considered for the corresponding residue.

`proteinPrepare()` is a convenience function; using it
is **not** mandatory. You can instead
manipulate the input molecule with your own functions.
In particular:

* Addition of hydrogen atoms is not required
* Protonation states are set by renaming residues
* Orientations of HIS and other residues can be edited through their coordinates
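
To make the renaming idea concrete, here is a minimal sketch that renames one residue directly in PDB-format text, relying only on the fixed column layout of `ATOM`/`HETATM` records. It does not use the HTMD API; `atom_line` and `rename_residue` are hypothetical helpers, and the coordinates are placeholders.

```python
def atom_line(serial, name, resname, chain, resid, x, y, z):
    """Format a minimal PDB ATOM record (fixed-width columns)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resid:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def rename_residue(pdb_lines, chain, resid, new_name):
    """Return a copy of pdb_lines with the residue (chain, resid) renamed."""
    if len(new_name) != 3:
        raise ValueError("PDB residue names occupy exactly three columns")
    renamed = []
    for line in pdb_lines:
        is_atom = line.startswith(("ATOM", "HETATM"))
        # Columns 18-20: residue name, 22: chain, 23-26: residue number.
        if is_atom and line[21] == chain and int(line[22:26]) == resid:
            line = line[:17] + new_name + line[20:]
        renamed.append(line)
    return renamed


# Placeholder coordinates; only the residue names matter here.
lines = [
    atom_line(1, " N", "HIS", "A", 57, 0.0, 0.0, 0.0),
    atom_line(2, " CA", "HIS", "A", 57, 1.5, 0.0, 0.0),
    atom_line(3, " N", "HIS", "A", 91, 5.0, 0.0, 0.0),
]
# Request the neutral HID state for residue 57 only.
for line in rename_residue(lines, "A", 57, "HID"):
    print(line)
```

The input list is left untouched, so the original and renamed records can be compared side by side.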



## Example

Prepare trypsin (PDB: 3PTB) at pH 7.

In [None]:
tryp = Molecule("3PTB")
tryp_op = proteinPrepare(tryp)

## Preparation report

If the `returnDetails` argument is set to `True`, an object of type `ResidueData` is returned as a **second** return value. It carries a wealth of information on the preparation results.

In [None]:
tryp_op, prepData = proteinPrepare(tryp, returnDetails=True)
prepData

Most of it is accessible in the `data` property (a pandas `DataFrame`).

In [None]:
prepData.data

As such, it can be easily queried and written as a spreadsheet in Excel or CSV format.

In [None]:
prepData.data.to_excel("/tmp/tryp_data.xlsx")
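
Once exported, the report can also be inspected without HTMD. The sketch below reads a small CSV with the standard library and lists residues whose pKa lies close to the working pH, since those are the ones whose titration state is least certain. The column names (`resname`, `resid`, `chain`, `pKa`) and all values are assumptions made up for this example, not actual output of `proteinPrepare`.

```python
import csv
import io

# Hypothetical CSV export of the report; columns and values are invented.
report_csv = """resname,resid,chain,pKa
ASP,10,A,3.2
HIS,20,A,6.8
GLU,30,A,4.6
TYR,40,A,10.4
"""


def near_ph(rows, ph=7.0, window=1.0):
    """Return the rows whose pKa lies within `window` units of `ph`."""
    return [r for r in rows if abs(float(r["pKa"]) - ph) <= window]


rows = list(csv.DictReader(io.StringIO(report_csv)))
borderline = near_ph(rows)
for r in borderline:
    print(r["resname"], r["resid"], r["pKa"])
```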

## Membrane proteins

Membrane-embedded proteins are in contact with a hydrophobic region, which may alter the pKa values of membrane-exposed residues ([Teixeira et al., J. Chem. Theory Comput., 2016, 12 (3), pp 930–934](http://dx.doi.org/10.1021/acs.jctc.5b01114)). Although the effect is not currently taken into account quantitatively, if a `hydrophobicThickness` argument is provided, warnings will be generated for residues exposed to the lipid region.

<img src="img/ct-2015-01114c_0002.jpeg" style="width: 70%"/>
<!-- http://pubs.acs.org/appl/literatum/publisher/achs/journals/content/jctcce/2016/jctcce.2016.12.issue-3/acs.jctc.5b01114/20160302/images/large/ct-2015-01114c_0002.jpeg -->

Residue pKa values along the membrane normal. Negative insertion values correspond to deeper membrane insertions, while positive values correspond to more shallow locations. The insertion values were measured between the titrable group and the phosphate from the nearest lipid (see Methods and Supporting Information for details). The aqueous bulk pKa values of the pentapeptides are shown on top for comparison. Ctr and Ntr correspond to the C- and N-terminus, respectively. The two horizontal lines at ∼1 Å and ∼−6 Å correspond to the average positions of the choline nitrogens and the second carbon atoms of the acyl chains, respectively.
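
The lipid-exposure warning can be pictured as a simple geometric test. The sketch below assumes a pre-oriented structure with the membrane centred at z = 0 and its normal along the z axis, so that the hydrophobic slab spans ±thickness/2. This is a simplification: the actual criterion used during preparation may also consider, for instance, whether a residue is buried inside the protein. The helper `in_hydrophobic_slab` and all numbers are hypothetical.

```python
def in_hydrophobic_slab(z, thickness):
    """True if height z (Å) lies within a slab of this thickness centred at z = 0."""
    return abs(z) <= thickness / 2.0


thickness = 30.0  # Å, hypothetical value
# Hypothetical titratable groups: (label, z coordinate in Å).
groups = [("ASP 10", 2.5), ("LYS 20", -14.0), ("GLU 30", 22.0)]
flagged = [label for label, z in groups if in_hydrophobic_slab(z, thickness)]
print(flagged)
```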

The following example shows the preparation of the μ-opioid receptor (PDB: 4DKL).
The **pre-oriented** structure is retrieved from the OPM database.

In [None]:
mor, thickness = htmd.util.opm("4dkl") 
print(thickness)
mor.filter("protein and noh")
mor_opt, mor_data = proteinPrepare(mor, returnDetails=True,
                                   hydrophobicThickness=thickness)

exposedRes = mor_data.data.membraneExposed
mor_data.data[exposedRes]
mor_data.data[exposedRes].to_excel("/tmp/mor_exposed_residues.xlsx")

# Part 2. Building

Only the basics are covered here; extensive tutorials are available at www.htmd.org.

In [None]:
# prot=Molecule("bentryp/trypsin.pdb")
# prot.filter('chain A and (protein or water or resname CA)')


## Case 1. Globular protein, no ligand

### Step 1a. Segment.

In [None]:
tryp = Molecule("3PTB")
tryp.remove("resname BEN")
tryp_op = proteinPrepare(tryp)
tryp_seg = autoSegment(tryp_op)

### Step 1b. Solvate

In [None]:
tryp_solv = solvate(tryp_seg, pad=5)
# tryp_solv.view()
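
The `pad` argument can be pictured as enlarging the solute's bounding box by a fixed margin on every side before filling it with water. The sketch below only illustrates that geometry under this assumption; `padded_box` is a hypothetical helper, not the `solvate` implementation, and the coordinates are invented.

```python
def padded_box(coords, pad):
    """Return (lower, upper) corners of the bounding box grown by `pad` on each side."""
    axes = list(zip(*coords))  # regroup [(x, y, z), ...] into per-axis tuples
    lower = tuple(min(axis) - pad for axis in axes)
    upper = tuple(max(axis) + pad for axis in axes)
    return lower, upper


# Two invented atom positions, in Å.
coords = [(0.0, 0.0, 0.0), (10.0, 20.0, 30.0)]
lower, upper = padded_box(coords, pad=5.0)
print("lower corner:", lower)
print("upper corner:", upper)
```

Because the margin is added on both sides, each box edge grows by twice the padding.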

### Step 1c. Build (+ionize) for CHARMM.

This step also ionizes the system (option `saltconc`).

In [None]:
topos  = ['top/top_all22star_prot.rtf']
params = ['par/par_all22star_prot.prm']

tryp_charmm = charmm.build(tryp_solv, topo=topos, param=params, outdir='/tmp/build-charmm')
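
To give a feeling for what a salt concentration means in terms of ion counts, the sketch below estimates how many monovalent ion pairs correspond to a given molarity from the number of water molecules (liquid water is roughly 55.5 mol/L) and adds counter-ions to neutralize the solute. This is a back-of-the-envelope estimate under those assumptions, not the algorithm used by the build functions; `ion_counts` and the numbers are hypothetical.

```python
WATER_MOLARITY = 55.5  # mol/L, approximate value for liquid water


def ion_counts(n_water, net_charge, saltconc):
    """Estimate (n_cations, n_anions) of monovalent ions for a solvated system.

    n_water    -- number of water molecules in the box
    net_charge -- integer net charge of the solute
    saltconc   -- target salt concentration in mol/L
    """
    n_pairs = round(saltconc * n_water / WATER_MOLARITY)
    n_cations = n_pairs + max(0, -net_charge)  # neutralize a negative solute
    n_anions = n_pairs + max(0, net_charge)    # neutralize a positive solute
    return n_cations, n_anions


# Invented system: 10000 waters, solute charge -3, 0.15 M salt.
print(ion_counts(10000, -3, 0.15))
```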


### Step 1c (alternative). Build for AMBER.

This function also ionizes the system (option `saltconc`).

TIP3P parameters for Ca++ are required; here we use `frcmod.ionslrcm_cm_tip3p` (see [this discussion](https://github.com/pandegroup/openmm/issues/726)).

In [None]:
params = ["frcmod.ionslrcm_cm_tip3p"]
tryp_amber = amber.build(tryp_solv, param=params, outdir='/tmp/build-amber')

# Output

The build functions return a `Molecule` object and also write files to the specified output directory.
These are the topology files needed by the simulation software.

In [None]:
!ls  /tmp/build-charmm 

In [None]:
!ls  /tmp/build-amber

## Case 2. Building with a ligand

Coexistence and automatic placement of a ligand requires further manipulation,
because:

1. The ligand may have to be arranged in a geometrically sensible way
2. We likely need additional parameters and topologies (M. J. Harvey's parametrization session)

See the tutorial [System Building Trypsin-Benzamidine](https://www.htmd.org/docs/latest/tutorials/system-building-protein-ligand.html).

## Case 3. Membrane proteins

Pre-equilibrated membranes can be merged with pre-oriented systems,
e.g. downloaded from the OPM. See the tutorial [System Building μ-opioid Receptor in Membrane](https://www.htmd.org/docs/latest/tutorials/system-building-protein-in-membrane.html).

# Citations

Please acknowledge your use of PDB2PQR by citing:
 *   Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Jensen JH, Klebe G, Baker NA. PDB2PQR: Expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Nucleic Acids Res, 35, W522-5, 2007.

For propKa:
 *   Søndergaard CR, Olsson MHM, Rostkowski M, Jensen JH. Improved treatment of ligands and coupling effects in empirical calculation and rationalization of pKa values. J Chem Theory Comput, 7(7), 2284-2295, 2011.
