# Definition of a state space partition

In this notebook, we will perform dimensionality reduction and define milestones by a method based on that of [TCC2020]. The input data for this method consists of one or more unbinding trajectories, typically obtained by an enhanced sampling method such as metadynamics, and a reference structure, with respect to which aligned Cartesian coordinates are defined.

The workflow consists of the following steps:
1. Data input and featurization (i.e., extraction of aligned Cartesian coordinates of a selected group of atoms).
2. Projection of featurized data onto the subspace spanned by the first two principal components.
3. Manual definition of a piecewise-linear "guess path" through the projected data points.
4. Smoothing and reparametrization of the path to yield a sequence of anchor points with chosen spacing.

Milestones are identified with pairs of (adjacent) cells in the Voronoi partition generated by the anchor points.
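
To make the Voronoi picture concrete, the following is a minimal NumPy sketch, not the notebook's own implementation. It assigns each projected frame to the cell of its nearest anchor and reports each step at which the cell index changes, labelling that step with the pair of cells involved. The function names `assign_to_cells` and `cell_transitions` and the toy data are assumptions made for illustration.

```python
import numpy as np

def assign_to_cells(points, anchors):
    """Return, for each point, the index of its nearest anchor (its Voronoi cell)."""
    # Pairwise squared distances, shape (n_points, n_anchors)
    diff = points[:, None, :] - anchors[None, :, :]
    dist2 = np.sum(diff**2, axis=-1)
    return np.argmin(dist2, axis=1)

def cell_transitions(cells):
    """Return (frame index, (old cell, new cell)) for every change of cell."""
    transitions = []
    for t in range(1, len(cells)):
        if cells[t] != cells[t - 1]:
            transitions.append((t, (int(cells[t - 1]), int(cells[t]))))
    return transitions

# Toy data (hypothetical): three anchors on a line and a short 2D trajectory
anchors = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
traj = np.array([[0.1, 0.2], [0.4, -0.1], [0.7, 0.3], [1.2, 0.0], [1.6, 0.1]])

cells = assign_to_cells(traj, anchors)
transitions = cell_transitions(cells)
print(cells)
print(transitions)
```

With a finite time step a trajectory can jump between non-adjacent cells, so in practice such transitions need separate handling.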

**Output** consists of the following files:
- *anchors.npy*, a file containing the anchor points.
- *pca.h5*, a serialized `pyemma.coordinates.transform.PCA` object, used by the script *project_trajs.py* to project MD trajectories onto the components fit in step 2.


### References

[TCC2020] Z. Tang, S.-H. Chen, and C.-e. A. Chang, <a href="https://doi.org/10.1021/acs.jctc.9b01153">J. Chem. Theory Comput.</a> **16**, 1882 (2020).

In [1]:
%matplotlib ipympl

import matplotlib.pyplot as plt
import mdtraj as md
import numpy as np
import pyemma.coordinates

Input files (see http://mdtraj.org/1.9.4/load_functions.html for supported formats):

In [2]:
topfile = '/data/CDK8CycC-PL3/protein.prmtop'
reffile = '/data/CDK8CycC-PL3/0.pdb'
trajfiles = ['/data/CDK8CycC-PL3/metadynamics/nowater1.dcd']

Note that `reffile` may be a trajectory file (as opposed to a PDB file), in which case the first frame will be used as the reference structure.

We start by loading the topology and reference structure:

In [3]:
topology = md.load_topology(topfile)
reference = md.load_frame(reffile, 0, top=topology)

print(topology)
print(reference)

<mdtraj.Topology with 1 chains, 620 residues, 10346 atoms, 10480 bonds>
<mdtraj.Trajectory with 1 frames, 10346 atoms, 620 residues, without unitcells>


Following [TCC2020], we select as input features the aligned Cartesian coordinates of all C&alpha; atoms of the protein and all heavy atoms of the ligand (residue 620). Superposition is done using the backbone atoms of CDK8 (residues 1 to 359).

In [4]:
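# MDTraj's `resid` keyword is zero-based: residue 620 is `resid 619`,
# and residues 1 to 359 are `resid 0 to 358`.
selection = topology.select('(name CA) or (resid 619 and not element H)')
atoms_to_superpose = topology.select('resid 0 to 358 and backbone')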

feat = pyemma.coordinates.featurizer(topology)
feat.add_selection(selection, reference=reference, atom_indices=atoms_to_superpose)

print('Number of features:', feat.dimension())

Number of features: 1935
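
The feature count above is three Cartesian coordinates for each of the 645 selected atoms. As an illustration of what "aligned Cartesian coordinates" means, here is a sketch of rigid-body superposition using the Kabsch algorithm in plain NumPy. It is not PyEMMA's internal code; the function name `aligned_features` and the synthetic data are assumptions for this example.

```python
import numpy as np

def aligned_features(frame, reference, fit_idx, sel_idx):
    """Superpose `frame` onto `reference` using the atoms in `fit_idx`,
    then return the flattened coordinates of the atoms in `sel_idx`."""
    mu_f = frame[fit_idx].mean(axis=0)
    mu_r = reference[fit_idx].mean(axis=0)
    P = frame[fit_idx] - mu_f      # centered mobile atoms
    Q = reference[fit_idx] - mu_r  # centered reference atoms
    # Kabsch: optimal rotation from the SVD of the covariance matrix
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    aligned = (frame - mu_f) @ R.T + mu_r
    return aligned[sel_idx].ravel()

# Synthetic check: rotate and translate a random "structure", then align it back
rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 3))
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
frame = reference @ rot_z.T + np.array([1.0, -2.0, 0.5])

fit_idx = np.arange(10)      # atoms used for superposition
sel_idx = np.arange(5, 20)   # atoms whose coordinates become features
features = aligned_features(frame, reference, fit_idx, sel_idx)
print(features.shape)
```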


Define the trajectory data source and specify the features to be read.

In [5]:
reader = pyemma.coordinates.source(trajfiles, features=feat)

To reduce the dimension of the data, we project it onto the first two principal components.

In [6]:
pca = pyemma.coordinates.pca(reader, dim=2)
print(pca)

PCA(dim=2, mean=array([ 3.26382, -0.80302, ..., -0.31983,  0.25863]), skip=0,
  stride=1, var_cutoff=0.95)
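
For readers unfamiliar with PCA, the projection can be sketched in a few lines of NumPy. This is an in-memory illustration using a singular value decomposition, not PyEMMA's estimator; the function name `pca_project` and the random test data are assumptions. The sign of each component is arbitrary, so projections may differ from PyEMMA's by a sign flip.

```python
import numpy as np

def pca_project(X, dim=2):
    """Project the rows of X onto the first `dim` principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean  # center the data
    # Rows of Vt are the principal axes, ordered by decreasing singular value
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:dim]
    variances = s[:dim] ** 2 / (len(X) - 1)  # variance along each component
    return Xc @ components.T, components, variances

# Random data (hypothetical) with two dominant directions of variance
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([5.0, 2.0, 0.5, 0.5, 0.5])

Y, components, variances = pca_project(X, dim=2)
print(Y.shape, variances)
```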


The next step is an uncomfortably subjective one: we manually define a piecewise-linear "guess path" through the data points, using the smoothed trajectory (red curve) as a visual guide.

In [9]:
from plots import path_input_plot

fig, ax, line = path_input_plot(pca.get_output(), window=200)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
_ = ax.set_title('Left-click to add points. Right-click to delete.')

[Interactive figure: data projected onto PC1 and PC2, with the smoothed trajectory shown as a red curve.]

Now we smooth and reparametrize the manually defined path. The path is smoothed by a fine-grained cubic-spline interpolation and then reparametrized by arc length. Anchor points are placed at intervals of `anchor_spacing` along the reparametrized path.

**Note:** Running the next cell will modify the above plot.

In [11]:
anchor_spacing = 0.6

from util import interpolate_path

path = np.column_stack([line.get_xdata(), line.get_ydata()])
path_new, anchors = interpolate_path(path, image_spacing=anchor_spacing)

ax.set_title('')
ax.scatter(*anchors.T, color='k', marker='*', zorder=10, label='anchor points')
ax.legend()
line.remove()
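
The smoothing and reparametrization described above can be sketched as follows. This is an illustration under stated assumptions, not the code of `util.interpolate_path`: the function name `smooth_and_place_anchors`, the parameter `n_fine`, and the example path are hypothetical. The guess path is assumed to contain no repeated consecutive points, since SciPy's `CubicSpline` requires a strictly increasing parameter.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_and_place_anchors(path, spacing, n_fine=1000):
    """Smooth a piecewise-linear path with a cubic spline and place
    anchors at equal arc-length intervals along the smoothed curve."""
    path = np.asarray(path, dtype=float)
    # Parametrize the guess path by cumulative chord length
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])
    spline = CubicSpline(t, path, axis=0)
    # Evaluate the spline on a fine grid
    fine = spline(np.linspace(0.0, t[-1], n_fine))
    # Arc length accumulated along the fine-grained curve
    steps = np.linalg.norm(np.diff(fine, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(steps)])
    # Interpolate each coordinate at equally spaced arc-length values
    s_target = np.arange(0.0, s[-1] + 1e-12, spacing)
    anchors = np.column_stack(
        [np.interp(s_target, s, fine[:, k]) for k in range(fine.shape[1])]
    )
    return fine, anchors

# Hypothetical guess path in the PC1/PC2 plane
guess = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0], [3.0, 1.0]])
fine, demo_anchors = smooth_and_place_anchors(guess, spacing=0.6)
gaps = np.linalg.norm(np.diff(demo_anchors, axis=0), axis=1)
print(len(demo_anchors), gaps)
```

Because the spacing is measured along the curve, the straight-line distance between consecutive anchors is at most `spacing` and falls below it where the path bends.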

As a final step, we save the anchor points and PCA transformation object. (See comments below.)

In [12]:
np.save('anchors.npy', anchors)

In [None]:
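import pandas as pd

# Alternative source of anchors: read a precomputed path file, skip its first
# column, scale by 1/10, and offset by the first projected frame.
pca_output = pca.get_output()
csv = pd.read_csv('/data/CDK8CycC-PL2/finalpath.txt', header=None, delimiter=r"\s+")
anchors = np.asarray(csv)[:, 1:] / 10. + pca_output[0][0]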

In [10]:
np.save('anchors.npy', anchors)
for serial, atom in enumerate(topology.atoms):
    atom.serial = serial
pca.save('pca.h5', save_streaming_chain=True)

OSError: Unable to open file (file signature not found)

- The keyword argument `save_streaming_chain=True` ensures we store the featurization data needed to project additional MD trajectories onto the eigenvectors.
- The manual assignment of atom serial numbers is a hack. MDTraj sets the serial numbers to `None` when it loads our topology file, but they must be integers for the final line (`pca.save`) to execute.