# Definition of Milestones

In this notebook, we will perform dimensionality reduction and define milestones by a method based on that of [TCC2020]. The input data for this method consists of an unbinding trajectory $x_1{\to}x_2{\to}\dots{\to}x_L$, typically obtained by an enhanced sampling method such as metadynamics, and a reference structure $x_0$, with respect to which aligned Cartesian coordinates are defined. The time interval between trajectory frames is typically chosen so that $L$ is in the range of 5000 to 10000. The main difference from the method of [TCC2020] is that, here, milestones are defined via Voronoi tessellation. This avoids the procedure previously used to ensure that milestones are mutually disjoint, and facilitates the assignment of frames to the cells between milestones.

The workflow consists of the following steps:
1. Data input and featurization (i.e., extraction of aligned Cartesian coordinates of a selected group of atoms).
2. Linear dimensionality reduction via principal component analysis (PCA). Featurized data is projected onto the subspace spanned by the first two principal components.
3. Manual definition of a piecewise-linear "guess path" through the projected data points.
4. Smoothing and reparametrization of the path to yield a sequence of anchor points with chosen spacing.

Milestones are defined as the boundaries between neighboring cells in the Voronoi diagram generated by the anchor points.
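
As a toy illustration of this definition (not part of the workflow itself): a point lies in the Voronoi cell of its nearest anchor, so cell assignment reduces to a nearest-neighbor query, and a change of cell index between consecutive frames indicates that a milestone was crossed. The names `toy_anchors` and `toy_frames` and their values are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical anchors along a path in the 2D reduced space (illustration only).
toy_anchors = np.array([[0.0, 0.0], [1.0, 0.2], [2.0, 0.1], [3.0, 0.5]])

# Hypothetical projected trajectory frames, in time order.
toy_frames = np.array([[0.1, 0.0], [0.4, 0.1], [0.7, 0.2],
                       [1.2, 0.1], [1.8, 0.0], [2.9, 0.4]])

# A point belongs to the Voronoi cell of its nearest anchor.
_, cells = cKDTree(toy_anchors).query(toy_frames)

# A change of cell index between consecutive frames marks a crossing of the
# boundary (milestone) between cells a and b.
crossings = [(int(a), int(b)) for a, b in zip(cells[:-1], cells[1:]) if a != b]

print(cells)      # cell index of each frame
print(crossings)  # pairs of cells whose shared boundary was crossed
```

If frames are widely spaced in time, consecutive frames may land in cells that do not share a boundary; this sketch does not treat that case.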

### References

[TCC2020] Z. Tang, S.-H. Chen, and C.-e. A. Chang, <a href="https://doi.org/10.1021/acs.jctc.9b01153">J. Chem. Theory Comput.</a> **16**, 1882 (2020).

In [1]:
%matplotlib ipympl

import matplotlib.pyplot as plt
import mdtraj as md
import numpy as np
import pyemma

Specify input files (see http://mdtraj.org/1.9.3/load_functions.html for supported formats):

In [2]:
topfile = '/data/CDK8CycC-PL3/protein.prmtop' # Define these via
reffile = '/data/CDK8CycC-PL3/0.pdb'          # a configure script?
trajfiles = ['/data/CDK8CycC-PL3/metadynamics/nowater1.dcd']

(Note that `reffile` may be a trajectory file, in which case the first frame will be used as the reference structure.)

Load topology and reference structure:

In [3]:
topology = md.load_topology(topfile)
reference = md.load_frame(reffile, 0, top=topology)

print(topology)
print(reference)

<mdtraj.Topology with 1 chains, 620 residues, 10346 atoms, 10480 bonds>
<mdtraj.Trajectory with 1 frames, 10346 atoms, 620 residues, without unitcells>


Create a featurizer using the topology, and define the trajectory data source:

In [4]:
feat = pyemma.coordinates.featurizer(topology)
reader = pyemma.coordinates.source(trajfiles, features=feat)

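Following [TCC2020], we select as input features the aligned Cartesian coordinates of all C&alpha; atoms of the protein and all heavy atoms of the ligand (residue 620). Superposition is done using the backbone atoms of CDK8 (residues 1 to 359). Note that `resid` in the MDTraj selection language is a zero-based residue index, so these residues appear as `resid 619` and `resid 0 to 358` in the code.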

In [5]:
selection = topology.select('(name CA) or (resid 619 and not element H)')
atoms_to_superpose = topology.select('resid 0 to 358 and backbone')

feat.add_selection(selection, reference=reference, atom_indices=atoms_to_superpose)

print('Number of features:', feat.dimension())
print('Number of atoms:', len(selection))

Number of features: 1935
Number of atoms: 645
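
Each selected atom contributes three Cartesian coordinates, which is why 645 atoms yield 1935 features. The sketch below shows, in plain NumPy, one way to compute such aligned features: superpose a frame onto the reference using one set of atoms (Kabsch algorithm), then flatten the coordinates of another set. It is an illustration under stated assumptions, not PyEMMA's implementation; `kabsch_rotation`, `aligned_features`, and the toy data are hypothetical.

```python
import numpy as np

def kabsch_rotation(P, Q):
    """Rotation R minimizing ||P @ R.T - Q|| for centered coordinate sets P and Q."""
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def aligned_features(frame, ref, fit_idx, feat_idx):
    """Superpose `frame` onto `ref` using atoms `fit_idx`; flatten atoms `feat_idx`."""
    p_center = frame[fit_idx].mean(axis=0)
    q_center = ref[fit_idx].mean(axis=0)
    R = kabsch_rotation(frame[fit_idx] - p_center, ref[fit_idx] - q_center)
    aligned = (frame - p_center) @ R.T + q_center
    return aligned[feat_idx].ravel()

# Toy check: a rigidly rotated and translated copy of the reference
# should give back the reference coordinates after alignment.
rng = np.random.default_rng(0)
ref = rng.normal(size=(6, 3))  # 6 hypothetical atoms
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
frame = ref @ Rz.T + np.array([1.0, -2.0, 0.5])

features = aligned_features(frame, ref, fit_idx=[0, 1, 2, 3], feat_idx=[4, 5])
print(features.shape)  # 3 coordinates per selected atom
```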


In [6]:
import pickle

# Save the featurizer to disk
with open('featurizer.pickle', 'wb') as f:
    pickle.dump(feat, f)

In [7]:
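# Reload the featurizer from disk
with open('featurizer.pickle', 'rb') as f:
    feat2 = pickle.load(f)

In [8]:
# Data source that uses the reloaded featurizer
reader2 = pyemma.coordinates.source(trajfiles, features=feat2)

In [9]:
# Run PCA on data read with the reloaded featurizer
pca = pyemma.coordinates.pca(reader2, dim=2)
pca_output = pca.get_output()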

print(pca)

PCA(dim=2, mean=array([ 3.26382, -0.80302, ..., -0.31983,  0.25863]), skip=0,
  stride=1, var_cutoff=0.95)


To reduce the dimension of the data, we employ PCA, which finds a $d$-dimensional subspace of maximal variance. We retain the first two principal components ($d=2$).

In [10]:
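# Fit PCA and project each trajectory onto the first two principal components
pca = pyemma.coordinates.pca(reader, dim=2)
pca_output = pca.get_output()  # list with one array per trajectory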

print(pca.model)

PCAModel(eigenvectors=array([[-0.01088,  0.02742, ...,  0.00379, -0.00092],
       [-0.01797,  0.02219, ...,  0.00306,  0.00226],
       ...,
       [ 0.09441,  0.08122, ...,  0.00786, -0.25649],
       [ 0.03114,  0.07071, ...,  0.03706,  0.2692 ]]),
     mean=array([ 3.26382, -0.80302, ..., -0.31983,  0.25863]))
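
For readers unfamiliar with PCA, the projection can be sketched in plain NumPy via a singular value decomposition of the mean-centered data. This is a minimal illustration on synthetic data, not PyEMMA's implementation; the array `X` and its dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical featurized data: 500 frames, 6 features with unequal variances.
X = rng.normal(size=(500, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])

# Center the data; PCA is defined with respect to the mean.
mean = X.mean(axis=0)
Xc = X - mean

# Rows of Vt are the principal axes, ordered by decreasing singular value.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

d = 2
components = Vt[:d]      # first d principal axes, shape (d, n_features)
Y = Xc @ components.T    # projected data, shape (n_frames, d)

# Fraction of the total variance captured by each principal component.
explained = s**2 / np.sum(s**2)
print(Y.shape, explained[:d].sum())
```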


Define a path through manually selected waypoints:

In [13]:
import sys
sys.path.append('..')  # make the bkit package in the parent directory importable
import bkit.plots, bkit.util

fig, xout, yout = bkit.plots.path_input_plot(pca_output, naverage=200)
plt.xlabel('PC1')
_ = plt.ylabel('PC2')


In [14]:
spacing = 0.6 # milestone spacing (arc length)

# Waypoints of the guess path as an array of shape (n_waypoints, 2)
path = np.array([xout, yout]).T
path_new, anchors = bkit.util.interpolate_path(path, image_spacing=spacing)
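
The reparametrization step can be pictured as resampling the path at equal increments of arc length. The sketch below does this for a piecewise-linear path with linear interpolation and omits smoothing; it is not the implementation of `bkit.util.interpolate_path`, and `resample_path` and `toy_path` are hypothetical names.

```python
import numpy as np

def resample_path(waypoints, spacing):
    """Resample a piecewise-linear path at equal arc-length intervals."""
    waypoints = np.asarray(waypoints, dtype=float)
    # Cumulative arc length at each waypoint.
    seg = np.linalg.norm(np.diff(waypoints, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    # Arc-length values of the new points: 0, spacing, 2*spacing, ...
    n = int(np.floor(s[-1] / spacing + 1e-9))
    s_new = spacing * np.arange(n + 1)
    # Interpolate each coordinate as a function of arc length.
    return np.column_stack(
        [np.interp(s_new, s, waypoints[:, k]) for k in range(waypoints.shape[1])]
    )

# Hypothetical L-shaped guess path of total length 6.
toy_path = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 3.0]])
toy_anchors = resample_path(toy_path, spacing=0.6)
print(len(toy_anchors))
```

Spacing is measured along the path, so the straight-line distance between neighboring anchors can be smaller than `spacing` where the path bends between them.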

In [19]:
from scipy.spatial import Voronoi, voronoi_plot_2d
from bkit.milestoning import TrajectoryDecomposer

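# Voronoi diagram generated by the anchor points
vor = Voronoi(anchors)
fig = voronoi_plot_2d(vor, show_vertices=False)

# Assign each frame to the cell of its nearest anchor, one trajectory at a time
d = TrajectoryDecomposer(anchors)
dtrajs = [d._kdtree.query(Y)[1] for Y in pca_output]

# Color frames by cell index modulo 8 (the number of colors in 'Accent')
for Y, dtraj in zip(pca_output, dtrajs):
    plt.scatter(*Y.T, s=1, c=np.mod(dtraj, 8), cmap='Accent', zorder=-1)
_ = plt.xlabel('PC1'), plt.ylabel('PC2')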


In [None]:
from scipy.spatial import cKDTree

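cutoff = 1.0 # maximum distance from an anchor

Y = pca_output[0]  # projected frames of the first trajectory only

# Nearest anchor within the cutoff for each frame
anchor_tree = cKDTree(anchors)
_, indices = anchor_tree.query(Y, distance_upper_bound=cutoff)

# Frames with no anchor within the cutoff are assigned the index len(anchors)
close = indices < len(anchors)
far = np.logical_not(close)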

fig = voronoi_plot_2d(vor, show_vertices=False)
plt.scatter(*Y[close].T, s=1, c=np.mod(indices[close], 8), cmap='Accent', zorder=-1)
plt.scatter(*Y[far].T, s=1, color='lightgray', zorder=-1)
_ = plt.xlabel('PC1'), plt.ylabel('PC2')
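
The test `indices < len(anchors)` relies on how `cKDTree.query` reports points that have no neighbor within `distance_upper_bound`: the returned distance is infinite and the returned index equals the number of points in the tree. A small check on hypothetical data:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical anchors and query points (illustration only).
toy_anchors = np.array([[0.0, 0.0], [1.0, 0.0]])
toy_points = np.array([[0.1, 0.0], [0.9, 0.3], [5.0, 5.0]])

dist, idx = cKDTree(toy_anchors).query(toy_points, distance_upper_bound=1.0)

# Points farther than the bound from every anchor get index len(toy_anchors)
# and an infinite distance.
within = idx < len(toy_anchors)
print(idx, dist, within)
```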

Save PCA transformation object and anchor points:

In [None]:
pca.save('pca.h5', overwrite=True)
np.save('anchors.npy', anchors)