# Input Maker for String-method simulations.

With this notebook you will be able to prepare the input files for string-method simulations as well as for the optional preparatory steering simulation. The necessary `.mdp` files will be generated, the CVs will be defined and, optionally, the initial string will be created.

This notebook only deals with CVs that are distances between atoms or between centers of mass of groups of atoms. Nevertheless, we invite you to try it to understand the logic of the pull coordinates and mdp file creation, so you can later adapt it to your own CVs (dihedrals, angles, etc.). The main limitation on the CVs that can be used is the feature availability of the pull code of GROMACS.

Let's get started!

Make sure you have installed the libraries imported in this notebook.

In [None]:
import sys
import MDAnalysis as mda
import numpy as np
import glob
import matplotlib.pyplot as plt
from math import ceil
import os
import shutil
import pickle
from string import ascii_lowercase

In [None]:
def distance_atom_groups(u, sel1, sel2, progressbar=True, center_of_mass=False):
    """
    Calculate the distance between the centers of geometry (or mass) of two groups (sel1, sel2) for every frame of the trajectory in u.

    Parameters
    ----------
    u: MDAnalysis Universe containing the trajectory to analyze.
    sel1: MDAnalysis AtomGroup containing at least 1 atom.
    sel2: MDAnalysis AtomGroup containing at least 1 atom.
    center_of_mass: Use the center of mass instead of center of geometry.
    progressbar: Show progressbar.

    Returns
    -------
    d: numpy array of shape (n_frames, 2) with columns [time, distance].
    """
    from MDAnalysis import Universe
    from MDAnalysis import AtomGroup
    from numpy import array
    from tqdm import tqdm
    from numpy.linalg import norm

    assert isinstance(u, Universe), "u should be an MDAnalysis Universe."
    assert isinstance(sel1, AtomGroup), "sel1 should be an MDAnalysis AtomGroup."
    assert isinstance(sel2, AtomGroup), "sel2 should be an MDAnalysis AtomGroup."
    assert isinstance(progressbar, bool), "progressbar should be boolean."
    assert sel1.n_atoms >= 1, "sel1 should have at least 1 atom."
    assert sel2.n_atoms >= 1, "sel2 should have at least 1 atom."

    d = []
    for i, ts in tqdm(
        enumerate(u.trajectory), total=u.trajectory.n_frames, disable=not progressbar
    ):
        if center_of_mass:
            csel1 = sel1.center_of_mass()
            csel2 = sel2.center_of_mass()
        else:
            csel1 = sel1.centroid()
            csel2 = sel2.centroid()
        d.append([ts.dt * i, norm(csel1 - csel2)])
    return array(d)
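
As a rough illustration of what `distance_atom_groups` computes for a single frame, the sketch below uses plain NumPy arrays in place of MDAnalysis atom groups. The coordinates, the masses and the helper name `center` are invented for this example; only the distinction between center of geometry and center of mass mirrors the function above.

```python
import numpy as np


def center(positions, masses=None):
    """Center of geometry, or center of mass when masses are given."""
    positions = np.asarray(positions, dtype=float)
    if masses is None:
        return positions.mean(axis=0)
    masses = np.asarray(masses, dtype=float)
    return (positions * masses[:, None]).sum(axis=0) / masses.sum()


# Hypothetical two-atom group (a heavy atom and a light atom) and a single-atom group.
group1_xyz = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
group1_masses = [12.0, 1.0]
group2_xyz = [[0.0, 4.0, 0.0]]
group2_masses = [16.0]

d_geometry = np.linalg.norm(center(group1_xyz) - center(group2_xyz))
d_mass = np.linalg.norm(
    center(group1_xyz, group1_masses) - center(group2_xyz, group2_masses)
)
print(d_geometry, d_mass)
```

For single-atom groups both choices give the same distance; they differ only when a group contains atoms of different mass.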

## Choose working directory

In the cell below you can select the simulation directory (in case this notebook is located elsewhere). If the notebook is in the simulation directory, just leave it as ".".

In [None]:
%ls ../data/interim/

In [None]:
simulation_directory = "../data/raw/C2I_v1_amber/"
os.chdir(simulation_directory)
os.getcwd()

## Choosing starting and final configurations

`start.gro` and `end.gro` are used to define the initial and final values of the CVs in the string.

Note that since `.gro` files don't always carry the best topology information, you might need to add some sort of topology file, like so:
```python
start = mda.Universe('topology/top.pdb', 'topology/start.gro')
```
Of course, `start.pdb` or `end.pdb` can also be used directly.

## Choosing the number of beads on the string

Choose the number of beads of the string. Keep in mind the parallelization conditions that will be used and whether the first and last beads of the string will be mobile or fixed. For the @DelemotteLab HPC environment, 34 beads (32 of them moving) is a good starting point. Additional information about the parallelization can be found in the main `README.md` of the repository.

In [None]:
start = mda.Universe("topology/5VKH.pdb")

## Defining the CVs

The dictionary `ndx_groups` defines the index groups that will be added to `index0.ndx` and will be used by gmx to calculate the string cvs. The key-value pairs of the dictionary are the alias of the index group (no spaces please) and the `MDAnalysis` selection-string of the group. You can read more about MDAnalysis selections [here](https://docs.mdanalysis.org/stable/documentation_pages/selections.html). 

The CVs will be the distances between the centers of mass of consecutive pairs of groups.

In the example below there would be two CVs:
```python 
ndx_groups = {
    "CA_77_A": "name CA and resid 77 and segid PROA",
    "CA_77_B": "name CA and resid 77 and segid PROB",
    "112_A": "resid 112 and segid PROA",
    "13_C": "resid 13 and segid PROC",
}
```
1. The distance between CA atoms of resid 77 of segid PROA and resid 77 of segment PROB.
2. The distance between the center of mass of resid 112 of segid PROA and the center of mass of resid 13 of segid PROC. 

For this example we will use other CVs specific to GPCRs. In this case we will select the atoms by their index number. If a group is involved in two distances, for the setup of this notebook it's best to write it twice in this list with a slightly different name. This is the case for `a_4334`, which is involved in two distances, with `a_863` and with `a_1971`. For this reason we add the entries `a_863b` and `a_1971b`.
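
The pairing rule can be sketched without MDAnalysis: groups are taken in insertion order and paired two by two, so a group that takes part in two distances has to appear twice under different aliases. The dictionary below reuses selection strings from the example above; the helper `pair_groups` and the choice of which group is duplicated are illustrative assumptions.

```python
def pair_groups(ndx_groups):
    """Pair consecutive groups using 1-based group numbers."""
    names = list(ndx_groups)
    if len(names) % 2:
        raise ValueError("An even number of groups is needed to form distance CVs.")
    return [[i, i + 1] for i in range(1, len(names) + 1, 2)]


example_groups = {
    "CA_77_A": "name CA and resid 77 and segid PROA",
    "CA_77_B": "name CA and resid 77 and segid PROB",
    # CA_77_A takes part in a second distance, so it is listed again under a new alias.
    "CA_77_A_2": "name CA and resid 77 and segid PROA",
    "CA_77_C": "name CA and resid 77 and segid PROC",
}

names = list(example_groups)
for n, (a, b) in enumerate(pair_groups(example_groups)):
    print(f"cv{n} {names[a - 1]} - {names[b - 1]}")
```

Because the pairing depends only on the order of the dictionary entries, moving or commenting out a single entry shifts every pair after it, which is why it is worth printing the pairs before generating any files.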

The next cell will show you which pairs will be used as cvs:

In [None]:
ndx_groups = {
    "CA_77_A": "name CA and resid 77 and segid PROA",
    "CA_77_B": "name CA and resid 77 and segid PROB",
    "CA_77_C": "name CA and resid 77 and segid PROC",
    "CA_77_D": "name CA and resid 77 and segid PROD",
    #
    "CA_77_A_2": "name CA and resid 77 and segid PROA",
    "CA_77_C_2": "name CA and resid 77 and segid PROC",
    "CA_77_D_2": "name CA and resid 77 and segid PROD",
    "CA_77_B_2": "name CA and resid 77 and segid PROB",
    "CA_77_A_3": "name CA and resid 77 and segid PROA",
    "CA_77_D_3": "name CA and resid 77 and segid PROD",
    "CA_77_C_3": "name CA and resid 77 and segid PROC",
    "CA_77_B_3": "name CA and resid 77 and segid PROB",
    #
    "CA_104_A": "name CA and resid 104 and segid PROA",
    "CA_104_B": "name CA and resid 104 and segid PROB",
    "CA_104_C": "name CA and resid 104 and segid PROC",
    "CA_104_D": "name CA and resid 104 and segid PROD",
    #
    "CA_108_A": "name CA and resid 108 and segid PROA",
    "CA_108_B": "name CA and resid 108 and segid PROB",
    "CA_108_C": "name CA and resid 108 and segid PROC",
    "CA_108_D": "name CA and resid 108 and segid PROD",
    #
    "CA_112_A": "name CA and resid 112 and segid PROA",
    "CA_112_B": "name CA and resid 112 and segid PROB",
    "CA_112_C": "name CA and resid 112 and segid PROC",
    "CA_112_D": "name CA and resid 112 and segid PROD",
    #
    "CZ_103_A": "resid 103 and name CZ and segid PROA",
    "CG2_74_A": "resid 74 and name CG2 and segid PROA",
    "CZ_103_B": "resid 103 and name CZ and segid PROB",
    "CG2_74_B": "resid 74 and name CG2 and segid PROB",
    "CZ_103_C": "resid 103 and name CZ and segid PROC",
    "CG2_74_C": "resid 74 and name CG2 and segid PROC",
    "CZ_103_D": "resid 103 and name CZ and segid PROD",
    "CG2_74_D": "resid 74 and name CG2 and segid PROD",
    #
    "CD_100_A": "resid 100 and name CD and segid PROA",
    "CG2_75_A": "resid 75 and name CG2 and segid PROA",
    "CD_100_B": "resid 100 and name CD and segid PROB",
    "CG2_75_B": "resid 75 and name CG2 and segid PROB",
    "CD_100_C": "resid 100 and name CD and segid PROC",
    "CG2_75_C": "resid 75 and name CG2 and segid PROC",
    "CD_100_D": "resid 100 and name CD and segid PROD",
    "CG2_75_D": "resid 75 and name CG2 and segid PROD",
    #
    "CZ_103_A_2": "resid 103 and name CZ and segid PROA",
    "CD_100_D_2": "resid 100 and name CD and segid PROD",
    "CD_100_B_2": "resid 100 and name CD and segid PROB",
    "CZ_103_D_2": "resid 103 and name CZ and segid PROD",
    "CZ_103_B_2": "resid 103 and name CZ and segid PROB",
    "CD_100_C_2": "resid 100 and name CD and segid PROC",
    "CD_100_A_2": "resid 100 and name CD and segid PROA",
    "CZ_103_C_2": "resid 103 and name CZ and segid PROC",
    #
    "CA_114_A": "resid 114 and name CA and segid PROA",
    "CA_32_D": "resid 32 and name CA and segid PROD",
    "CA_32_B": "resid 32 and name CA and segid PROB",
    "CA_114_D": "resid 114 and name CA and segid PROD",
    "CA_114_B": "resid 114 and name CA and segid PROB",
    "CA_32_C": "resid 32 and name CA and segid PROC",
    "CA_32_A": "resid 32 and name CA and segid PROA",
    "CA_114_C": "resid 114 and name CA and segid PROC",
    #
    "CA_32_A_2": "resid 32 and name CA and segid PROA",
    "CA_114_A_2": "resid 114 and name CA and segid PROA",
    "CA_32_B_2": "resid 32 and name CA and segid PROB",
    "CA_114_B_2": "resid 114 and name CA and segid PROB",
    "CA_32_C_2": "resid 32 and name CA and segid PROC",
    "CA_114_C_2": "resid 114 and name CA and segid PROC",
    "CA_32_D_2": "resid 32 and name CA and segid PROD",
    "CA_114_D_2": "resid 114 and name CA and segid PROD",
    #
    "CA_118_A": "resid 118 and name CA and segid PROA",
    "CA_28_D": "resid 28 and name CA and segid PROD",
    "CA_28_B": "resid 28 and name CA and segid PROB",
    "CA_118_D": "resid 118 and name CA and segid PROD",
    "CA_118_B": "resid 118 and name CA and segid PROB",
    "CA_28_C": "resid 28 and name CA and segid PROC",
    "CA_28_A": "resid 28 and name CA and segid PROA",
    "CA_118_C": "resid 118 and name CA and segid PROC",
    #
    "OG1_107_A": "resid 107 and name OG1 and segid PROA",
    "OG1_101_D": "resid 101 and name OG1 and segid PROD",
    "OG1_101_B": "resid 101 and name OG1 and segid PROB",
    "OG1_107_D": "resid 107 and name OG1 and segid PROD",
    "OG1_107_B": "resid 107 and name OG1 and segid PROB",
    "OG1_101_C": "resid 101 and name OG1 and segid PROC",
    "OG1_101_A": "resid 101 and name OG1 and segid PROA",
    "OG1_107_C": "resid 107 and name OG1 and segid PROC",
    #
    "NE1_67_A": "resid 67 and name NE1 and segid PROA",
    "CG_80_A": "resid 80 and name CG and segid PROA",
    "NE1_67_B": "resid 67 and name NE1 and segid PROB",
    "CG_80_B": "resid 80 and name CG and segid PROB",
    "NE1_67_C": "resid 67 and name NE1 and segid PROC",
    "CG_80_C": "resid 80 and name CG and segid PROC",
    "NE1_67_D": "resid 67 and name NE1 and segid PROD",
    "CG_80_D": "resid 80 and name CG and segid PROD",
    #
    "OE1_71_A_2": "resid 71 and name OE1 and segid PROA",
    "CA_68_A": "resid 68 and name CA and segid PROA",
    "OE1_71_B_2": "resid 71 and name OE1 and segid PROB",
    "CA_68_B": "resid 68 and name CA and segid PROB",
    "OE1_71_C_2": "resid 71 and name OE1 and segid PROC",
    "CA_68_C": "resid 68 and name CA and segid PROC",
    "OE1_71_D_2": "resid 71 and name OE1 and segid PROD",
    "CA_68_D": "resid 68 and name CA and segid PROD",
    #
    "OE1_71_A": "resid 71 and name OE1 and segid PROA",
    "HN_78_A": "resid 78 and name HN and segid PROA",
    "OE1_71_B": "resid 71 and name OE1 and segid PROB",
    "HN_78_B": "resid 78 and name HN and segid PROB",
    "OE1_71_C": "resid 71 and name OE1 and segid PROC",
    "HN_78_C": "resid 78 and name HN and segid PROC",
    "OE1_71_D": "resid 71 and name OE1 and segid PROD",
    "HN_78_D": "resid 78 and name HN and segid PROD",
    #
    "CA_74_A": "resid 74 and name CA and segid PROA",
    "CA_79_A": "resid 79 and name CA and segid PROA",
    "CA_74_B": "resid 74 and name CA and segid PROB",
    "CA_79_B": "resid 79 and name CA and segid PROB",
    "CA_74_C": "resid 74 and name CA and segid PROC",
    "CA_79_C": "resid 79 and name CA and segid PROC",
    "CA_74_D": "resid 74 and name CA and segid PROD",
    "CA_79_D": "resid 79 and name CA and segid PROD",
    #
#    "O_76_A": "resid 76 and name O and segid PROA",
#    "HN_77_D": "resid 77 and name HN and segid PROD",
#    "O_76_B": "resid 76 and name O and segid PROB",
#    "HN_77_C": "resid 77 and name HN and segid PROC",
#    "O_76_C": "resid 76 and name O and segid PROC",
#    "HN_77_A": "resid 77 and name HN and segid PROA",
#    "O_76_D": "resid 76 and name O and segid PROD",
#    "HN_77_B": "resid 77 and name HN and segid PROB",
    #
#    "OE1_71_A": "resid 71 and name OE1 and segid PROA",
#    "HE1_67_A": "resid 67 and name HE1 and segid PROA",
#    "OE1_71_B": "resid 71 and name OE1 and segid PROB",
#    "HE1_67_B": "resid 67 and name HE1 and segid PROB",
#    "OE1_71_C": "resid 71 and name OE1 and segid PROC",
#    "HE1_67_C": "resid 67 and name HE1 and segid PROC",
#    "OE1_71_D": "resid 71 and name OE1 and segid PROD",
#    "HE1_67_D": "resid 67 and name HE1 and segid PROD",
#    #
#    "OE1_71_A": "resid 71 and name OE1 and segid PROA",
#    "HE1_67_A": "resid 67 and name HE1 and segid PROA",
#    "OE1_71_B": "resid 71 and name OE1 and segid PROB",
#    "HE1_67_B": "resid 67 and name HE1 and segid PROB",
#    "OE1_71_C": "resid 71 and name OE1 and segid PROC",
#    "HE1_67_C": "resid 67 and name HE1 and segid PROC",
#    "OE1_71_D": "resid 71 and name OE1 and segid PROD",
#    "HE1_67_D": "resid 67 and name HE1 and segid PROD",
#    #
    "CD1_67_A": "resid 67 and name CD1 and segid PROA",
    "CG_81_A": "resid 81 and name CG and segid PROA",
    "CD1_67_B": "resid 67 and name CD1 and segid PROB",
    "CG_81_B": "resid 81 and name CG and segid PROB",
    "CD1_67_C": "resid 67 and name CD1 and segid PROC",
    "CG_81_C": "resid 81 and name CG and segid PROC",
    "CD1_67_D": "resid 67 and name CD1 and segid PROD",
    "CG_81_D": "resid 81 and name CG and segid PROD",
}
n_groups = len(ndx_groups.keys())

In [None]:
cvs = []
for i in range(1, len(ndx_groups) + 1, 2):
    cvs.append([i, i + 1])
n_cvs = len(cvs)
print("Pairs of groups whose distance are cvs:")
keys_groups = list(ndx_groups.keys())
for i in range(n_cvs):
    print(f"cv{i} {keys_groups[cvs[i][0]-1]} - {keys_groups[cvs[i][1]-1]}")

If your atom groups have more than 1 atom, you can use the code below to check the masses of the atoms in each group and see if something is fishy:

In [None]:
print("Masses for the start config.")
print(" ")
for key in ndx_groups.keys():
    mass = start.select_atoms(ndx_groups[key]).masses
    print(f"Masses of {key}: ", end=" ")
    for i in mass:
        print(i, end=" ")
    print()

## Choosing force constants

The list `kappas` contains the force constants for the steered simulation, the restrained portions of the string simulation and the swarms of the string simulation, in that order.

The force constant of the swarm simulation should always be 0.

In [None]:
kappas = [10000.0, 10000.0, 0.0]
assert kappas[2] == 0.0, "The kappa of the swarm simulation should be 0"

These are the printing frequencies of the CVs (pull coordinates) for the steered, restrained and swarm simulations. It is best not to modify them.

In [None]:
nstxout = [50000, 5000, 5000]

## Making the input files:

At this point you need to modify `swarms.mdp`, `restrained.mdp` and `steered.mdp`. Each file contains instructions on which parts need to be modified. The pull section will be written by this notebook.

The cell below will append the pull-coord parameters to the `mdp` files and append the CV groups to a copy of `index0.ndx`, saved as `index.ndx`.

A pickle file with `cvs` and `ndx_groups` will be generated for future reference.

If you are re-running this notebook to check the steering simulation, set `write_mdps = False`.

In [None]:
write_mdps = True

In [None]:
# Use a context manager so the file is flushed and closed after writing.
with open("cv_steer.pkl", "wb") as handle:
    pickle.dump([cvs, ndx_groups], handle)
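
To reuse the definitions later, the pickle can be read back in the same order it was written. The sketch below round-trips a small stand-in list through a temporary file so that it runs anywhere; in practice you would open `cv_steer.pkl` in the simulation directory instead. The variable names are assumptions made for this example.

```python
import os
import pickle
import tempfile

# Stand-in data with the same layout as the notebook's [cvs, ndx_groups] list.
cvs_example = [[1, 2]]
groups_example = {
    "CA_77_A": "name CA and resid 77 and segid PROA",
    "CA_77_B": "name CA and resid 77 and segid PROB",
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cv_steer.pkl")
    with open(path, "wb") as handle:
        pickle.dump([cvs_example, groups_example], handle)
    with open(path, "rb") as handle:
        # Unpack in the order used when dumping: first the CV pairs, then the groups.
        loaded_cvs, loaded_groups = pickle.load(handle)

print(loaded_cvs, list(loaded_groups))
```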

In [None]:
shutil.copy("topology/index0.ndx", "topology/index.ndx")

for key in ndx_groups.keys():
    group = start.select_atoms(ndx_groups[key])
    group.write("topology/index.ndx", name=key, mode="a")

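mdp_paths = ["mdp/steered.mdp", "mdp/restrained.mdp", "mdp/swarms.mdp"]
files = []
for mdp_path in mdp_paths:
    with open(mdp_path, "r") as handle:
        files.append(handle.readlines())

# Keep everything up to and including the ";start pull" marker.
# If the marker is missing, append it to the end of the file.
for j, file in enumerate(files):
    for i, line in enumerate(file):
        if line.strip() == ";start pull":
            files[j] = file[: i + 1]
            break
    else:
        if file and not file[-1].endswith("\n"):
            file[-1] += "\n"
        file.append(";start pull\n")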

for f, file in enumerate(files):
    file.append("\n")
    file.append("pull = yes\n")
    file.append(f"pull-ngroups = {n_groups}\n")
    file.append("\n")
    for i, key in enumerate(ndx_groups.keys()):
        file.append(f"pull-group{i+1}-name = {key}\n")
    file.append("\n")
    file.append(f"pull-ncoords = {n_cvs}\n")
    for i, cv in enumerate(cvs):
        if len(cv) == 2:
            file.append(f"pull-coord{i+1}-geometry = distance\n")
            file.append(f"pull-coord{i+1}-k = {kappas[f]}\n")
            g = " ".join(str(e) for e in cv)
            file.append(f"pull-coord{i+1}-groups = {g}\n")
    file.append("\n")
    file.append("pull-print-components = no\n")
    file.append(f"pull-nstxout = {nstxout[f]}\n")
    file.append("pull-nstfout = 0\n")

if write_mdps:
    for f, file_string in enumerate(
        ["mdp/steered.mdp", "mdp/restrained.mdp", "mdp/swarms.mdp"]
    ):
        with open(file_string, "w") as file:
            for line in files[f]:
                file.write(line)

Now that the `.mdp` files and index groups have been made, it is always a good idea to inspect them and confirm that they do what you want. The pull coordinates are missing some parameters that are added automatically during the simulation. Here are the GROMACS mdp parameters for [reference](https://manual.gromacs.org/documentation/2020/user-guide/mdp-options.html#com-pulling).
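
One way to check the generated files is to parse the pull section and compare the declared counts with the entries that are actually present. The helper `check_pull_section` below is a hypothetical sketch: it only understands the `key = value` lines written by this notebook, and the sample text is made up for illustration.

```python
def check_pull_section(mdp_text):
    """Return (n_groups, n_coords) after checking them against the declared counts."""
    params = {}
    for raw in mdp_text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop mdp comments
        if "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            params[key] = value

    group_names = [
        k for k in params if k.startswith("pull-group") and k.endswith("-name")
    ]
    coord_groups = [
        k for k in params if k.startswith("pull-coord") and k.endswith("-groups")
    ]

    if int(params["pull-ngroups"]) != len(group_names):
        raise ValueError("pull-ngroups does not match the number of pull-group names.")
    if int(params["pull-ncoords"]) != len(coord_groups):
        raise ValueError("pull-ncoords does not match the number of pull-coord entries.")
    for key in coord_groups:
        for index in params[key].split():
            if not 1 <= int(index) <= len(group_names):
                raise ValueError(f"{key} refers to an undefined group {index}.")
    return len(group_names), len(coord_groups)


sample = """;start pull

pull = yes
pull-ngroups = 2

pull-group1-name = CA_77_A
pull-group2-name = CA_77_B

pull-ncoords = 1
pull-coord1-geometry = distance
pull-coord1-k = 10000.0
pull-coord1-groups = 1 2
"""

print(check_pull_section(sample))
```

To apply it to a real file, read the file contents into a string first, for example from `mdp/steered.mdp`.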

## Making the initial string `string0.txt`

If you have made your own string you can skip this section and place the corresponding file in `strings/string0.txt`. Remember that this file must follow the `np.savetxt` format and have shape (n_beads, n_cvs).

If you don't have a `string0.txt`, you can create it with the code below. The code makes a linear interpolation between the values of the CVs at `start.gro` and `end.gro`. If you want something fancier, you can program anything you want.
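
The interpolation itself is a single `np.linspace` call between the two CV vectors. The sketch below uses invented distances in ångström and converts them to nanometres, since MDAnalysis reports lengths in Å while GROMACS works in nm. The array values and the small bead count are assumptions chosen only to show the resulting shape.

```python
import io

import numpy as np

n_beads = 5
cv_start = np.array([10.0, 4.0, 7.5])  # hypothetical CV values (Å) in the start structure
cv_end = np.array([14.0, 4.0, 5.5])  # hypothetical CV values (Å) in the end structure

# One row per bead, one column per CV; divide by 10 to convert Å to nm.
string = np.linspace(cv_start, cv_end, n_beads) / 10

# Round-trip through the np.savetxt text format used for strings/string0.txt.
buffer = io.StringIO()
np.savetxt(buffer, string)
buffer.seek(0)
reloaded = np.loadtxt(buffer)
print(reloaded.shape)
```

A CV that has the same value in both structures, like the second column here, stays constant along the string.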

In [None]:
# 12+12+12
n_beads = 12

In [None]:
%rm strings/string_*

In [None]:
universes = [
    mda.Universe("topology/5VKH.pdb"),
    mda.Universe("topology/3FB5.pdb"),
    mda.Universe("topology/5VK6.pdb"),
    mda.Universe("topology/5VKE.pdb"),
]
initial = universes[0]
letters = ascii_lowercase
for i, final in enumerate(universes[1:]):
    dis_s = []
    dis_e = []
    for cv in cvs:
        dis_s.append(
            distance_atom_groups(
                initial,
                initial.select_atoms(ndx_groups[list(ndx_groups.keys())[cv[0] - 1]]),
                initial.select_atoms(ndx_groups[list(ndx_groups.keys())[cv[1] - 1]]),
                progressbar=False,
                center_of_mass=True,
            )[0][1]
        )
        dis_e.append(
            distance_atom_groups(
                final,
                final.select_atoms(ndx_groups[list(ndx_groups.keys())[cv[0] - 1]]),
                final.select_atoms(ndx_groups[list(ndx_groups.keys())[cv[1] - 1]]),
                progressbar=False,
                center_of_mass=True,
            )[0][1]
        )
    dis_s = np.array(dis_s)
    dis_e = np.array(dis_e)
    # MDAnalysis distances are in Å; divide by 10 to store the string in nm.
    string = np.linspace(dis_s, dis_e, n_beads) / 10
    np.savetxt(f"strings/string_{letters[i]}.txt", string)
    initial = final
    print(f"strings/string_{letters[i]}.txt")

In [None]:
%ls strings

Concatenate the string segments into a single string, dropping the repeated first bead of each later segment.

In [None]:
string = np.concatenate(
    [
        np.loadtxt("strings/string_a.txt"),
        np.loadtxt("strings/string_b.txt")[1:],
        np.loadtxt("strings/string_c.txt")[1:],
    ]
)
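
The `[1:]` slices matter because consecutive segments share an end point: the last bead of one segment is the first bead of the next. A toy version with one CV and made-up values shows the bead count that results.

```python
import numpy as np

n_beads = 4
# Three hypothetical segments of a one-CV string; each starts where the previous one ends.
segment_a = np.linspace([1.0], [2.0], n_beads)
segment_b = np.linspace([2.0], [1.5], n_beads)
segment_c = np.linspace([1.5], [3.0], n_beads)

# Drop the first bead of every segment after the first to avoid duplicated beads.
full_string = np.concatenate([segment_a, segment_b[1:], segment_c[1:]])
print(full_string.shape)  # (10, 1): n_beads + 2 * (n_beads - 1)
```

With `n_beads = 12` and three segments, the same rule gives 12 + 11 + 11 = 34 beads.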

FROM CLOSED TO INACTIVATED

In [None]:
k = np.linspace(string[0, 48:56], string[22, 48:56], 23)

In [None]:
string[0:23, 48:56] = k

In [None]:
k = np.concatenate([
    np.linspace(5.2,7, 3),
    np.linspace(7, 7, 9),
    np.linspace(7, 5.2, 4),
    np.linspace(5.2, 5.2, 18),
])/10.

In [None]:
k = k[::-1]

In [None]:
k

In [None]:
for i in range(1,5):
    string[:,-i] = k

In [None]:
np.savetxt("strings/string0.txt", string)

In [None]:
np.savetxt("strings/string_steer.txt", string)

## Visualize the string

In [None]:
with open("topology/show_cv.tcl","w") as file:
    file.write("set mol [molinfo top]\n")
    for cv in cvs:
        i=start.select_atoms(ndx_groups[list(ndx_groups.keys())[cv[0] - 1]]).indices[0]
        j=start.select_atoms(ndx_groups[list(ndx_groups.keys())[cv[1] - 1]]).indices[0]
        file.write(f"label add Bonds $mol/{i} $mol/{j}\n")

Whether you made your own string or generated it with this notebook, you can visualize it with the cell below.

In [None]:
string = np.loadtxt("strings/string0.txt")
n_plots = string.shape[1]
fig, ax = plt.subplots(ceil(n_plots / 2), 2, figsize=(15, 4 * ceil(n_plots / 2)))
ax = ax.flatten()
for i in range(n_plots):
    ax[i].plot(string[:, i], ls="", marker="x", label="string0")
    ax[i].set_xlabel(
        f"{list(ndx_groups.keys())[2*i]} - {list(ndx_groups.keys())[2*i+1]}", size=16
    )
    ax[i].set_ylabel("d (nm)", size=16)
    ax[i].tick_params(axis="both", which="major", labelsize=13)
    ax[i].set_title(f"cv{i}")
ax[1].legend()
if n_plots % 2:
    fig.delaxes(ax[-1])
fig.tight_layout()

## Next steps

At this point, we are ready to run the steering simulations if needed. Refer to the `README.md` for further instructions.