# Druggability project

### Andreu Bofill, Inés Sentís, Mariona Torrens, Alejandro Varela

This project aims to provide a simple platform that determines, for a protein and a set of ligands, whether each protein-ligand interaction results in a system with a free energy lower than -2 kcal/mol. Such a value reflects a favourable interaction between the ligand and the protein, which is a desirable property in a drug: ideally, a low interaction energy corresponds to a good drug candidate.
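
To give a feel for the -2 kcal/mol threshold, the sketch below converts a standard binding free energy into a dissociation constant through the textbook relation ΔG° = RT ln(Kd / 1 M). The function names and the 298 K default are illustrative assumptions and are not part of the project code.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def dissociation_constant(dg_kcal, temperature=298.0):
    """Return Kd in mol/L for a standard binding free energy (1 M reference state)."""
    return math.exp(dg_kcal / (R_KCAL * temperature))

def passes_threshold(dg_kcal, threshold=-2.0):
    """True when the free energy is strictly lower than the threshold."""
    return dg_kcal < threshold

for dg in (-1.0, -2.0, -6.0):
    print(dg, dissociation_constant(dg), passes_threshold(dg))
```

At 298 K, -2 kcal/mol corresponds to a Kd of roughly 34 mM, so the threshold acts as a permissive filter that also keeps weak binders.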


In [None]:
import os
import shutil  # used by parameter() to copy the generated files
import sys
import argparse
import random
from glob import glob
from htmd import *
from htmd.molecule.util import maxDistance
from htmd.protocols.equilibration_v1 import Equilibration
from htmd.protocols.production_v1 import Production
from htmd.parameterize import Configuration, Parameterisation
from natsort import natsorted

As this program must be executed from the command line, some arguments have to be specified. There are two options: pass only the protein pdb file and the ligand mol2 file, or pass all the parameterization files of the ligand. If a file is missing, an exception is raised. In the first case, use the --mol2 option to indicate the path to the ligand file and --prot for the path to the protein pdb file, as shown below. Make sure that the netcharge field of the file `parameters.config` holds the correct molecular charge.

For example:

In [None]:
$python3 SimulationModule.py --prot bentryp/trypsin.pdb --mol2 benzamidine.mol2

However, if you already have the topology (rtf) and parameter (prm) files of the ligand, you can pass them to the program using the options --rtf and --prm, together with --ligand, each followed by the path to the corresponding file:

In [None]:
$python3 SimulationModule.py --prot bentryp/trypsin.pdb --ligand bentryp/benzamidine.pdb \
    --rtf bentryp/benzamidine.rtf --prm bentryp/benzamidine.prm

In [None]:
parser = argparse.ArgumentParser(description="Druggability Project")
parser.add_argument('-l', '--ligand',
dest='ligand',
action='store',
default=None,
required=False,
help='Ligand path')

In [None]:
parser.add_argument('-p', '--prot',
dest='prot',
action='store',
default=None,
required=True,
help='Protein path')

In [None]:
parser.add_argument('-rtf', '--rtf',
dest='rtf',
action='store',
default=None,
required=False,
help='rtf path')

In [None]:
parser.add_argument('-prm', '--prm',
dest='params',
action='store',
default=None,
required=False,
help='Params path')

In [None]:
parser.add_argument('-c', '--config',
dest='config',
action='store',
default='./parameters.config',
required=False,
help='Parameters configuration file')

In [None]:
parser.add_argument('-mol2', '--mol2',
dest='mol2',
action='store',
default=None,
required=False,
help='mol2 file to generate rtf and prm files')

args = parser.parse_args()

In [None]:
def check_arguments():
    # Validate the ligand input: either a mol2 file, or pdb + rtf + prm files
    if not args.prot:
        sys.stderr.write("Error: the protein file path is missing.\n")
        sys.exit(1)
    if args.ligand:
        if args.params and args.rtf:
            if args.mol2:
                sys.stderr.write("Error: both input options were given: mol2 and pdb, rtf, prm files. "
                                 "Choose only one option.\n")
                sys.exit(1)
            ligand_path = args.ligand
            rtf_path = args.rtf
            params_path = args.params
        else:
            sys.stderr.write("Error: a ligand pdb file was given, but the rtf and prm files are missing. "
                             "Provide them with the --rtf and --prm options.\n")
    if not args.ligand or not args.params or not args.rtf:
        if not args.mol2:
            sys.stderr.write("Error: one ligand input option is required: a mol2 file, or pdb, rtf and prm files.\n")
            sys.exit(1)
        if args.mol2:
            # netcharge is expected to exist at module level (read from the configuration file).
            # parameter() returns the paths in the order (pdb, prm, rtf).
            (ligand_path,params_path,rtf_path)=parameter(args.mol2, netcharge)
    return(ligand_path,rtf_path,params_path)
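
The two input modes described above can also be checked in isolation. The following is a minimal sketch, assuming the same option names as the parser in this document; `build_parser` and `input_mode` are hypothetical helpers, and the sketch is stricter than `check_arguments` in that it rejects any mix of the two modes.

```python
import argparse

def build_parser():
    """Reduced copy of the parser above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Druggability Project")
    parser.add_argument("-p", "--prot", required=True, help="Protein path")
    parser.add_argument("-l", "--ligand", help="Ligand pdb path")
    parser.add_argument("-rtf", "--rtf", help="rtf path")
    parser.add_argument("-prm", "--prm", dest="params", help="Params path")
    parser.add_argument("-mol2", "--mol2", help="mol2 file")
    return parser

def input_mode(args):
    """Return 'mol2' or 'files'; raise ValueError for an invalid combination."""
    files = [args.ligand, args.rtf, args.params]
    if args.mol2 and any(files):
        raise ValueError("choose either --mol2 or --ligand/--rtf/--prm, not both")
    if args.mol2:
        return "mol2"
    if all(files):
        return "files"
    raise ValueError("provide --mol2, or all of --ligand, --rtf and --prm")

args = build_parser().parse_args(["--prot", "trypsin.pdb", "--mol2", "benzamidine.mol2"])
print(input_mode(args))
```

Raising an exception instead of calling `sys.exit` keeps the check testable; the caller decides how to report the error.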

In [None]:
def parse_config (config_file):
    # Each line of the configuration file has the form: key<TAB>value
    # Values are returned as strings; every key below must be present in the file
    with open(config_file, "r") as op_config:
        for line in op_config:
            if line.startswith("nbuilds"):
                nbuilds = line.split("\t")[1].strip()
            if line.startswith("minsim"):
                minsim = line.split("\t")[1].strip()
            if line.startswith("maxsim"):
                maxsim = line.split("\t")[1].strip()
            if line.startswith("run_time"):
                run_time = line.split("\t")[1].strip()
            if line.startswith("numbep"):
                numbep = line.split("\t")[1].strip()
            if line.startswith("dimtica"):
                dimtica = line.split("\t")[1].strip()
            if line.startswith("sleeping"):
                sleeping = line.split("\t")[1].strip()
            if line.startswith("netcharge"):
                netcharge = line.split("\t")[1].strip()
                print(netcharge)
    return(nbuilds, run_time, minsim, maxsim, numbep, dimtica, sleeping, netcharge)
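
For reference, the tab-separated layout that `parse_config` expects can be illustrated with a small stand-alone parser. This is a sketch under assumptions: the sample values are taken from the function defaults used later in this document (the netcharge value is purely illustrative), all values are assumed to be integers, and `parse_config_lines` is a hypothetical helper, not the project's function.

```python
def parse_config_lines(lines):
    """Parse 'key<TAB>value' lines into a dict of integers."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        key, value = line.split("\t", 1)
        settings[key.strip()] = int(value.strip())
    return settings

# Illustrative content of a parameters.config file
example_lines = [
    "nbuilds\t4",
    "minsim\t6",
    "maxsim\t8",
    "run_time\t50",
    "numbep\t12",
    "dimtica\t3",
    "sleeping\t14400",
    "netcharge\t1",
]

settings = parse_config_lines(example_lines)
print(settings)
```

Returning a dictionary avoids the unbound-variable error that a missing key would otherwise cause, and converting to `int` up front spares each caller from doing it.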

The function *parameter* is executed only if the parameterization files of the ligand were not specified at the beginning. This process is computationally demanding, and its cost depends on the number of atoms of the molecule being parameterized. The charge of the ligand must also be specified.


In [None]:
def parameter(mol2, netcharge):
    molec = Molecule(mol2)
    config = Configuration()
    config.FileName = mol2
    # Strip only the extension, so that paths such as ./ligand.mol2 are handled correctly
    molec_name = os.path.splitext(str(mol2))[0]
    # Random suffix to avoid clashes between job names
    config.JobName = os.path.basename(molec_name)+str(random.randint(1,1000))
    config.NetCharge = netcharge
    param = Parameterisation(config=config)
    paramfiles = param.getParameters()
    # Copy the generated files next to the input mol2 file
    shutil.copyfile(paramfiles['RTF'], molec_name+".rtf")
    shutil.copyfile(paramfiles['PRM'], molec_name+".prm")
    shutil.copyfile(paramfiles['PDB'], molec_name+".pdb")
    ligand_path = molec_name+".pdb"
    params_path = molec_name+".prm"
    rtf_path = molec_name+".rtf"
    return(ligand_path, params_path, rtf_path)


Once the ligand is parameterized, the platform initializes the system by docking the ligand to the protein with the *dock* function of HTMD. The top 5 poses are used to build the systems, and each pose is built independently. Starting from docked positions provides a good starting point for the simulation and saves time and computing resources.

In [None]:
def dockinit(protein_path, ligand_path):
    prot = Molecule(protein_path)
    prot.filter('protein or water or resname CA')
    prot.set('segid', 'P', sel='protein and noh')
    prot.set('segid', 'W', sel='water')
    prot.set('segid', 'CA', sel='resname CA')
    # Center the protein first, so that the box size is measured on centered coordinates
    prot.center()
    D = maxDistance(prot, 'all')
    D = D + 15
    lig = Molecule(ligand_path)
    poses, scores = dock(prot, lig)
    return (prot, poses, D)
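
The box half-size `D` can be reproduced with plain Python. The sketch below assumes that `maxDistance` measures the largest distance of any atom from the origin, which is why the coordinates are centered first; `centered` and `box_half_size` are hypothetical names, and the 15 Å padding is the value used in `dockinit`.

```python
import math

def centered(coords):
    """Shift the coordinates so that their geometric center is at the origin."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    return [[c[i] - center[i] for i in range(3)] for c in coords]

def box_half_size(coords, padding=15.0):
    """Largest distance from the origin plus a padding, in the units of coords."""
    return max(math.sqrt(x * x + y * y + z * z) for x, y, z in coords) + padding

atoms = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
print(box_half_size(centered(atoms)))  # 5 from the center plus 15 of padding -> 20.0
```

Without centering, the same two atoms would give 25.0, so a structure far from the origin would lead to an unnecessarily large water box.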

Each of the five poses is solvated, and ions are added to reach a salt concentration of 0.15 M, following the HTMD documentation.

In [None]:
def building(prot,poses,D,path_ligand_rtf,path_ligand_prm,nbuilds=4):
    # Build one solvated, ionized system per docked pose (poses 0 to nbuilds, inclusive)
    moltbuilt=[]
    for i, p in enumerate(poses):
        ligand = p
        ligand.set('segid','L')
        ligand.set('resname','MOL')
        mol = Molecule(name='combo')
        mol.append(prot)
        mol.append(ligand)

        # Solvate in a cubic water box of half-size D
        smol = solvate(mol, minmax=[[-D, -D, -D], [D, D, D]])
        topos  = ['top/top_all22star_prot.rtf', 'top/top_water_ions.rtf',path_ligand_rtf]
        params = ['par/par_all22star_prot.prm', 'par/par_water_ions.prm', path_ligand_prm]

        moltbuilt.append(charmm.build(smol, topo=topos, param=params, outdir='./docked/build/{}/'.format(i+1), 
                                      saltconc=0.15))
        if i==nbuilds:
            break
    return moltbuilt
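
To get an order of magnitude for what a 0.15 M salt concentration means in such a box, the sketch below estimates the number of ion pairs from the box volume alone. It is a rough back-of-the-envelope estimate with a hypothetical `ion_pairs` helper: it ignores the volume taken by the protein and is not how HTMD decides how many ions to place.

```python
AVOGADRO = 6.02214076e23          # particles per mole
LITRES_PER_CUBIC_ANGSTROM = 1e-27

def ion_pairs(half_size, saltconc=0.15):
    """Estimate the ion pairs needed for saltconc (mol/L) in a cubic box.

    half_size is the box half-length in Angstrom (the D value used above).
    """
    side = 2.0 * half_size
    volume_litres = side ** 3 * LITRES_PER_CUBIC_ANGSTROM
    return round(saltconc * volume_litres * AVOGADRO)

print(ion_pairs(40.0))  # 80 Angstrom box side
```

For a half-size of 40 Å this gives about 46 ion pairs, which shows that even a fairly large box contains only a few dozen ions at physiological concentration.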

After this, an equilibration protocol is run on each system. It brings each system to a temperature of 298 K over 1000 time steps.

In [None]:
def Equilibrate():
    md = Equilibration()
    md.numsteps = 1000
    md.temperature = 298
    builds=natsorted(glob('docked/build/*/'))
    for i,b in enumerate(builds):
        md.write(b,'docked/equil/{}/'.format(i+1))
    mdx = AcemdLocal()
    mdx.submit(glob('./docked/equil/*/'))
    mdx.wait()
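
`Equilibrate` relies on `natsorted` so that build directory *i* is paired with equilibration directory *i*. The sketch below shows why a plain `sorted` would not be enough once there are ten or more builds; `natural_key` is a hypothetical stand-in for what the `natsort` package provides, written with the standard library only.

```python
import re

def natural_key(path):
    """Split a path into text and integer chunks so that numbers sort numerically."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", path)]

dirs = ["docked/build/10/", "docked/build/2/", "docked/build/1/"]

print(sorted(dirs))                   # lexicographic: 1, 10, 2
print(sorted(dirs, key=natural_key))  # natural: 1, 2, 10
```

With the default of five builds both orderings coincide, so the natural sort only matters when `nbuilds` is raised.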

The equilibrated systems then enter the production step, where trajectories for each system are generated by integrating Newton's equations of motion. In this step, a 'generators' directory is created. It contains 5 folders (the default number of builds) with one simulation each. The generators are only used in the first epoch.

In [None]:
def Produce(run_time=50):
    equils=natsorted(glob('docked/equil/*/'))
    for i,b in enumerate(equils):
        md= Production()
        md.acemd.bincoordinates = 'output.coor'
        md.acemd.extendedsystem  = 'output.xsc'
        md.acemd.binvelocities=None
        md.acemd.binindex=None
        md.acemd.run=str(run_time)+'ns'
        md.temperature = 300
        md.write('./docked/equil/{}/'.format(i+1), 'docked/generators/{}/'.format(i+1))

    mdx = AcemdLocal()
    mdx.submit(glob('./docked/generators/*/'))
    mdx.wait()
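
The relation between simulated time and number of integration steps helps when choosing `run_time` and `numsteps`. The sketch below assumes a 4 fs time step, which is an assumption about the protocol settings and should be checked against the generated input files; both helper names are hypothetical.

```python
def steps_for(run_time_ns, timestep_fs=4.0):
    """Number of integration steps needed to cover run_time_ns nanoseconds."""
    return int(round(run_time_ns * 1e6 / timestep_fs))

def time_ps(numsteps, timestep_fs=4.0):
    """Simulated time in picoseconds covered by numsteps steps."""
    return numsteps * timestep_fs / 1000.0

print(steps_for(50))   # steps in a 50 ns production run
print(time_ps(1000))   # picoseconds covered by a 1000-step equilibration
```

Under that assumption, a 50 ns production run takes 12.5 million steps, while the 1000-step equilibration above covers only 4 ps, so a longer equilibration may be worth considering for real systems.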

Finally, we run adaptive sampling to generate the trajectories that will eventually be used for the ligand binding analysis. A folder called 'filtered' is created in the working directory; it contains the filtered trajectories (without water). The point of adaptive sampling is to accelerate the simulation process: each new epoch starts from conformations that have already advanced, which avoids repeating the same exploration from the beginning and covers more of the conformational space.

In [None]:
def adaptive(minsim=6,maxsim=8,numbep=12,dimtica=3,sleeping=14400):
    md = AdaptiveRun()
    md.nmin=minsim
    md.nmax=maxsim
    md.nepochs = numbep
    md.app = AcemdLocal()
    md.generatorspath='./docked/generators/'
    md.datapath='./docked/generators/'
    md.inputpath='./docked/generators/'
    md.dryrun = False
    md.metricsel1 = 'name CA'
    md.metricsel2 = 'resname MOL and noh'
    md.metrictype = 'contacts'
    md.ticadim = dimtica
    md.updateperiod = sleeping
    md.run()
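
The idea behind adaptive sampling can be illustrated with a toy model: states that have been visited rarely are given a higher chance of seeding the next epoch. The sketch below weights each state by the inverse of its frame count. It is a simplified picture with a hypothetical function name, not the code that `AdaptiveRun` executes.

```python
def respawn_probabilities(counts):
    """Map each state to a respawn probability proportional to 1 / frame count."""
    weights = {state: 1.0 / n for state, n in counts.items() if n > 0}
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}

# Hypothetical frame counts per state after one epoch
counts = {"bulk": 100, "surface": 10, "pocket": 1}
probs = respawn_probabilities(counts)
for state, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(state, round(p, 3))
```

In this example the rarely visited "pocket" state receives about 90 % of the respawn probability, which is how the procedure pushes sampling toward unexplored regions.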

## Analysis of the results:

Once the epochs are generated, we can analyse the interaction between the ligands and the protein. This part of the program, found in the file *Analysis.ipynb*, needs to be run from *jupyter notebook*; otherwise, some plots that are important for the analysis cannot be displayed.

First of all, we obtain a simulation list from the trajectories created at the previous stage of the program.

In [None]:
sims = simlist(glob('./filtered/*/'), './filtered/filtered.pdb')

To build a Markov state model, we need to project the atom coordinates onto a lower-dimensional space, which can then be used to cluster the conformations into a set of states. To do this, we use a binary contact map between the alpha-carbon atoms of the protein and the heavy atoms of the ligand.

In [None]:
metr = Metric(sims)
metr.projection(MetricDistance('protein and name CA', 'resname MOL and noh', metric='contacts'))
data = metr.project()
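
A binary contact map can be written in a few lines. In the sketch below each (alpha carbon, ligand atom) pair yields 1 when the two atoms are closer than a cutoff and 0 otherwise. The 8 Å cutoff and the function name are assumptions made for illustration; the actual threshold used by `MetricDistance` should be checked in the HTMD documentation.

```python
import math

def contact_vector(ca_coords, lig_coords, threshold=8.0):
    """Return one 0/1 entry per (CA atom, ligand atom) pair for a single frame."""
    return [
        1 if math.dist(ca, lig) < threshold else 0
        for ca in ca_coords
        for lig in lig_coords
    ]

ca_atoms = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]  # two alpha carbons
ligand_atoms = [(3.0, 0.0, 0.0)]                # one ligand heavy atom
print(contact_vector(ca_atoms, ligand_atoms))   # [1, 0]
```

Each frame of a trajectory becomes one such vector, and these vectors are the input of the dimensionality reduction and clustering steps that follow.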

We set the frame step, i.e. the time between consecutive frames, in nanoseconds.

In [None]:
data.fstep = 0.1

We now visualize the lengths of the trajectories to check whether they are all equal. Trajectories whose length differs from the mode are dropped, since they are probably corrupted.

In [None]:
data.plotTrajSizes()
data.dropTraj()
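
The filtering rule described above, keeping only trajectories whose length equals the most common length, can be sketched as follows. `keep_mode_length` is a hypothetical helper that illustrates the rule and is not the implementation of `dropTraj`.

```python
from collections import Counter

def keep_mode_length(lengths):
    """Return the most common length and the indices of trajectories that have it."""
    mode_len, _ = Counter(lengths).most_common(1)[0]
    keep = [i for i, n in enumerate(lengths) if n == mode_len]
    return mode_len, keep

lengths = [500, 500, 123, 500, 499]  # frames per trajectory (illustrative)
print(keep_mode_length(lengths))     # (500, [0, 1, 3])
```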

TICA is performed to achieve a better separation of the metastable minima.

In [None]:
tica = TICA(data, 10)
dataTica = tica.project(3)

We apply bootstrapping. Then we cluster the conformations into 1000 clusters; clusters containing fewer than 5 conformations are merged into their nearest neighbour.

In [None]:
dataBoot = dataTica.bootstrap(0.8)
dataBoot.cluster(MiniBatchKMeans(n_clusters=1000), mergesmall=5) #try with dataTica instead of dataBoot
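
Bootstrapping here means building the model on a random subset of the trajectories, so that repeating the analysis gives an idea of its statistical uncertainty. The sketch below assumes that whole trajectories are drawn without replacement, which is an assumption about the behaviour of `bootstrap(0.8)`; the function name and the `seed` argument are hypothetical.

```python
import random

def bootstrap_indices(n_traj, ratio=0.8, seed=None):
    """Pick a random subset of trajectory indices, without replacement."""
    rng = random.Random(seed)
    k = int(round(n_traj * ratio))
    return sorted(rng.sample(range(n_traj), k))

subset = bootstrap_indices(10, ratio=0.8, seed=0)
print(len(subset), subset)
```

Running the clustering and the Markov model on several such subsets and comparing the resulting free energies gives an error estimate for the final value.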

Once the clustering is done, it is time to construct the Markov model. To do this, we inspect the implied timescales (ITS) plot to see at which lag time the timescales start to converge and how many distinct timescales there are.

In [None]:
model = Model(dataBoot)
model.plotTimescales() 
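
The quantity shown in an ITS plot follows from the eigenvalues of the transition matrix estimated at lag time τ: t_i = -τ / ln|λ_i|. The sketch below evaluates it for a two-state toy model, whose only non-trivial eigenvalue is 1 - p_ab - p_ba. The transition probabilities are made-up numbers and both function names are hypothetical.

```python
import math

def two_state_eigenvalue(p_ab, p_ba):
    """Second eigenvalue of the transition matrix [[1-p_ab, p_ab], [p_ba, 1-p_ba]]."""
    return 1.0 - p_ab - p_ba

def implied_timescale(eigenvalue, lag_frames, fstep=0.1):
    """Implied timescale in ns, for a lag given in frames and a frame step in ns."""
    return -lag_frames * fstep / math.log(abs(eigenvalue))

lam = two_state_eigenvalue(0.02, 0.03)
print(implied_timescale(lam, lag_frames=50))  # about 97.5 ns
```

If the dynamics are Markovian at a given lag, recomputing this value at longer lags gives the same timescale, which is why a plateau in the plot indicates a suitable lag time.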

According to the ITS plot, we choose the lag time and the number of macrostates. In this example, we set these values to 50 and 5, respectively.

In [None]:
model.markovModel(50, 5) 

Now we visualize the states: we load the macrostates and add a ligand representation.

In [None]:
htmd.config(viewer='vmd')
model.viewStates(ligand='resname MOL and noh')
mols = model.getStates()
print(mols)

Each state is a set of binding poses (conformations) that the Markov model identified as kinetically distinct.

Now we want to obtain quantitative results about the kinetics between states. We use the Kinetics constructor, where we indicate the system temperature and ligand concentration.

In [None]:
kin = Kinetics(model, temperature=298, concentration=0.0037)

r = kin.getRates()
print(r.g0eq)
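
The link between equilibrium populations and a standard binding free energy can be sketched with the usual thermodynamic relation ΔG° = -RT ln(P_bound / (P_unbound · C)), where C is the ligand concentration in mol/L relative to a 1 M standard state. This is a generic textbook formula with made-up populations and a hypothetical function name; it is not a description of how `Kinetics` computes `g0eq`.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def binding_free_energy(p_bound, p_unbound, concentration, temperature=298.0):
    """Standard binding free energy in kcal/mol from equilibrium populations."""
    ratio = p_bound / (p_unbound * concentration)
    return -R_KCAL * temperature * math.log(ratio)

dg = binding_free_energy(p_bound=0.9, p_unbound=0.1, concentration=0.0037)
print(round(dg, 2), dg < -2.0)
```

With these illustrative populations the result is about -4.6 kcal/mol, which would pass the -2 kcal/mol criterion stated at the top of this document.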

We plot the free energies and mean first passage times of all states:

In [None]:
kin.plotRates(rates=('g0eq'))

In [None]:
kin.plotFluxPathways()