# Protein-ligand docking
#### Code adapted from Catalina Arnaiz

## 1. Protein Sanitization

To "sanitize" the protein, that is, to strip undesirable ligands from it, we will use **LePro**, a tool from the Lephar molecular docking suite. LePro automatically adds hydrogen atoms to proteins and/or nucleic acids, explicitly considering the protonation state of histidine. It also removes all crystal water, ions, small ligands and cofactors except HEM.

LePro software download: http://www.lephar.com/software.htm 

In [7]:
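# Clean the protein (macOS binary); the path uses forward slashes
! ./lepro_mac Output_docking/Protein1/Protein1.pdb
# Running the Linux binary with no arguments prints the usage message
! ./lepro_linux_x86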

************************************************************
*      LePro                                               *
*            Add hydrogen atoms to a protein &             *
*            write the input file for LeDock               *
*      Copyright (C) 2013-21 Hongtao Zhao, PhD             *
*      Email: htzhaovv@gmail.com                           *
************************************************************
----------Usage:                                                                       
          lepro [PDB file] [-rot || -metal || -p]                                        
          -rot  [[chain] resid] align principal axes of the binding site with Cartesian
          -metal keep ZN/MN/CA/MG                                                      
          -metal -p redistribute metal charge to protein                               



We now have a "clean" protein PDB file, automatically saved as *pro.pdb*, and a configuration file that can be used with LeDock (another docking program).

**NOTE**: running this cell on Windows is difficult because no LePro executable is available for that OS. You can instead prepare your protein in Chimera by adding H (Tools > Structure Editing > Add H) and running a structure optimization protocol: 500 steepest descent steps + 200 conjugate gradient steps (Tools > Structure Editing > Minimize Structure). Then delete any excess molecules present in the structure (Select > "whatever" + Actions > Atoms > delete). Alternatively, you can run LePro from your Virtual Machine.
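If neither LePro nor Chimera is at hand, the ligand-removal part of this step can be approximated in plain Python. The following is only a minimal sketch, not what LePro does internally: the function name `strip_heteroatoms` and the toy PDB lines are made up for illustration, and it neither adds hydrogens nor handles alternate locations.

```python
def strip_heteroatoms(pdb_text, keep_resnames=("HEM",)):
    """Keep ATOM/TER/END records and only the HETATM residues listed in keep_resnames."""
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "HETATM":
            resname = line[17:20].strip()  # residue name, PDB columns 18-20
            if resname in keep_resnames:
                kept.append(line)
        elif record in ("ATOM", "TER", "END"):
            kept.append(line)
    return "\n".join(kept) + "\n"


# Toy input (hypothetical records): one protein atom, a water, a heme iron and a zinc ion
example_pdb = "\n".join([
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "HETATM    2  O   HOH A 101      10.000  10.000  10.000",
    "HETATM    3 FE   HEM A 201      12.000  12.000  12.000",
    "HETATM    4 ZN    ZN A 301      14.000  14.000  14.000",
    "END",
])

cleaned = strip_heteroatoms(example_pdb)
print(cleaned)
```

The water and the zinc ion are dropped, while the protein atom and the HEM record survive. Hydrogens still have to be added with a dedicated tool.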

## 2. Ligand Sanitization - Ligand battery prep

The protein now has a clean structure with H added, so it is time to think about the ligands we want to dock onto it. Once a list of putative ligands/interactors has been created, they need to be prepared for the docking protocol. This is a simple step that can be done using two Python packages: *rdkit* and *openbabel*.

**NOTE**: I encourage you to use VSCode to edit Notebooks, Python or even Ruby scripts. VSCode makes the installation of packages very simple (at least for Python). In my experience, working through Anaconda is easier, as it lets you create environments. I like to create environments for different "jobs", as a way to keep my work organized. For protein docking protocols I have created a new *conda environment* containing the two packages mentioned above.

The following code should be run on the terminal of a VSCode window that has been opened through Anaconda Navigator:
```
conda create -c rdkit -n my-rdkit-env rdkit
conda env list  # verify that the environment is available
conda activate my-rdkit-env
conda install -c conda-forge openbabel -n my-rdkit-env  # install openbabel in your newly created rdkit environment
```

In [1]:
### Import packages
import math

import matplotlib.pyplot as plt
import pandas as pd
from openbabel import pybel
from rdkit import Chem, DataStructs
from rdkit.Chem import (AllChem, Descriptors, Draw, MACCSkeys, PandasTools,
                        rdFingerprintGenerator)

In [28]:
# Load the CSV file (containing, for example, metabolomic results) into a dataframe
df = pd.read_csv('./Databases/qsar_ro5_vemuraf_database.csv', header=0) 
print(f"Number of molecules: {df.shape[0]}\n"+
      f"Column names: {df.columns}")
df.head(3)

Number of molecules: 33
Column names: Index(['CanonicalSMILES', 'CID', 'ROMol', 'tanimoto_maccs', 'dice_maccs',
       'tanimoto_morgan', 'dice_morgan', 'MACCS_fp', 'pIC50_pred', 'IC50_pred',
       'molecular_weight', 'n_hba', 'n_hbd', 'logp', 'TPSA', 'ro5_fulfilled'],
      dtype='object')


Unnamed: 0,CanonicalSMILES,CID,ROMol,tanimoto_maccs,dice_maccs,tanimoto_morgan,dice_morgan,MACCS_fp,pIC50_pred,IC50_pred,molecular_weight,n_hba,n_hbd,logp,TPSA,ro5_fulfilled
0,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,58087463,<rdkit.Chem.rdchem.Mol object at 0x00000192367...,0.917808,0.957143,0.7,0.823529,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,6.897704,126.559967,451.101348,5,3,3.241,129.22,True
1,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,58087390,<rdkit.Chem.rdchem.Mol object at 0x00000192367...,0.905405,0.950355,0.7,0.823529,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,6.895076,127.328152,449.085698,5,3,3.3216,129.22,True
2,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)CC2=CNC3=C2C=C(C...,58086030,<rdkit.Chem.rdchem.Mol object at 0x00000192367...,0.914286,0.955224,0.541667,0.702703,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,6.892161,128.1855,399.061982,3,2,4.237,74.85,True


In [29]:
# Add Hydrogens to molecules
mols = [Chem.MolFromSmiles(m) for m in df.CanonicalSMILES]
hmols = [Chem.AddHs(m) for m in mols]

#for mol  in hmols:
#     AllChem.EmbedMolecule(mol,AllChem.ETKDG()) #embed molecules using ETKDG conformer generation method (there are diff. embedding methods)
#     AllChem.UFFOptimizeMolecule(mol,1000) #clean using a force field - Universal Force Field (UFF)

#save info to variables
smiles = list(df.CanonicalSMILES)
cids = list(df.CID)

The following steps create the mol2 file that will be used by the docking program as a metabolite library:

1. Read molecules from SMILES and add a title to each one. I recommend you include an ID that can be traced back to the original metabolite database. It will make the results analysis step much easier.
2. Create 3D coordinates: make3D adds H (we already have them, but there is no conflict) and performs a quick local optimization (50 steps by default) with the chosen force field, here MMFF94s. We want to refine this optimization further, which is why we then call localopt with 500 steps.
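To make the traceable-ID recommendation in step 1 concrete, here is a small sketch of the `mol_<libraryID>_<index>` naming convention used in this notebook. The helper names `make_title` and `parse_title` are hypothetical and introduced only for illustration.

```python
def make_title(cid, index, prefix="mol"):
    """Build a molecule title such as 'mol_58087463_0' from a library ID and a row index."""
    return f"{prefix}_{cid}_{index}"


def parse_title(title):
    """Recover (library ID, row index) from a title built by make_title."""
    _prefix, cid, index = title.rsplit("_", 2)
    return int(cid), int(index)


title = make_title(58087463, 0)
print(title)
print(parse_title(title))
```

Being able to invert the title is what later lets you map a docking score back to the row of the original database.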

In [6]:
# Open the output file for writing (overwrites an existing file)
out = pybel.Outputfile(filename='./files/docking_vemuraf_ligands.mol2', format='mol2', overwrite=True)

for index, cid in enumerate(cids):
    mol = pybel.readstring(string=smiles[index], format='smiles')  # read the molecule from its SMILES
    print(mol, cid)
    mol.title = 'mol_' + str(cid) + '_' + str(index)  # title = mol_libraryID_index
    mol.addh()
    mol.make3D('mmff94s')  # generate 3D coordinates
    mol.localopt(forcefield='mmff94s', steps=500)  # locally optimize the coordinates

    out.write(mol)
out.close()

CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)c1c[nH]c2c1cc(cn2)CCC(=O)O)F	
 58087463
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)c1c[nH]c2c1cc(cn2)C=CC(=O)O)F	
 58087390
CCCS(=O)(=O)Nc1c(c(c(cc1)F)Cc1c[nH]c2c1cc(cn2)Cl)F	
 58086030
CCCS(=O)(=O)Nc1cc(c(cc1)F)C(=O)c1c[nH]c2c1cc(cn2)c1cn(nc1)C	
 66646723
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)c1c[nH]c2c1cc(cn2)c1nn(cc1)C)F	
 53262272
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)c1c[nH]c2c1cc(cn2)c1ccnn1C)F	
 66647044
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)c1c[nH]c2c1cc(cn2)c1cn(nc1)C)F	
 58087124
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)c1c[nH]c2c1cc(cn2)c1cn(nc1)C)Cl	
 66646554
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)Cc1cc2c(nc1)nc([nH]2)c1ccc(cc1)C)F	
 58095332
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)Cc1cc2c(nc1)nc([nH]2)c1cccc(c1)C)F	
 58095364
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)Cc1cc2c([nH]c(c2Br)C)nc1)F	
 58252053
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)Cc1cc2c([nH]c(c2C)C)nc1)F	
 58251992
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)Cc1cnc2c(c1)cc([nH]2)CC)F	
 58252033
CCCS(=O)(=O)Nc1c(c(c(cc1)F)C(=O)Cc1cc2c([nH]cc2C)nc1)F	
 582520

Now to the even more fun part (lol):

## 3. Docking 

### 3.1. Docking coordinate calculation

There are many methods for calculating the coordinates of the site on your protein where the ligands will be docked. We will see 3 of them:

1. Manual calculation: this is not the best method, as it depends on third-party predictions of putative binding residues like the ones offered by the I-TASSER server.
2. fpocket: very easy to use, can be installed through conda.
3. CB-Dock

#### 3.1.1. Manual calculation

You will need the Biopython package for this step, more specifically the PDBParser module. Install the package through conda:

```
    conda install -c conda-forge biopython
```

Then you will be ready to **parse** any PDB file you want!


In [113]:
### Get coordinates of the binding site center in Chain A (mean of the atom coordinates of the ligand in Ligand_3og7.pdb)

from Bio.PDB.PDBParser import PDBParser
import statistics

def get_coords(PDB_ligand):
    parser = PDBParser(PERMISSIVE=1)
    structure = parser.get_structure("Vemurafenib", PDB_ligand)
    model = structure[0]
    ligand = model.get_list()[0].get_list()[0]
    
    coord_x=[]
    coord_y=[]
    coord_z=[]

    for atom in ligand:
        coord = (atom.get_coord())
        coord_x.append(coord[0])
        coord_y.append(coord[1])
        coord_z.append(coord[2])
                            
    mean_x = statistics.mean(coord_x)
    mean_y = statistics.mean(coord_y)
    mean_z = statistics.mean(coord_z)

    return(mean_x, mean_y, mean_z)

print("Binding center Chain A:", get_coords("./files/Ligand_3og7.pdb"))

Binding center Chain A: (1.8685151, -2.6376667, -19.917727)
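The mean coordinate gives the box center, but the docking step also needs a box size. As a rough sketch, assuming the ligand file contains only ligand atoms in standard fixed-column PDB format, the size can be estimated from the ligand extent plus some padding. The function name `ligand_box`, the 4 Å padding and the toy atoms are assumptions made for illustration, not values from the original protocol.

```python
def ligand_box(pdb_text, padding=4.0):
    """Return (center, size): the mean atom position and a box around it that
    contains every atom with `padding` angstroms to spare on each side."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # x, y, z occupy PDB columns 31-38, 39-46 and 47-54
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    if not coords:
        raise ValueError("no ATOM/HETATM records found")

    center, size = [], []
    for axis in zip(*coords):
        mean = sum(axis) / len(axis)
        half = max(max(axis) - mean, mean - min(axis)) + padding
        center.append(mean)
        size.append(2 * half)
    return tuple(center), tuple(size)


def toy_hetatm(serial, x, y, z):
    """Format one hypothetical ligand atom as a fixed-column HETATM record."""
    return f"HETATM{serial:5d}  C1  LIG A   1    {x:8.3f}{y:8.3f}{z:8.3f}"


toy_ligand = "\n".join([
    toy_hetatm(1, 0.0, 0.0, 0.0),
    toy_hetatm(2, 2.0, 4.0, 6.0),
    toy_hetatm(3, 4.0, 2.0, 0.0),
])

center, size = ligand_box(toy_ligand)
print("center:", center)
print("size:", size)
```

The center matches what `get_coords` computes (the mean), so the two approaches can be used side by side. How much padding is sensible depends on the ligands you plan to dock.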


### 3.2. Docking using Smina

Smina is run through a Linux or macOS executable file and just needs to be called with the required and any optional arguments.

```
    cat $OUTPUT/protein/protein_CB_dock_coords.txt | while IFS=$' ' read x y z j k l m; do ./smina -r $INPUT/protein/protein_cleanH.pdb -l 10_Mols_BEC.mol2 \
    -o $OUTPUT/protein/protein_smina.sdf --center_x "$x" --center_y "$y" --center_z "$z" --size_x "$j" --size_y "$k" --size_z "$l" --exhaustiveness 8 \
    --num_modes 1 --seed 7683; done
```
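If you prefer to launch smina from the notebook, the same call can be assembled in Python. This is a sketch under assumptions: the helper name `smina_command` is hypothetical, the file names and box values are placeholders, and the coordinates line is assumed to hold the center (x y z) followed by the box size (x y z), as the `read x y z j k l m` loop above implies. Only the flags that appear in the shell command are used.

```python
import shlex


def smina_command(receptor, ligands, out, box_line,
                  exhaustiveness=8, num_modes=1, seed=7683, exe="./smina"):
    """Build the smina argument list from one whitespace-separated coordinates line."""
    fields = box_line.split()
    if len(fields) < 6:
        raise ValueError("expected at least 6 values: center x y z and size x y z")
    cx, cy, cz, sx, sy, sz = fields[:6]  # any extra column is ignored
    return [
        exe, "-r", receptor, "-l", ligands, "-o", out,
        "--center_x", cx, "--center_y", cy, "--center_z", cz,
        "--size_x", sx, "--size_y", sy, "--size_z", sz,
        "--exhaustiveness", str(exhaustiveness),
        "--num_modes", str(num_modes),
        "--seed", str(seed),
    ]


# Placeholder inputs, for illustration only
cmd = smina_command("protein_cleanH.pdb", "ligands.mol2", "out.sdf",
                    "1.87 -2.64 -19.92 20 20 22 0.5")
print(shlex.join(cmd))

# To actually run the docking (requires the smina executable):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems, and fixing the seed keeps repeated runs comparable.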

As output, smina produces an sdf file containing the 3D coordinates of all docked molecules. You can then load this sdf file into a pandas dataframe (as shown earlier) and merge it with your ligand table on the library ID, which is the column both dataframes share. Once you have done this, sort the docking results by affinity, from most negative (strongest predicted binding) to most positive.

The way I like to complete this task is by creating a "cheatsheet" with the information on each metabolite:


In [112]:
numbers_ = []  # IDs of the form "mol_<CID>_<index>", one per metabolite

for index, lib in enumerate(cids):
    name = 'mol_' + str(lib) + '_' + str(index)  # same molecule title as in the mol2 file
    numbers_.append(name)

# Dataframe with all available info on each metabolite
# (note: the name `all` shadows Python's built-in all() in this notebook)
all = pd.DataFrame(list(zip(smiles, cids, numbers_)), columns=["SMILES", "CIDs", "ID"])
print(all.shape)
all.head()

(33, 3)


Unnamed: 0,SMILES,CIDs,ID
0,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,58087463,mol_58087463_0
1,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,58087390,mol_58087390_1
2,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)CC2=CNC3=C2C=C(C...,58086030,mol_58086030_2
3,CCCS(=O)(=O)NC1=CC(=C(C=C1)F)C(=O)C2=CNC3=C2C=...,66646723,mol_66646723_3
4,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,53262272,mol_53262272_4


Once I have this, I can load the smina output file and merge both tables on the ID I gave each metabolite at the beginning. Make sure the descriptors on the cheatsheet always match those in the initial metabolite database, so that both dataframes share a common key for the merge.

In [105]:
smina = PandasTools.LoadSDF('./files/docking_results_vemuraf_3og7.sdf', embedProps=True, molColName=None, removeHs=False)
# print(smina)
smina["minimizedAffinity"] = smina["minimizedAffinity"].astype("float")  # SDF properties load as strings
smina_grouped = smina.groupby(by = "ID").agg("min")  # keep the best (lowest) score per molecule
final_smina = smina_grouped.sort_values("minimizedAffinity")  # most negative first

In [106]:
merge = pd.merge(final_smina, all, on='ID')
merge

Unnamed: 0,ID,minimizedAffinity,SMILES,CIDs
0,mol_66647044_5,-11.21528,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,66647044
1,mol_53262272_4,-11.17635,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,53262272
2,mol_58087124_6,-11.12042,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,58087124
3,mol_58251992_11,-11.10394,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)CC2=CC3=C(N...,58251992
4,mol_66646554_7,-11.08034,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,66646554
5,mol_58251999_28,-10.98881,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)CC2=CN=C3C(...,58251999
6,mol_58095372_20,-10.86553,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)CC2=CC3=C(N...,58095372
7,mol_58087390_1,-10.86148,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C...,58087390
8,mol_58252053_10,-10.84365,CCCS(=O)(=O)NC1=C(C(=C(C=C1)F)C(=O)CC2=CC3=C(N...,58252053
9,mol_144212136_15,-10.82212,CCCS(=O)NC1=C(C(=C(C=C1)F)C(=O)C2=CNC3=C2C=C(C...,144212136
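
The merge above is an inner join, so a ligand whose ID is missing from the docking output simply disappears from the table. One possible sanity check, sketched here on made-up toy data (the IDs, scores and variable names are illustrative only), is a left join with `indicator=True`, which flags the ligands that have no docking score.

```python
import pandas as pd

# Toy docking output: two poses for one ligand, one pose for another (scores as strings)
poses = pd.DataFrame({
    "ID": ["mol_1_0", "mol_1_0", "mol_2_1"],
    "minimizedAffinity": ["-9.1", "-10.3", "-8.7"],
})
poses["minimizedAffinity"] = poses["minimizedAffinity"].astype(float)

# Best (lowest) score per ligand
best = poses.groupby("ID", as_index=False)["minimizedAffinity"].min()

# Toy cheatsheet with one ligand that was never docked
cheatsheet = pd.DataFrame({"ID": ["mol_1_0", "mol_2_1", "mol_3_2"], "CIDs": [1, 2, 3]})

# Left join keeps every ligand; the _merge column says where each row came from
checked = pd.merge(cheatsheet, best, on="ID", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "ID"].tolist()

ranked = checked.dropna(subset=["minimizedAffinity"]).sort_values("minimizedAffinity")
print("Ligands without a docking score:", missing)
print(ranked[["ID", "CIDs", "minimizedAffinity"]])
```

If `missing` is not empty, check the molecule titles in the ligand file against the IDs on the cheatsheet before interpreting the ranking.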
