# **Prepare_MDM: Prepare Data from BindingDB for the Molegro Data Modeller**

This Jupyter Notebook reads a CSV (comma-separated values) file with energy terms and ligand descriptors determined using Molegro Virtual Docker (MVD) ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e); [Bitencourt-Ferreira & de Azevedo, 2019](https://doi.org/10.1007/978-1-4939-9752-7_10)) and adds binding affinity data. It needs two input files: one with binding affinity data from BindingDB ([Liu et al., 2007](https://doi.org/10.1093/nar/gkl999); [Liu et al., 2025](https://doi.org/10.1093/nar/gkae1075)) and another with ligand data generated with MVD (e.g., during a docking screen with ligands for which binding affinity data are available). All ligands in the CSV file generated with MVD should be in the input file obtained from BindingDB. The code Prepare_MDM merges the two files and outputs a CSV file with ligand information and binding affinity (e.g., pKi) for the structures in the dataset. We may employ this file to build regression models using the Jupyter Notebook [SKReg4Model (Scikit-Learn Regressors for Modeling)](https://colab.research.google.com/drive/13khGiZAgJeexwNjNDi1fSluQfcRyBCDg) or the regression methods available in the Molegro Data Modeller (MDM) ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e); [Bitencourt-Ferreira & de Azevedo, 2019](https://doi.org/10.1007/978-1-4939-9752-7_10)).
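<br></br>
The merge described above can be sketched with pandas as a join on the ligand identifier followed by a unit conversion. This is only an illustration under stated assumptions, not the notebook's own implementation: the two in-memory tables, their values, and the MVD column names (`Ligand`, `MolDock Score`, `HBond`) are hypothetical stand-ins for the real input files. The conversion uses the standard definition pKi = -log10(Ki in mol/L), which equals 9 - log10(Ki in nM).

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two input files (all values are made up).
bindingdb_csv = io.StringIO(
    "BindingDB Reactant_set_id,Ki (nM)\n"
    "1001,10.0\n"
    "1002,1000.0\n"
    "1003,0.5\n"
)
mvd_csv = io.StringIO(
    "Ligand,MolDock Score,HBond\n"
    "1001,-120.5,-4.2\n"
    "1002,-98.1,-1.0\n"
)

affinity = pd.read_csv(bindingdb_csv)
mvd = pd.read_csv(mvd_csv)

# pKi = -log10(Ki in mol/L) = 9 - log10(Ki in nM)
affinity["pKi"] = 9.0 - np.log10(affinity["Ki (nM)"])

# An inner join keeps only the ligands present in both tables.
merged = (
    affinity[["BindingDB Reactant_set_id", "pKi"]]
    .merge(mvd, left_on="BindingDB Reactant_set_id", right_on="Ligand",
           how="inner")
    .drop(columns="Ligand")
)

print(merged)
```

In this toy case, ligand 1003 has affinity data but no docking results, so the inner join drops it. A join on an exact identifier also avoids partial matches between identifiers that share digits.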
<br> </br>
<img src="https://drive.usercontent.google.com/download?id=1P9cUrTTjl5wAj-Q_jirIQ8opRoQ5c9ja&export=view&authuser=0" width=600 alt="PDB: 2A4L">
<br><i>Structure of a protein-ligand complex ([de Azevedo et al., 1997](https://doi.org/10.1111/j.1432-1033.1997.0518a.x)) with an inhibitor bound to the macromolecule (PDB access code: [2A4L](https://www.rcsb.org/structure/2A4L)). MVD ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e)) generated this figure.</i></br>
<br></br>
**References**
<br></br>
Bitencourt-Ferreira G, de Azevedo WF Jr. Molegro Virtual Docker for Docking. Methods Mol Biol. 2019; 2053: 149-167. PMID: 31452104. [DOI: 10.1007/978-1-4939-9752-7_10](https://doi.org/10.1007/978-1-4939-9752-7_10) [PubMed](https://pubmed.ncbi.nlm.nih.gov/31452104/)
<br></br>
De Azevedo WF, Leclerc S, Meijer L, Havlicek L, Strnad M, Kim SH. Inhibition of cyclin-dependent kinases by purine analogues: crystal structure of human cdk2 complexed with roscovitine. Eur J Biochem. 1997; 243(1-2): 518-26.
PMID: 9030780.
[DOI: 10.1111/j.1432-1033.1997.0518a.x](https://doi.org/10.1111/j.1432-1033.1997.0518a.x) [PubMed](https://pubmed.ncbi.nlm.nih.gov/9030780/)
<br></br>
De Azevedo WF Jr, Quiroga R, Villarreal MA, da Silveira NJF, Bitencourt-Ferreira G, da Silva AD, Veit-Acosta M, Oliveira PR, Tutone M, Biziukova N, Poroikov V, Tarasova O, Baud S. SAnDReS 2.0: Development of machine-learning models to explore the scoring function space. J Comput Chem. 2024; 45(27): 2333-2346. PMID: 38900052. [DOI: 10.1002/jcc.27449](https://doi.org/10.1002/jcc.27449) [PubMed](https://pubmed.ncbi.nlm.nih.gov/38900052/)
<br></br>
Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007; 35(Database issue): D198-201.  PMID: 17145705.
[DOI: 10.1093/nar/gkl999](https://doi.org/10.1093/nar/gkl999)
[PubMed](https://pubmed.ncbi.nlm.nih.gov/17145705/)
<br></br>
Liu T, Hwang L, Burley SK, Nitsche CI, Southan C, Walters WP, Gilson MK. BindingDB in 2024: a FAIR knowledgebase of protein-small molecule binding data. Nucleic Acids Res. 2025; 53(D1): D1633-D1644. PMID: 39574417.
[DOI: 10.1093/nar/gkae1075](https://doi.org/10.1093/nar/gkae1075)
[PubMed](https://pubmed.ncbi.nlm.nih.gov/39574417/)
<br></br>
Thomsen R, Christensen MH. MolDock: a new technique for high-accuracy molecular docking. J Med Chem. 2006; 49(11): 3315-21. PMID: 16722650. [DOI: 10.1021/jm051197e](https://doi.org/10.1021/jm051197e) [PubMed](https://pubmed.ncbi.nlm.nih.gov/16722650/)
<br></br>

The complete Python code follows.

In [None]:
#!/usr/bin/env python3
#
################################################################################
# Dr. Walter F. de Azevedo, Jr.                                                #
# [Scopus](https://www.scopus.com/authid/detail.uri?authorId=7006435557)       #
# [GitHub](https://github.com/azevedolab)                                      #
# January 12, 2025                                                             #
################################################################################
#
# Import section
import csv, requests, sys, warnings
import pandas as pd
import numpy as np

# Ignore warnings
warnings.filterwarnings("ignore")

################################################################################
# Define MVD() class                                                           #
################################################################################
class MVD(object):
    """Class to create a CSV file with MVD data and BindingDB affinity
    data.

    It has the following attributes:
    bind_csv_in (string):           Input CSV file with binding affinity data
    bind_csv_id (string):           Google Drive identification for the
                                    binding affinity CSV file
    bind_in (string):               Type of binding affinity (e.g., "Ki (nM)")
    mvd_in (string):                Input CSV file with descriptors and energy
                                    terms determined with MVD
    mvd_id (string):                Google Drive identification for the MVD
                                    CSV file
    mvd_out (string):               Output CSV file with MVD data and binding
                                    affinity information
    """
    # Define constructor method
    def __init__(self,bind_csv_in,bind_csv_id,bind_in,mvd_in,mvd_id):
        """Constructor method"""
        # Define attributes
        self.bind_csv_in = bind_csv_in
        self.bind_csv_id = bind_csv_id
        self.bind_in = bind_in
        self.mvd_in = mvd_in
        self.mvd_id = mvd_id
        self.mvd_out = self.mvd_in.replace(".csv","_Binding_Affinity.csv")

        # Define additional strings
        self.drive_string1 = "https://drive.usercontent.google.com/u/0/uc?id="
        self.drive_string2 = "&export=download"

    # Define read() method
    def read(self):
        """Method to read a CSV file generated with MVD."""
        # Download CSV
        msg_out = "\nDownloading "+self.mvd_in
        print(msg_out,end="...")
        mvd_url = self.drive_string1+self.mvd_id+self.drive_string2
        mvd = requests.get(mvd_url, allow_redirects=True)
        # Write the downloaded content and close the file right away
        with open("/content/"+self.mvd_in, "wb") as fo_download:
            fo_download.write(mvd.content)
        print("done!")

        # Try to open mvd_in
        try:
            fo_mvd = open("/content/"+self.mvd_in,"r")
            csv_mvd = csv.reader(fo_mvd)
        except IOError:
            msg_out = "\nIOError! I can't find file "+"/content/"+self.mvd_in
            sys.exit(msg_out)

        # Looping through csv_mvd
        for line2 in csv_mvd:
            i = 0
            # Looping through line2
            for aux in line2:
                if aux.strip() == "SMILES":
                    self.i_SMILES = i
                elif aux.strip() == "Complex":
                    self.i_Cpx = i
                elif aux.strip() == "Filename":
                    self.i_Filename = i
                elif aux.strip() == "Ligand":
                    self.i_Ligand = i
                elif aux.strip() == "Path":
                    self.i_Path = i
                elif aux.strip() == "RMSD":
                    self.i_RMSD = i
                elif aux.strip() == "SimilarityScore":
                    self.i_S = i

                # Update i
                i += 1

            # Clean header2
            # Warning!
            # This code keeps the labels used in MVD except for
            # "E-Intra (tors, ligand atoms)".
            # We replace it with "E-Intra (tors-ligand atoms)" because the
            # comma in the original label would split the CSV column.
            self.header2 = str(line2[self.i_SMILES+1:self.i_Cpx])+","
            self.header2 += str(line2[self.i_Cpx+1:self.i_Filename])+","
            self.header2 += str(line2[self.i_Filename+1:self.i_Ligand])+","
            self.header2 += str(line2[self.i_Ligand+1:self.i_Path])+","
            self.header2 += str(line2[self.i_Path+1:self.i_RMSD])+","
            self.header2 += str(line2[self.i_RMSD+1:self.i_S])+","
            self.header2 += str(line2[self.i_S+1:])
            self.header2 = self.header2.replace("[","").replace(" ","").\
            replace("]","").replace("'","").\
            replace("Cofactor(VdW)","Cofactor (VdW)").\
            replace("Cofactor(elec)","Cofactor (elec)").\
            replace("Cofactor(hbond)","Cofactor (hbond)").\
            replace("E-Inter(cofactor-ligand)","E-Inter (cofactor - ligand)").\
            replace("E-Inter(protein-ligand)","E-Inter (protein - ligand)").\
            replace("E-Inter(water-ligand)","E-Inter (water - ligand)").\
            replace("E-Intertotal","E-Inter total").\
            replace("E-Intra(clash)","E-Intra (clash)").\
            replace("E-Intra(elec)","E-Intra (elec)").\
            replace("E-Intra(hbond)","E-Intra (hbond)").\
            replace("E-Intra(sp2-sp2)","E-Intra (sp2-sp2)").\
            replace("E-Intra(steric)","E-Intra (steric)").\
            replace("E-Intra(tors)","E-Intra (tors)").\
            replace("E-Intra(tors,ligandatoms)","E-Intra (tors-ligand atoms)").\
            replace("E-Intra(vdw)","E-Intra (vdw)").\
            replace("E-SoftConstraintPenalty","E-Soft Constraint Penalty").\
            replace("VdW(LJ12-6)","VdW (LJ12-6)")

            break

        # Close fo_mvd to re-open it into the next loop
        fo_mvd.close()

        # Download CSV
        msg_out = "\nDownloading "+self.bind_csv_in
        print(msg_out,end="...")
        bind_url = self.drive_string1+self.bind_csv_id+self.drive_string2
        bind = requests.get(bind_url, allow_redirects=True)
        # Write the downloaded content and close the file right away
        with open("/content/"+self.bind_csv_in, "wb") as fo_download:
            fo_download.write(bind.content)
        print("done!")

        # Read a CSV file (binding affinity)
        # Column 0 holds the ligand identifiers and column 1 the affinities
        self.affinity_data = pd.read_csv("/content/"+self.bind_csv_in,
                                         delimiter=",")
        self.exp_ligs = self.affinity_data.iloc[:,0]
        self.exp_bind = self.affinity_data.iloc[:,1]

    # Define merge() method
    def merge(self):
        """Method to merge BindingDB and MVD data. It adds binding affinity data
        to an MVD result file."""
        # Merge data
        msg_out = "\n\nMerging data"
        print(msg_out,end =  "...")

        # New header
        self.data_out = "BindingDB Reactant_set_id,p"
        self.data_out += self.bind_in.replace(" (nM)",",")+self.header2+"\n"

        # Looping through self.exp_ligs
        self.count_instances = 0
        for i,lig in enumerate(self.exp_ligs):
            # Open mvd_in
            fo_mvd = open("/content/"+self.mvd_in,"r")
            csv_mvd = csv.reader(fo_mvd)

            # Update self.count_instances
            self.count_instances += 1

            # Looping through csv_mvd
            for line2 in csv_mvd:
                if str(lig) in str(line2):
                    # Clean line_out2
                    line_out2 = str(line2[self.i_SMILES+1:self.i_Cpx])+","
                    line_out2 += str(line2[self.i_Cpx+1:self.i_Filename])+","
                    line_out2 += str(line2[self.i_Filename+1:self.i_Ligand])+","
                    line_out2 += str(line2[self.i_Ligand+1:self.i_Path])+","
                    line_out2 += str(line2[self.i_Path+1:self.i_RMSD])+","
                    line_out2 += str(line2[self.i_RMSD+1:self.i_S])+","
                    line_out2 += str(line2[self.i_S+1:])
                    line_out2 = line_out2.replace("[","").replace("]","").\
                                                replace("'","").replace(" ","")

                    # Set up line_out1 with binding affinity data
                    # p(affinity) = -log10(value in M) = 9 - log10(value in nM)
                    line_out1 = str(lig)+","
                    line_out1 += str(9-np.log10(float(self.exp_bind[i])))

                    # Add new line
                    self.data_out += line_out1+","+line_out2+"\n"
                    break

            # Close fo_mvd whether or not a match was found
            fo_mvd.close()

        print("done!")

    # Define write() method
    def write(self):
        """Method to write merged BindingDB and MVD data."""
        # Open a new file and write self.data_out
        fo_new = open("/content/"+self.mvd_out,"w")
        fo_new.write(self.data_out)

        # Close fo_new
        fo_new.close()

        # Show message
        msg_out = "\nNumber of instances written to file "+self.mvd_out+" : "
        msg_out += str(self.count_instances)
        print(msg_out)

    # Define summarize() method
    def summarize(self):
        """Method to write a summary of the data."""
        # Show summary
        summary = "\n\n"+59*"#"
        summary += "\n"+"#"+24*" "+" SUMMARY "+24*" "+"#"
        summary += "\n"+59*"#"
        summary += "\nSource of binding affinity: "+self.bind_csv_in
        summary += "\nInput CSV file: "+self.mvd_in
        summary += "\nOutput CSV file: "+self.mvd_out+"\n"
        summary += "Type of binding affinity: "
        summary += self.bind_in.replace(" (nM)","")+"\n"
        summary += "Number of ligands written to output CSV file: "
        summary += str(self.count_instances)+"\n"
        summary += 59*"#"
        summary += "\n"+"#"+11*" "+" MOLEGRO VIRTUAL DOCKER REFERENCES "
        summary += 11*" "+"#\n"
        summary += 59*"#"
        summary += "\n# DOI:10.1021/jm051197e"+35*" "+"#"
        summary += "\n# DOI:10.1007/978-1-4939-9752-7_10"+24*" "+"#"
        summary += "\n"+59*"#"
        print(summary)

################################################################################
# Define main function                                                         #
################################################################################
def main():
    # Define inputs
    ############################################################################
    # Cyclin-A2/Cyclin-dependent kinase 2 [ 164 ]
    ############################################################################
    # MVD-related files
    mvd_in = "CDK2-CyclinA2_Ki_Plants.csv"
    mvd_id = "1gR_2LqjpkH6NcypeuntHKTENEky_r_t1"    # Drive id for a CSV file
    bind_csv_in = "CDK2-Cyclin_A2_Ki_4Binding.csv"  # Binding-affinity file
    bind_csv_id = "13URHyV6445rNcZ8peG5vEr04w_yj7ouu"  # Drive id for a CSV file
    bind_in = "Ki (nM)"                             # Binding affinity



    # Instantiate an object of MVD() class
    m1 = MVD(bind_csv_in,bind_csv_id,bind_in,mvd_in,mvd_id)

    # Invoke read() method
    m1.read()

    # Invoke merge() method
    m1.merge()

    # Invoke write() method
    m1.write()

    # Invoke summarize() method
    m1.summarize()

################################################################################
# Call main() function                                                         #
################################################################################
main()
################################################################################


Downloading CDK2-CyclinA2_Ki_Plants.csv...done!

Downloading CDK2-Cyclin_A2_Ki_4Binding.csv...done!


Merging data...done!

Number of instances written to file CDK2-CyclinA2_Ki_Plants_Binding_Affinity.csv : 149


###########################################################
#                         SUMMARY                         #
###########################################################
Source of binding affinity: CDK2-Cyclin_A2_Ki_4Binding.csv
Input CSV file: CDK2-CyclinA2_Ki_Plants.csv
Output CSV file: CDK2-CyclinA2_Ki_Plants_Binding_Affinity.csv
Type of binding affinity: Ki
Number of ligands written to output CSV file: 149
###########################################################
#            MOLEGRO VIRTUAL DOCKER REFERENCES            #
###########################################################
# DOI:10.1021/jm051197e                                   #
# DOI:10.1007/978-1-4939-9752-7_10                        #
###########################################################
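
As a follow-up to the run above, here is a minimal sketch of how the merged file might be loaded and split into a target and features before building a regression model. It assumes the column layout produced by the header in `merge()`: identifier first, binding affinity second, MVD terms after that. The in-memory table, its values, and the feature names `MolDock Score` and `HBond` are hypothetical; in practice the path would point to the output file, e.g. `/content/CDK2-CyclinA2_Ki_Plants_Binding_Affinity.csv`.

```python
import io

import pandas as pd

# Hypothetical excerpt of a merged output file (all values are made up).
merged_csv = io.StringIO(
    "BindingDB Reactant_set_id,pKi,MolDock Score,HBond\n"
    "1001,8.0,-120.5,-4.2\n"
    "1002,6.0,-98.1,-1.0\n"
    "1003,9.3,-131.7,-6.5\n"
)
data = pd.read_csv(merged_csv)

# Column 0: ligand identifier, column 1: target, remaining columns: features.
ids = data.iloc[:, 0]
y = data.iloc[:, 1]
X = data.iloc[:, 2:]

# Basic sanity checks before modeling: no missing values, no repeated ligands.
has_missing = bool(data.isna().any().any())
has_duplicates = bool(ids.duplicated().any())

print("Features:", list(X.columns))
print("Instances:", len(y))
print("Missing values:", has_missing, "| Duplicate ligands:", has_duplicates)
```

Checking for missing values and repeated identifiers at this stage is a cheap way to catch ligands that were matched more than once or rows that were truncated during the merge.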
