# **prepare_MVD: Prepare Data from Molegro Virtual Docker (MVD)**

This Jupyter Notebook reads a CSV (comma-separated values) file with energy terms and ligand descriptors determined using Molegro Virtual Docker (MVD) ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e); [Bitencourt-Ferreira & de Azevedo, 2019](https://doi.org/10.1007/978-1-4939-9752-7_10)) and adds binding affinity data. It needs two input CSV files: one with binding affinity information filtered from BindingDB ([Liu et al., 2007](https://doi.org/10.1093/nar/gkl999); [Liu et al., 2025](https://doi.org/10.1093/nar/gkae1075)) with [prepare_BindingDB](https://colab.research.google.com/drive/1DNUjJED4zMskHoIgJGJuY-GdRsd7CYeM?usp=sharing), and another with ligand data generated with MVD (e.g., during a docking screen with ligands for which binding affinity data is available). All ligands in the CSV file generated with MVD should be present in the input file obtained from BindingDB (filtered with [prepare_BindingDB](https://colab.research.google.com/drive/1DNUjJED4zMskHoIgJGJuY-GdRsd7CYeM?usp=sharing)). The prepare_MVD code merges the two CSV files and outputs a CSV file with ligand information and binding affinity (e.g., pK<sub>i</sub>) for the structures in the dataset. It also outputs randomized training and test sets. These files can be used to build regression models with the Jupyter Notebook [SKReg4Model (Scikit-Learn Regressors for Modeling)](https://colab.research.google.com/drive/13khGiZAgJeexwNjNDi1fSluQfcRyBCDg) or the regression methods available in the Molegro Data Modeller (MDM) ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e); [Bitencourt-Ferreira & de Azevedo, 2019](https://doi.org/10.1007/978-1-4939-9752-7_10)).
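
The idea of adding binding affinity to docking results can be sketched on two small in-memory tables. This is a minimal illustration under stated assumptions, not the notebook's implementation: the column names `Ligand`, `Ki (nM)`, and `MolDock Score`, the helper `ki_nm_to_pki`, the ligand codes, and all values are hypothetical placeholders, and the conversion assumes the affinity is reported in nM.

```python
import csv
import io
import math

# Hypothetical stand-ins for the two input CSV files (names and values are
# illustrative assumptions, not real BindingDB or MVD data)
bind_csv = io.StringIO("Ligand,Ki (nM)\n1001,10.0\n1002,250.0\n1003,1000.0\n")
mvd_csv = io.StringIO("Ligand,MolDock Score\n1001,-120.5\n1003,-98.2\n")


def ki_nm_to_pki(ki_nm):
    """Convert Ki in nM to pKi = -log10(Ki in mol/L)."""
    return -math.log10(float(ki_nm) * 1e-9)


# Index the affinity table by ligand code for exact-match lookups
pki_by_ligand = {row["Ligand"]: ki_nm_to_pki(row["Ki (nM)"])
                 for row in csv.DictReader(bind_csv)}

# Keep only docked ligands that also have affinity data
merged = []
for row in csv.DictReader(mvd_csv):
    code = row["Ligand"]
    if code in pki_by_ligand:
        merged.append({"Ligand": code,
                       "pKi": round(pki_by_ligand[code], 3),
                       "MolDock Score": float(row["MolDock Score"])})

for record in merged:
    print(record)
```

Matching on the exact ligand code through a dictionary keeps each lookup unambiguous; ligand 1002 is dropped here because it has no docking row.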
<br> </br>
<img src="https://drive.usercontent.google.com/download?id=1KKq50wqwA3InD0ovx0EiG9CLM25BYSYq&export=view&authuser=0" width=600 alt="prepare_MVD">
<br><i>Schematic flowchart for prepare_MVD. It adds binding affinity data to a CSV file generated with MVD ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e); [Bitencourt-Ferreira & de Azevedo, 2019](https://doi.org/10.1007/978-1-4939-9752-7_10)).</i></br>
<br></br>
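
The randomized split into training and test sets can be sketched as follows. This sketch draws unique test-set indices with `random.Random.sample`, a swapped-in technique rather than the notebook's own loop; the function name `split_codes` and the ligand codes are hypothetical, while the test fraction and seed mirror the values used as inputs in the notebook.

```python
import random


def split_codes(codes, test_size, seed):
    """Return (training, test) lists drawn reproducibly from codes."""
    rng = random.Random(seed)               # Private, seeded generator
    n_test = int(test_size * len(codes))    # Number of test-set ligands
    # Draw n_test unique positions, each in the range 0..len(codes)-1
    test_idx = set(rng.sample(range(len(codes)), n_test))
    train = [c for i, c in enumerate(codes) if i not in test_idx]
    test = [c for i, c in enumerate(codes) if i in test_idx]
    return train, test


# Hypothetical ligand codes
codes = [str(1000 + i) for i in range(10)]
train, test = split_codes(codes, test_size=0.2, seed=271828)
print("Training:", train)
print("Test:", test)
```

Sampling without replacement guarantees that the test set has exactly the requested number of ligands and that every drawn index refers to an existing row.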
**References**
<br></br>
Bitencourt-Ferreira G, de Azevedo WF Jr. Molegro Virtual Docker for Docking. Methods Mol Biol. 2019;2053:149-167. PMID: 31452104. [DOI: 10.1007/978-1-4939-9752-7_10](https://doi.org/10.1007/978-1-4939-9752-7_10) [PubMed](https://pubmed.ncbi.nlm.nih.gov/31452104/)
<br></br>
De Azevedo WF Jr, Quiroga R, Villarreal MA, da Silveira NJF, Bitencourt-Ferreira G, da Silva AD, Veit-Acosta M, Oliveira PR, Tutone M, Biziukova N, Poroikov V, Tarasova O, Baud S. SAnDReS 2.0: Development of machine-learning models to explore the scoring function space. J Comput Chem. 2024; 45(27): 2333-2346. PMID: 38900052. [DOI: 10.1002/jcc.27449](https://doi.org/10.1002/jcc.27449) [PubMed](https://pubmed.ncbi.nlm.nih.gov/38900052/)
<br></br>
Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007; 35(Database issue): D198-201. PMID: 17145705. [DOI: 10.1093/nar/gkl999](https://doi.org/10.1093/nar/gkl999) [PubMed](https://pubmed.ncbi.nlm.nih.gov/17145705/)
<br></br>
Liu T, Hwang L, Burley SK, Nitsche CI, Southan C, Walters WP, Gilson MK. BindingDB in 2024: a FAIR knowledgebase of protein-small molecule binding data. Nucleic Acids Res. 2025; 53(D1): D1633-D1644. PMID: 39574417. [DOI: 10.1093/nar/gkae1075](https://doi.org/10.1093/nar/gkae1075) [PubMed](https://pubmed.ncbi.nlm.nih.gov/39574417/)
<br></br>
Thomsen R, Christensen MH. MolDock: a new technique for high-accuracy molecular docking. J Med Chem. 2006; 49(11): 3315-21. PMID: 16722650. [DOI: 10.1021/jm051197e](https://doi.org/10.1021/jm051197e) [PubMed](https://pubmed.ncbi.nlm.nih.gov/16722650/)
<br></br>

The complete Python code follows.

In [1]:
#!/usr/bin/env python3
#
################################################################################
# Dr. Walter F. de Azevedo, Jr.                                                #
# [Scopus](https://www.scopus.com/authid/detail.uri?authorId=7006435557)       #
# [GitHub](https://github.com/azevedolab)                                      #
# January 12, 2025                                                             #
################################################################################
#
# Import section
import csv, random, requests, sys, warnings
import pandas as pd
import numpy as np

# Ignore warnings
warnings.filterwarnings("ignore")

################################################################################
# Define MVD() class                                                           #
################################################################################
class MVD(object):
    """Class to create a CSV file with MVD docking results and BindingDB
    affinity data.

    It has the following attributes:
    bind_csv_in (string):           Input CSV file
    bind_csv_id (string):           Google drive identification for a CSV file
    bind_in (string):               Type of binding affinity
    mvd_in (string):                Input CSV file with descriptors and energy
                                    terms determined with MVD
    mvd_id (string):                Google drive identification for a CSV file
    test_size_in (float):           Fraction of the dataset used as test set
    seed_in (int):                  Random seed
    mvd_out (string):               Output CSV file with MVD data and binding
                                    affinity information
        """
    # Define constructor method
    def __init__(self,bind_csv_in,bind_csv_id,bind_in,mvd_in,mvd_id,
                                                        test_size_in,seed_in):
        """Constructor method"""
        # Define attributes
        self.bind_csv_in = bind_csv_in
        self.bind_csv_id = bind_csv_id
        self.bind_in = bind_in
        self.mvd_in = mvd_in
        self.mvd_id = mvd_id
        self.test_size_in = test_size_in
        self.seed_in = seed_in

        # Define empty lists and additional strings
        self.lig_code_test = []
        self.lig_code_train = []
        self.mvd_out = self.mvd_in.replace(".csv","_Binding_Affinity.csv")
        self.drive_string1 = "https://drive.usercontent.google.com/u/0/uc?id="
        self.drive_string2 = "&export=download"

    # Define read() method
    def read(self):
        """Method to read a CSV file generated with MVD."""
        # Download CSV
        msg_out = "\nDownloading "+self.mvd_in
        print(msg_out,end="...")
        mvd_url = self.drive_string1+self.mvd_id+self.drive_string2
        mvd = requests.get(mvd_url, allow_redirects=True)
        with open("/content/"+self.mvd_in, "wb") as fo_download:
            fo_download.write(mvd.content)
        print("done!")

        # Try to open mvd_in
        try:
            fo_mvd = open("/content/"+self.mvd_in,"r")
            csv_mvd = csv.reader(fo_mvd)
        except IOError:
            msg_out = "\nIOError! I can't find file "+"/content/"+self.mvd_in
            sys.exit(msg_out)

        # Looping through csv_mvd
        for line2 in csv_mvd:
            i = 0
            # Looping through line2
            for aux in line2:
                if aux.strip() == "SMILES":
                    self.i_SMILES = i
                elif aux.strip() == "Complex":
                    self.i_Cpx = i
                elif aux.strip() == "Filename":
                    self.i_Filename = i
                elif aux.strip() == "Ligand":
                    self.i_Ligand = i
                elif aux.strip() == "Path":
                    self.i_Path = i
                elif aux.strip() == "RMSD":
                    self.i_RMSD = i
                elif aux.strip() == "SimilarityScore":
                    self.i_S = i

                # Update i
                i += 1

            # Clean header2
            # Warning!
            # This code keeps the labels used in MVD except for
            # "E-Intra (tors, ligand atoms)".
            # We replace it with "E-Intra (tors-ligand atoms)".
            self.header2 = str(line2[self.i_SMILES+1:self.i_Cpx])+","
            self.header2 += str(line2[self.i_Cpx+1:self.i_Filename])+","
            self.header2 += str(line2[self.i_Filename+1:self.i_Ligand])+","
            self.header2 += str(line2[self.i_Ligand+1:self.i_Path])+","
            self.header2 += str(line2[self.i_Path+1:self.i_RMSD])+","
            self.header2 += str(line2[self.i_RMSD+1:self.i_S])+","
            self.header2 += str(line2[self.i_S+1:])
            self.header2 = self.header2.replace("[","").replace(" ","").\
            replace("]","").replace("'","").\
            replace("Cofactor(VdW)","Cofactor (VdW)").\
            replace("Cofactor(elec)","Cofactor (elec)").\
            replace("Cofactor(hbond)","Cofactor (hbond)").\
            replace("E-Inter(cofactor-ligand)","E-Inter (cofactor - ligand)").\
            replace("E-Inter(protein-ligand)","E-Inter (protein - ligand)").\
            replace("E-Inter(water-ligand)","E-Inter (water - ligand)").\
            replace("E-Intertotal","E-Inter total").\
            replace("E-Intra(clash)","E-Intra (clash)").\
            replace("E-Intra(elec)","E-Intra (elec)").\
            replace("E-Intra(hbond)","E-Intra (hbond)").\
            replace("E-Intra(sp2-sp2)","E-Intra (sp2-sp2)").\
            replace("E-Intra(steric)","E-Intra (steric)").\
            replace("E-Intra(tors)","E-Intra (tors)").\
            replace("E-Intra(tors,ligandatoms)","E-Intra (tors-ligand atoms)").\
            replace("E-Intra(vdw)","E-Intra (vdw)").\
            replace("E-SoftConstraintPenalty","E-Soft Constraint Penalty").\
            replace("VdW(LJ12-6)","VdW (LJ12-6)")

            break

        # Close fo_mvd to re-open it into the next loop
        fo_mvd.close()

        # Download CSV
        msg_out = "\nDownloading "+self.bind_csv_in
        print(msg_out,end="...")
        bind_url = self.drive_string1+self.bind_csv_id+self.drive_string2
        bind = requests.get(bind_url, allow_redirects=True)
        with open("/content/"+self.bind_csv_in, "wb") as fo_download:
            fo_download.write(bind.content)
        print("done!")

        # Read a CSV file (binding affinity)
        self.affinity_data = pd.read_csv("/content/"+self.bind_csv_in,
                                                                delimiter=",")
        self.exp_ligs = self.affinity_data.iloc[:,0]
        self.exp_bind = self.affinity_data.iloc[:,1]

    # Define merge() method
    def merge(self):
        """Method to merge BindingDB and MVD data. It adds binding affinity data
        to an MVD result file."""
        # Merge data
        msg_out = "\n\nMerging data"
        print(msg_out,end =  "...")

        # New header
        self.data_out = "BindingDB Reactant_set_id,p"
        self.data_out += self.bind_in.replace(" (nM)",",")+self.header2+"\n"

        # Looping through self.exp_ligs
        self.count_instances = 0
        for i,lig in enumerate(self.exp_ligs):
            # Open mvd_in
            fo_mvd = open("/content/"+self.mvd_in,"r")
            csv_mvd = csv.reader(fo_mvd)

            # Looping through csv_mvd
            for line2 in csv_mvd:
                if str(lig) in str(line2):
                    # Clean line_out2
                    line_out2 = str(line2[self.i_SMILES+1:self.i_Cpx])+","
                    line_out2 += str(line2[self.i_Cpx+1:self.i_Filename])+","
                    line_out2 += str(line2[self.i_Filename+1:self.i_Ligand])+","
                    line_out2 += str(line2[self.i_Ligand+1:self.i_Path])+","
                    line_out2 += str(line2[self.i_Path+1:self.i_RMSD])+","
                    line_out2 += str(line2[self.i_RMSD+1:self.i_S])+","
                    line_out2 += str(line2[self.i_S+1:])
                    line_out2 = line_out2.replace("[","").replace("]","").\
                                                replace("'","").replace(" ","")

                    # Set up line_out1 with binding affinity data
                    line_out1 = str(lig)+","
                    line_out1 += str(-np.log10(float(self.exp_bind[i])*(1e-9)))

                    # Add new line and count it as a written instance
                    self.data_out += line_out1+","+line_out2+"\n"
                    self.count_instances += 1
                    break

            # Close fo_mvd, even when no matching row was found
            fo_mvd.close()

        print("done!")

    # Define write() method
    def write(self):
        """Method to write merged BindingDB and MVD data."""
        # Open a new file and write self.data_out
        fo_new = open("/content/"+self.mvd_out,"w")
        fo_new.write(self.data_out)

        # Close fo_new
        fo_new.close()

        # Show message
        msg_out = "\nNumber of instances written to file "+self.mvd_out+" : "
        msg_out += str(self.count_instances)
        print(msg_out)

    # Define randomize() method to automatically determine the structures
    # for training and test sets
    def randomize(self):
        """Method to generate the codes for training and test sets"""
        # Try to open self.mvd_out
        try:
            file2open = "/content/"+self.mvd_out
            data = np.genfromtxt(file2open,skip_header=1,delimiter=",")
            rows = data[:,8]
            n_rows = len(rows)
            fo_data = open(file2open,"r")
            csv_data = csv.reader(fo_data)

            # Set up seed for pseudo-random number generator
            random.seed(a=self.seed_in, version=2)

            # Set up empty list
            test_rows = []

            # Assign zero to i
            i = 0

            # Get unique integers
            while i < int(self.test_size_in*n_rows):
                # Generate a pseudo-random row index (randint() includes both
                # end points, so the upper bound is n_rows-1)
                n = random.randint(0,n_rows-1)

                # Check if n is in the list
                if n not in test_rows:

                    # Append number
                    test_rows.append(n)

                    # Update i
                    i += 1

            # Assign zero to i_line
            i_line = 0

            # Looping through csv_data to jump first line
            for line in csv_data:
                break

            # Looping through csv_data
            for line in csv_data:
                # Split
                if i_line in test_rows:
                    self.lig_code_test.append(line[0])
                else:
                    self.lig_code_train.append(line[0])

                # Update
                i_line += 1

            # Close file (dataset)
            fo_data.close()

            # Get number of instances
            count_test = len(self.lig_code_test)
            count_train = len(self.lig_code_train)

            # Show summary
            summary = "\nTraining set ligands:\n"
            summary += str(self.lig_code_train).replace("[","").replace("]","").\
                                            replace("'","").replace(" ","")
            summary += "\n\nTest set ligands:\n"
            summary += str(self.lig_code_test).replace("[","").replace("]","").\
                                            replace("'","").replace(" ","")
            print(summary)

            # Invoke write_codes() method
            self.write_codes()

        # Handle IOError
        except IOError:
            msg_out = "\nI can't find "+file2open+" file!"
            print(msg_out)
            return

    # Define generate() method to split a dataset
    def generate(self):
        """Method to split dataset in training and test sets"""
        # Call read_codes() method
        self.read_codes()

        # Try to open self.mvd_out
        try:
            file2open = "/content/"+self.mvd_out
            fo_data = open(file2open,"r")
            csv_data = csv.reader(fo_data)

            # Set up empty lists
            test_set = []
            training_set = []

            # Looping through csv_data to get header
            for line in csv_data:

                # Some editing 1
                line_out = str(line)
                line_out = line_out.replace("[","").replace("]","").\
                        replace("'","").replace(" ,",",").replace(", ",",")
                header = str(line_out)
                break

            # Looping through csv_data
            for line in csv_data:

                # Some editing 2
                line_out = str(line)
                line_out = line_out.replace("[","").replace("]","").\
                                            replace("'","").replace(" ","")

                # Split
                if line[0] in self.lig_code_test:
                    test_set.append(line_out)
                elif line[0] in self.lig_code_train:
                    training_set.append(line_out)
                else:
                    print("\nStructure "+str(line[0])+" not in the datasets!")

            # Close file (dataset)
            fo_data.close()

            # Open new file (training set)
            file2create1 = "/content/"+self.mvd_out.replace(".csv","")
            file2create1 += "_Training_Set.csv"
            f_train = open(file2create1,"w")

            # Write header for training set
            f_train.write(header+"\n")

            # Assign zero to count_train and count_test
            count_train = 0
            count_test = 0

            # Looping through training set
            for line in training_set:
                f_train.write(line+"\n")
                count_train += 1

            # Close file (training set)
            f_train.close()

            # Open new file (test set)
            file2create2 = "/content/"+self.mvd_out.replace(".csv","")
            file2create2 += "_Test_Set.csv"
            f_test = open(file2create2,"w")

            # Write header for test set
            f_test.write(header+"\n")

            # Looping through test set
            for line in test_set:
                f_test.write(line+"\n")
                count_test += 1

            # Close file (test set)
            f_test.close()

            # Show summary
            summary = "\nTest set file: "+file2create2
            summary += "\nTraining set file: "+file2create1
            summary+="\nNumber of structures in training set: "+str(count_train)
            summary += "\nNumber of structures in test set: "+str(count_test)
            print(summary)

        # Handle IOError
        except IOError:
            msg_out = "\nI can't find "+file2open+" file!"
            print(msg_out)
            return

    # Define read_codes() method
    def read_codes(self):
        """Method to read codes in the codes_training_set.csv
        and codes_test_set.csv files"""
        # Try to open codes_training_set.csv and codes_test_set.csv
        try:
            file2open1 = "/content/"+"codes_training_set.csv"
            fo_train = open(file2open1,"r")
            csv_train = csv.reader(fo_train)
            file2open2 = "/content/"+"codes_test_set.csv"
            fo_test = open(file2open2,"r")
            csv_test = csv.reader(fo_test)

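            # Reset the lists so codes already stored by randomize() are not
            # duplicated
            self.lig_code_train = []
            self.lig_code_test = []

            # Looping through csv_train
            lig_code = ""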
            for line1 in csv_train:

                # Some editing
                aux_line = str(line1).replace("'","").replace(" ","").\
                                                replace("[","").replace("]","")

                # Get codes
                for char1 in aux_line:
                    if char1 != ",":
                        lig_code += char1
                    else:
                        self.lig_code_train.append(lig_code)
                        lig_code = ""

            # Add last code
            self.lig_code_train.append(lig_code)

            # Looping through csv_test
            lig_code = ""
            for line2 in csv_test:
                # Some editing
                aux_line = str(line2)
                aux_line = aux_line.replace("'","").replace(" ","").\
                                                replace("[","").replace("]","")

                # Get codes
                for char2 in aux_line:
                    if char2 != ",":
                        lig_code += char2
                    else:
                        self.lig_code_test.append(lig_code)
                        lig_code = ""

            # Add last code
            self.lig_code_test.append(lig_code)

            # Close files
            fo_train.close()
            fo_test.close()

        # Handle IOError
        except IOError:
            print("\nIOError! I can't find CSV file!")

    # Define write_codes()
    def write_codes(self):
        """Method to write codes for structures in the training
        and test sets"""
        # Open new file for training set
        file2create_training = "/content/"+"codes_training_set.csv"
        fo_training = open(file2create_training,"w")

        # Write codes for training set
        # Write first code
        fo_training.write(str(self.lig_code_train[0]))

        # Looping through self.lig_code_train to write the remaining codes
        for lig_code in self.lig_code_train[1:]:
            fo_training.write(","+str(lig_code))

        # Close file
        fo_training.close()

        # Open new file for test set
        file2create_test = "/content/"+"codes_test_set.csv"
        fo_test = open(file2create_test,"w")

        # Write codes for test set
        # Write first code
        fo_test.write(str(self.lig_code_test[0]))

        # Looping through self.lig_code_test to write the remaining codes
        for lig_code in self.lig_code_test[1:]:
            fo_test.write(","+str(lig_code))

        # Close file
        fo_test.close()

    # Define summarize() method
    def summarize(self):
        """Method to write a summary of the data."""
        # Show summary
        summary = "\n\n"+59*"#"
        summary += "\n"+"#"+24*" "+" SUMMARY "+24*" "+"#"
        summary += "\n"+59*"#"
        summary += "\nSource of binding affinity: "+self.bind_csv_in
        summary += "\nInput CSV file: "+self.mvd_in
        summary += "\nOutput CSV file: "+self.mvd_out+"\n"
        summary += "Type of binding affinity: "
        summary += self.bind_in.replace(" (nM)","")+"\n"
        summary += "Number of ligands written to output CSV file: "
        summary += str(self.count_instances)+"\n"
        summary += 59*"#"
        summary += "\n"+"#"+11*" "+" MOLEGRO VIRTUAL DOCKER REFERENCES "
        summary += 11*" "+"#\n"
        summary += 59*"#"
        summary += "\n# DOI:10.1021/jm051197e"+35*" "+"#"
        summary += "\n# DOI:10.1007/978-1-4939-9752-7_10"+24*" "+"#"
        summary += "\n"+59*"#"
        print(summary)

################################################################################
# Define main function                                                         #
################################################################################
def main():
    # Define inputs
    ############################################################################
    # Cyclin-A2/Cyclin-dependent kinase 2 [ 164 ]
    ############################################################################
    mvd_in = "CDK2-CyclinA2_Ki_Plants.csv"          # CSV file
    mvd_id = "1gR_2LqjpkH6NcypeuntHKTENEky_r_t1"    # Drive id for a CSV file
    bind_csv_in = "CDK2-Cyclin_A2_Ki_4Binding.csv"  # Binding-affinity file
    bind_csv_id = "13URHyV6445rNcZ8peG5vEr04w_yj7ouu"  # Drive id for a CSV file
    bind_in = "Ki (nM)"                             # Binding affinity
    test_size_in = 0.2                              # Test set size
    seed_in = 271828                                # Random seed

    # Instantiate an object of MVD() class
    m1 = MVD(bind_csv_in,bind_csv_id,bind_in,mvd_in,mvd_id,test_size_in,seed_in)

    # Invoke read() method
    m1.read()

    # Invoke merge() method
    m1.merge()

    # Invoke write() method
    m1.write()

    # Invoke randomize() method
    m1.randomize()

    # Invoke generate() method
    m1.generate()

    # Invoke summarize() method
    m1.summarize()

################################################################################
# Call main() function                                                         #
################################################################################
main()
################################################################################


Downloading CDK2-CyclinA2_Ki_Plants.csv...done!

Downloading CDK2-Cyclin_A2_Ki_4Binding.csv...done!


Merging data...done!

Number of instances written to file CDK2-CyclinA2_Ki_Plants_Binding_Affinity.csv : 149

Training set ligands:
2579,8336,17140,50443449,50235342,241331,50279733,50425005,50570309,50501599,35641,50570317,50279763,50443450,50501588,50501587,50570295,50570310,50567655,17054,50567666,35654,50443451,241208,50113281,50570314,50443448,50570320,35645,241335,50443452,8054,35656,50443454,412656,50501606,241213,50501581,50501584,50570294,50570308,50501589,50443447,50464040,50567665,50443444,35664,35643,50279744,35655,50501585,50443443,35663,50279724,50501605,50464037,50501608,50425002,50501590,50501612,241226,50443458,35642,241234,50501593,50570321,35661,50501579,241235,50570313,35650,81436,35658,50443456,50501596,50501611,50501591,241223,50570296,8338,412614,50501604,50443445,50570307,241338,50570303,8339,241216,412658,50570300,50501597,35660,50570312,50570299,241333,502797