# **MVD4ML: Molegro Virtual Docker (MVD) for Machine Learning Modeling**

This Jupyter Notebook reads a CSV (comma-separated values) file with energy terms and ligand descriptors determined using Molegro Virtual Docker (MVD) ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e); [Bitencourt-Ferreira & de Azevedo, 2019](https://doi.org/10.1007/978-1-4939-9752-7_10)) and adds binding affinity data calculated by the program SAnDReS 2.0 ([de Azevedo et al., 2024](https://doi.org/10.1002/jcc.27449)). It also needs two input CSV files created with SAnDReS 2.0: one with the PDB access codes of the structures in the test set and another with ligand data. Every ligand in the CSV file generated with MVD should also appear in the ligand-data file obtained using SAnDReS 2.0 ([de Azevedo et al., 2024](https://doi.org/10.1002/jcc.27449)). MVD4ML splits the SAnDReS data into training and test sets, merges each set with the MVD file, and writes two new CSV files with ligand information: one for the structures in the training set and one for those in the test set. These files can be used to build regression models with the Jupyter Notebook [SKReg4Model (Scikit-Learn Regressors for Modeling)](https://colab.research.google.com/drive/13khGiZAgJeexwNjNDi1fSluQfcRyBCDg).
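As a minimal sketch of the merge idea described above, the snippet below joins two small CSV tables on an exact match of a ligand identifier. The column names (`Ligand`, `pKi`, `MolDock Score`, `HBond`), the sample values, and the helper `merge_on_ligand()` are assumptions made for illustration; they are not taken from the actual SAnDReS or MVD files.

```python
import csv
import io

# Hypothetical miniature inputs; real SAnDReS and MVD files have many more columns
san_csv = "PDB,Ligand,pKi\n1ABC,LIG1,7.2\n2XYZ,LIG12,5.9\n"
mvd_csv = "Ligand,MolDock Score,HBond\nLIG12,-101.5,-3.2\nLIG1,-88.0,-1.1\n"

def merge_on_ligand(san_text, mvd_text, key="Ligand"):
    """Join two CSV tables on an exact match of the `key` column."""
    # Index the MVD rows by ligand identifier for direct lookup
    mvd_rows = {row[key].strip(): row
                for row in csv.DictReader(io.StringIO(mvd_text))}
    merged = []
    for san_row in csv.DictReader(io.StringIO(san_text)):
        mvd_row = mvd_rows.get(san_row[key].strip())
        if mvd_row is None:
            continue  # Ligand absent from the MVD table: skip it
        # Affinity columns first, then the MVD terms (key column kept once)
        extra = {k: v for k, v in mvd_row.items() if k != key}
        merged.append({**san_row, **extra})
    return merged

merged = merge_on_ligand(san_csv, mvd_csv)

# csv.DictWriter quotes fields when needed, so commas or spaces stay intact
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(merged[0].keys()))
writer.writeheader()
writer.writerows(merged)
print(out.getvalue())
```

An exact key lookup avoids pairing `LIG1` with a row that merely contains `LIG12`, which a substring test could do.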
<br> </br>
<img src="https://drive.usercontent.google.com/download?id=1P9cUrTTjl5wAj-Q_jirIQ8opRoQ5c9ja&export=view&authuser=0" width=600 alt="PDB: 2A4L">
<br><i>Structure of a protein-ligand complex ([de Azevedo et al., 1997](https://doi.org/10.1111/j.1432-1033.1997.0518a.x)) with an inhibitor bound to the macromolecule (PDB access code: [2A4L](https://www.rcsb.org/structure/2A4L)). MVD ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e)) generated this figure.</i></br>
<br></br>
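The training/test split driven by a list of PDB access codes can be sketched as follows. This is only an illustration under stated assumptions: the PDB code is taken to be the first column of the ligand table, and the sample data and the helper `split_by_pdb()` are hypothetical.

```python
import csv
import io

# Hypothetical data: the first column holds the PDB access code
san_csv = "PDB,Ligand,pKi\n1ABC,LIG1,7.2\n2XYZ,LIG2,5.9\n3DEF,LIG3,6.4\n"
test_codes_csv = "2XYZ\n"

def split_by_pdb(san_text, codes_text):
    """Return (header, training_rows, test_rows) split on the PDB code column."""
    # Collect the test-set codes into a set, ignoring empty cells
    test_codes = {code.strip()
                  for row in csv.reader(io.StringIO(codes_text))
                  for code in row if code.strip()}
    reader = csv.reader(io.StringIO(san_text))
    header = next(reader)
    training, test = [], []
    for row in reader:
        # Rows whose code is listed go to the test set, all others to training
        (test if row[0].strip() in test_codes else training).append(row)
    return header, training, test

header, training, test = split_by_pdb(san_csv, test_codes_csv)
print(len(training), "training rows;", len(test), "test rows")
```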
**References**
<br></br>
Bitencourt-Ferreira G, de Azevedo WF Jr. Molegro Virtual Docker for Docking. Methods Mol Biol. 2019;2053:149-167. PMID: 31452104. [DOI: 10.1007/978-1-4939-9752-7_10](https://doi.org/10.1007/978-1-4939-9752-7_10) [PubMed](https://pubmed.ncbi.nlm.nih.gov/31452104/)
<br></br>
De Azevedo WF, Leclerc S, Meijer L, Havlicek L, Strnad M, Kim SH. Inhibition of cyclin-dependent kinases by purine analogues: crystal structure of human cdk2 complexed with roscovitine. Eur J Biochem. 1997; 243(1-2): 518-26.
PMID: 9030780.
[DOI: 10.1111/j.1432-1033.1997.0518a.x](https://doi.org/10.1111/j.1432-1033.1997.0518a.x) [PubMed](https://pubmed.ncbi.nlm.nih.gov/9030780/)
<br></br>
De Azevedo WF Jr, Quiroga R, Villarreal MA, da Silveira NJF, Bitencourt-Ferreira G, da Silva AD, Veit-Acosta M, Oliveira PR, Tutone M, Biziukova N, Poroikov V, Tarasova O, Baud S. SAnDReS 2.0: Development of machine-learning models to explore the scoring function space. J Comput Chem. 2024; 45(27): 2333-2346. PMID: 38900052. [DOI: 10.1002/jcc.27449](https://doi.org/10.1002/jcc.27449) [PubMed](https://pubmed.ncbi.nlm.nih.gov/38900052/)
<br></br>
Thomsen R, Christensen MH. MolDock: a new technique for high-accuracy molecular docking. J Med Chem. 2006; 49(11): 3315-21. [DOI: 10.1021/jm051197e](https://doi.org/10.1021/jm051197e) [PubMed](https://pubmed.ncbi.nlm.nih.gov/16722650/)
<br></br>

The complete Python code follows.

In [1]:
#!/usr/bin/env python3
#
################################################################################
# Dr. Walter F. de Azevedo, Jr.                                                #
# [Scopus](https://www.scopus.com/authid/detail.uri?authorId=7006435557)       #
# [GitHub](https://github.com/azevedolab)                                      #
# July 20, 2024                                                                #
################################################################################
#
################################################################################
# Define merge_san_mvd() function                                              #
################################################################################
def merge_san_mvd(san_in,mvd_in):
    """Function to merge a SAnDReS-generated file
    ([de Azevedo et al., 2024](https://doi.org/10.1002/jcc.27449)) with
    an MVD-generated file
    ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e)).
    It adds experimental affinity data from a SAnDReS-generated file to an
    MVD-generated file.
    <br></br>
    **References**
    <br></br>
    De Azevedo WF Jr, Quiroga R, Villarreal MA, da Silveira NJF,
    Bitencourt-Ferreira G, da Silva AD, Veit-Acosta M, Oliveira PR, Tutone M,
    Biziukova N, Poroikov V, Tarasova O, Baud S. SAnDReS 2.0: Development of
    machine-learning models to explore the scoring function space. J Comput
    Chem. 2024; 45(27): 2333-2346. PMID: 38900052.
    [DOI: 10.1002/jcc.27449](https://doi.org/10.1002/jcc.27449)
    [PubMed](https://pubmed.ncbi.nlm.nih.gov/38900052/)
    <br></br>
    Thomsen R, Christensen MH. MolDock: a new technique for high-accuracy
    molecular docking. J Med Chem. 2006; 49(11): 3315-21.
    [DOI: 10.1021/jm051197e](https://doi.org/10.1021/jm051197e)
    [PubMed](https://pubmed.ncbi.nlm.nih.gov/16722650/)
    <br></br>

    """

    ############################################################################
    # Import section
    ############################################################################
    import csv
    import sys

    ############################################################################
    # SAnDReS-generated file
    ############################################################################
    # Define columns of the SAnDReS CSV file
    ref_col_san1 = 9
    ref_col_san2 = 11
    ref_col_san3 = 13
    ref_col_san4 = 16
    ref_col_san5 = 17
    ref_col_san6 = 18
    ref_col_san7 = 24

    # Try to open san_in
    try:
        fo_san = open(san_in,"r")
        csv_san = csv.reader(fo_san)
    except IOError:
        msg_out = "\nIOError! I can't find file "+san_in
        sys.exit(msg_out)

    # Looping through csv_san
    for line1 in csv_san:
        header1 = str(line1[:ref_col_san1])+","
        header1 += str(line1[ref_col_san2:ref_col_san3])+","
        header1 += str(line1[ref_col_san4:ref_col_san5])+","
        header1 += str(line1[ref_col_san6:ref_col_san7])
        header1 = header1.replace("[","").replace("]","").replace("'","").\
                    replace(" ","").replace("AverageQ","Average Q")

        break

    ############################################################################
    # MVD-generated file
    ############################################################################
    # Define mvd_out string
    if "training" in san_in:
        mvd_out = mvd_in.replace(".csv","_Training_Set.csv")
    elif "test" in san_in:
        mvd_out = mvd_in.replace(".csv","_Test_Set.csv")
    else:
        mvd_out = mvd_in.replace(".csv","_binding.csv")

    # Try to open mvd_in
    try:
        fo_mvd = open(mvd_in,"r")
        csv_mvd = csv.reader(fo_mvd)
    except IOError:
        msg_out = "\nIOError! I can't find file "+mvd_in
        sys.exit(msg_out)

    # Looping through csv_mvd
    for line2 in csv_mvd:
        i = 0
        for aux in line2:
            if aux.strip() == "SMILES":
                i_SMILES = i
            elif aux.strip() == "Complex":
                i_Complex = i
            elif aux.strip() == "Filename":
                i_Filename = i
            elif aux.strip() == "Ligand":
                i_Ligand = i
            elif aux.strip() == "Path":
                i_Path = i
            elif aux.strip() == "RMSD":
                i_RMSD = i
            elif aux.strip() == "SimilarityScore":
                i_S = i

            i += 1

        # Clean header2
        # Warning!
        # This code keeps the labels used in MVD except for
        # "E-Intra (tors, ligand atoms)".
        # We replaced it with "E-Intra (tors-ligand atoms)".
        header2 = str(line2[i_SMILES+1:i_Complex])+","
        header2 += str(line2[i_Complex+1:i_Filename])+","
        header2 += str(line2[i_Filename+1:i_Ligand])+","
        header2 += str(line2[i_Ligand+1:i_Path])+","
        header2 += str(line2[i_Path+1:i_RMSD])+","
        header2 += str(line2[i_RMSD+1:i_S])+","
        header2 += str(line2[i_S+1:])
        header2 = header2.replace("[","").replace(" ","").\
        replace("]","").replace("'","").\
        replace("Cofactor(VdW)","Cofactor (VdW)").\
        replace("Cofactor(elec)","Cofactor (elec)").\
        replace("Cofactor(hbond)","Cofactor (hbond)").\
        replace("E-Inter(cofactor-ligand)","E-Inter (cofactor - ligand)").\
        replace("E-Inter(protein-ligand)","E-Inter (protein - ligand)").\
        replace("E-Inter(water-ligand)","E-Inter (water - ligand)").\
        replace("E-Intertotal","E-Inter total").\
        replace("E-Intra(clash)","E-Intra (clash)").\
        replace("E-Intra(elec)","E-Intra (elec)").\
        replace("E-Intra(hbond)","E-Intra (hbond)").\
        replace("E-Intra(sp2-sp2)","E-Intra (sp2-sp2)").\
        replace("E-Intra(steric)","E-Intra (steric)").\
        replace("E-Intra(tors)","E-Intra (tors)").\
        replace("E-Intra(tors,ligandatoms)","E-Intra (tors-ligand atoms)").\
        replace("E-Intra(vdw)","E-Intra (vdw)").\
        replace("E-SoftConstraintPenalty","E-Soft Constraint Penalty").\
        replace("VdW(LJ12-6)","VdW (LJ12-6)")

        break

    # Close fo_mvd so it can be re-opened in the next loop
    fo_mvd.close()

    ############################################################################
    # Merge data
    ############################################################################
    # New header
    data_out = header1+","+header2+"\n"

    # Looping through csv_san
    count_instances = 0
    for line1 in csv_san:
        line_out1 = str(line1[:ref_col_san1])+","
        line_out1 += str(line1[ref_col_san2:ref_col_san3])+","
        line_out1 += str(line1[ref_col_san4:ref_col_san5])+","
        line_out1 += str(line1[ref_col_san6:ref_col_san7])
        line_out1 = line_out1.replace("[","").replace("]","").\
                                        replace("'","").replace(" ","")

        # Open mvd_in (re-read for each SAnDReS line; closed automatically)
        with open(mvd_in,"r") as fo_mvd:
            csv_mvd = csv.reader(fo_mvd)

            # Looping through csv_mvd
            for line2 in csv_mvd:
                if line1[2].strip() in str(line2):

                    # Clean line_out2
                    line_out2 = str(line2[i_SMILES+1:i_Complex])+","
                    line_out2 += str(line2[i_Complex+1:i_Filename])+","
                    line_out2 += str(line2[i_Filename+1:i_Ligand])+","
                    line_out2 += str(line2[i_Ligand+1:i_Path])+","
                    line_out2 += str(line2[i_Path+1:i_RMSD])+","
                    line_out2 += str(line2[i_RMSD+1:i_S])+","
                    line_out2 += str(line2[i_S+1:])
                    line_out2 = line_out2.replace("[","").replace("]","").\
                                        replace("'","").replace(" ","")
                    data_out += line_out1+","+line_out2+"\n"

                    # Count only the instances actually written to data_out
                    count_instances += 1
                    break

    # Close fo_san
    fo_san.close()

    # Open a new file and write data_out
    fo_new = open(mvd_out,"w")
    fo_new.write(data_out)

    # Close fo_new
    fo_new.close()

    # Show message
    msg_out = "\nNumber of instances written to file "+mvd_out+" : "
    msg_out += str(count_instances)
    print(msg_out)

################################################################################
# Define main function                                                         #
################################################################################
def main():
    ############################################################################
    # Import section
    ############################################################################
    import sys
    import csv
    from urllib.request import urlretrieve

    ############################################################################
    # Define inputs
    ############################################################################
    #
    ############################################################################
    # CDK2 Ki
    ############################################################################
    # SAnDReS-related file
    #san_in = "./CDK2_Ki/cdk2_Ki_sandres.csv"     # File with results obtained with SAnDReS
    #san_url = "https://bit.ly/cdk2_Ki_sandres" # Url with san_in file
    # MVD-related file
    #mvd_in = "./CDK2_Ki/cdk2_Ki_mvd.csv"    	   # File with results determined using MVD
    #mvd_url = "https://bit.ly/cdk2_Ki_mvd"   # Url with mvd_in file
    # PDB-related file
    #pdb_in = "./CDK2_Ki/CDK2_Ki_PDB_Test_Set.csv"     # File with PDB codes for test set
    #pdb_url = "https://bit.ly/CDK2_Ki_PDB_Test_Set" # Url with pdb_in file

    ############################################################################
    # CDK2-Cyclin A2
    ############################################################################
    # BindingDB
    # https://www.bindingdb.org/rwd/jsp/dbsearch/PrimarySearch_ki.jsp?tag=comki&column=KI&complexid=97,50014798&energytern=kJ/mole&kiunit=nM&icunit=nM&submit=Search&target=Cyclin-A2%2FCyclin-dependent+kinase+2
    # Cyclin-A2/Cyclin-dependent kinase 2 [ 164 ]
    # BindingDB: 2024-02-15
    # Binding: Ki
    # Hits: 164
    # Targets: 1
    # Publications: 14
    # Institutions: 8
    # Affinity: 2.9 to 1.3E+4 nM
    # Xtal structures: 6
    # Docked structures: 4
    # Catalog Cmpds: 12
    #
    # SAnDReS-related file
    san_in = "cdk2_cyclin_a_ki_sandres.csv"
    san_url = "https://drive.usercontent.google.com/u/0/uc?id=16HyPn_8ZAvgtHFRCYICUGc_Dq-6BvpnM&export=download"

    # MVD-related file
    mvd_in = "cdk2_cyclin_a_ki_mvd.csv"
    mvd_url = "https://drive.usercontent.google.com/u/0/uc?id=1toALRJ2o37jX1iM0YpGNsLoQvYVBYV6b&export=download"

    # PDB-related file
    pdb_in = "cdk2_cyclin_a_ki_pdb_test_set.csv"
    pdb_url = "https://drive.usercontent.google.com/u/0/uc?id=1bi1oDCeZ7zLuZ5GqdepMBv_FG4hKHByi&export=download"

    ############################################################################
    # CDK19 IC50
    ############################################################################
    # SAnDReS-related file
    #san_in = "CDK19_IC50_SAN.csv"     # File with results obtained with SAnDReS
    #san_url = "https://drive.usercontent.google.com/download?id=1FvUlSugUjcUNuR-1E71eXZ5dQRrHangK&export=view&authuser=0" # Url with san_in file
    # MVD-related file
    #mvd_in = "CDK19_IC50_MVD.csv"     # File with results determined using MVD
    #mvd_url = "https://drive.usercontent.google.com/download?id=1GaS9tCaB5VeRTa2eKwGdjNsYDmPpBgCh&export=view&authuser=0" # Url with mvd_in file

    # PDB-related file
    #pdb_in = "CDK19_IC50_PDB_Test_Set.csv" # File with PDB codes for test set
    #pdb_url = "https://drive.usercontent.google.com/download?id=1Kpay9Da_hxgr5_GEy_bNkAWpbcxSEOwO&export=view&authuser=0" # Url with pdb_in file

    ############################################################################
    # Download files
    ############################################################################
    # Download san_in file
    msg_out = "\n\nDownloading "+san_in
    print(msg_out,end = "...")
    urlretrieve(san_url, san_in)
    san_in = "/content/"+san_in       # For Google Colab environment
    print("done!")

    # Download mvd_in file
    msg_out = "Downloading "+mvd_in
    print(msg_out,end = "...")
    urlretrieve(mvd_url, mvd_in)
    mvd_in = "/content/"+mvd_in       # For Google Colab environment
    print("done!")

    # Download pdb_in file
    msg_out = "Downloading "+pdb_in
    print(msg_out,end = "...")
    urlretrieve(pdb_url, pdb_in)
    pdb_in = "/content/"+pdb_in       # For Google Colab environment
    print("done!")

    ############################################################################
    # Split dataset
    ############################################################################
    # Set up an empty list for pdb access codes
    pdb_lst = []

    # Try to open a csv file
    try:
        fo_pdb = open(pdb_in,"r")
        csv_pdb = csv.reader(fo_pdb)
    except IOError:
        msg_out = "\nIOError! I can't find file "+pdb_in+"!"
        sys.exit(msg_out)

    # Looping through csv_pdb to get PDB codes
    for pdbs in csv_pdb:
        for pdb in pdbs:
            pdb_lst.append(pdb)

    # Close file
    fo_pdb.close()

    # Try to open a csv file
    try:
        fo_bind = open(san_in,"r")
        csv_bind = csv.reader(fo_bind)
    except IOError:
        msg_out = "\nIOError! I can't find file "+san_in+"!"
        sys.exit(msg_out)

    # Get header
    for line in csv_bind:
        training_out = ""

        # Looping through column labels
        for line1 in line:
            training_out += line1.strip()+","

        training_out = training_out[:len(training_out)-1]+"\n"
        test_out = training_out
        break

    # Assign zero to count_train
    count_train = 0

    # Looping through csv_bind
    for line in csv_bind:
        if line[0].strip() in pdb_lst:
            test_out += str(line).replace("[","").replace("]","").\
                        replace("'","").replace(" ","")+"\n"
        else:
            training_out += str(line).replace("[","").replace("]","").\
                        replace("'","").replace(" ","")+"\n"
            count_train += 1

    # Open new files
    training_set_file = san_in.replace(".csv","_training_set.csv")
    test_set_file = san_in.replace(".csv","_test_set.csv")
    fo_training = open(training_set_file,"w")
    fo_test = open(test_set_file,"w")

    # Write data
    fo_training.write(training_out)
    fo_test.write(test_out)

    # Close files
    fo_bind.close()
    fo_training.close()
    fo_test.close()

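    # Show message (test instances are counted from the lines written to
    # test_out, excluding the header, not from the number of PDB codes)
    msg_out = "\nFile "+training_set_file+" has "+str(count_train)+" instances."
    msg_out += "\nFile "+test_set_file+" has "+str(test_out.count("\n")-1)
    msg_out += " instances."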
    print(msg_out)

    ############################################################################
    # Merge datasets
    ############################################################################
    # Call merge_san_mvd() function for the training set
    merge_san_mvd(training_set_file,mvd_in)

    # Call merge_san_mvd() function for the test set
    merge_san_mvd(test_set_file,mvd_in)

################################################################################
# Call main() function                                                         #
################################################################################
main()
################################################################################



Downloading cdk2_cyclin_a_ki_sandres.csv...done!
Downloading cdk2_cyclin_a_ki_mvd.csv...done!
Downloading cdk2_cyclin_a_ki_pdb_test_set.csv...done!

File /content/cdk2_cyclin_a_ki_sandres_training_set.csv has 106 instances.
File /content/cdk2_cyclin_a_ki_sandres_test_set.csv has 45 instances.

Number of instances written to file /content/cdk2_cyclin_a_ki_mvd_Training_Set.csv : 106

Number of instances written to file /content/cdk2_cyclin_a_ki_mvd_Test_Set.csv : 45
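
Before passing the merged files to a regression workflow, it may help to confirm that each one has the reported number of rows and that every row has as many fields as the header. The sketch below is one possible check, not part of MVD4ML; the helper `summarize_csv()` and the demonstration file are hypothetical, and the same call could be pointed at the `_Training_Set.csv` and `_Test_Set.csv` files listed above.

```python
import csv
import os
import tempfile

def summarize_csv(path):
    """Return (n_columns, n_rows, ragged_lines) for a CSV file with a header."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        n_rows = 0
        ragged = []
        # Line numbers start at 2 because line 1 is the header
        for number, row in enumerate(reader, start=2):
            n_rows += 1
            if len(row) != len(header):
                ragged.append(number)
    return len(header), n_rows, ragged

# Demonstration on a small temporary file with one deliberately short row
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo_Training_Set.csv")
    with open(demo, "w", newline="") as handle:
        handle.write("PDB,Ligand,pKi\n1ABC,LIG1,7.2\n3DEF,LIG3\n")
    n_cols, n_rows, ragged = summarize_csv(demo)
    print(n_cols, "columns,", n_rows, "rows, ragged lines:", ragged)
```

A non-empty list of ragged lines would suggest that a field containing a comma was split during the merge.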
