# **prepare_BindingDB: Prepare Data Downloaded from BindingDB**
This program prepares the two files downloaded from BindingDB ([Liu et al., 2007](https://doi.org/10.1093/nar/gkl999); [Liu et al., 2025](https://doi.org/10.1093/nar/gkae1075)): a structure-data file (SD file) and a tab-separated values (TSV) file.
The SD file contains the structures of ligands (3D atomic coordinates) for which binding affinity data are available in BindingDB (https://www.bindingdb.org/). The TSV file provides the binding affinity information (e.g., inhibition constant (K<sub>i</sub>)) for each ligand in the SD file. The code removes ligands with undefined binding affinity (e.g., > 10000) and keeps a single copy of each repeated molecule. It generates filtered files for use with Molegro Virtual Docker ([Thomsen & Christensen, 2006](https://doi.org/10.1021/jm051197e); [Bitencourt-Ferreira & de Azevedo, 2019](https://doi.org/10.1007/978-1-4939-9752-7_10)): an SD file for docking screens and a CSV file for machine-learning modeling.
<br></br>
<img src="https://drive.usercontent.google.com/download?id=1C5xi2MMLwTUy31o35xw-pg0H9ChcJ4o4&export=view&authuser=0" width=600 alt="prepare_BindingDB">
<br><i>Schematic flowchart of prepare_BindingDB. This program filters and converts files downloaded from the BindingDB ([Liu et al., 2007](https://doi.org/10.1093/nar/gkl999); [Liu et al., 2025](https://doi.org/10.1093/nar/gkae1075)).</i></br>
<br></br>
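The filtering idea described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the program's actual implementation: the inline TSV excerpt, the function name `filter_affinities`, and the choice to deduplicate on the ligand ID column in a single pass are all hypothetical and chosen only for the example.

```python
import csv
import io
import math

# Hypothetical TSV excerpt; real BindingDB exports have many more columns.
TSV_TEXT = (
    "BindingDB Reactant_set_id\tKi (nM)\n"
    "101\t12.5\n"
    "102\t>10000\n"
    "103\t 300\n"
    "104\t<1\n"
    "101\t12.5\n"
)


def filter_affinities(tsv_text, id_col="BindingDB Reactant_set_id",
                      bind_col="Ki (nM)"):
    """Return (ligand_id, affinity) pairs with a numeric affinity,
    keeping only the first copy of each ligand ID."""
    kept = []
    seen = set()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        lig_id = row[id_col]
        try:
            # Qualified values such as ">10000" or "<1" raise ValueError
            affinity = float(row[bind_col])
        except ValueError:
            continue
        # float("nan") does not raise, so NaN is tested explicitly
        if math.isnan(affinity):
            continue
        # Skip repeated ligands, keeping the first copy
        if lig_id in seen:
            continue
        seen.add(lig_id)
        kept.append((lig_id, affinity))
    return kept


filtered = filter_affinities(TSV_TEXT)
for lig_id, affinity in filtered:
    print(lig_id, affinity)
```

With the excerpt above, only ligands 101 and 103 remain: entries 102 and 104 carry qualified values, and the second copy of 101 is a duplicate.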

**References**
<br></br>
Bitencourt-Ferreira G, de Azevedo WF Jr. Molegro Virtual Docker for Docking. Methods Mol Biol. 2019;2053:149-167. PMID: 31452104. [DOI: 10.1007/978-1-4939-9752-7_10](https://doi.org/10.1007/978-1-4939-9752-7_10) [PubMed](https://pubmed.ncbi.nlm.nih.gov/31452104/)
<br></br>
Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007; 35(Database issue):D198-201.
[DOI: 10.1093/nar/gkl999](https://doi.org/10.1093/nar/gkl999)
[PubMed](https://pubmed.ncbi.nlm.nih.gov/17145705/)
<br></br>
Liu T, Hwang L, Burley SK, Nitsche CI, Southan C, Walters WP, Gilson MK. BindingDB in 2024: a FAIR knowledgebase of protein-small molecule binding data. Nucleic Acids Res. 2025; 53(D1):D1633-D1644.
[DOI: 10.1093/nar/gkae1075](https://doi.org/10.1093/nar/gkae1075)
[PubMed](https://pubmed.ncbi.nlm.nih.gov/39574417/)
<br></br>
Thomsen R, Christensen MH. MolDock: a new technique for high-accuracy molecular docking. J Med Chem. 2006; 49(11): 3315-21. [DOI: 10.1021/jm051197e](https://doi.org/10.1021/jm051197e) [PubMed](https://pubmed.ncbi.nlm.nih.gov/16722650/)
<br></br>
The complete Python code follows.


In [None]:
#!/usr/bin/env python3
#
################################################################################
# Dr. Walter F. de Azevedo, Jr.                                                #
# [Scopus](https://www.scopus.com/authid/detail.uri?authorId=7006435557)       #
# [GitHub](https://github.com/azevedolab)                                      #
# January 12, 2025                                                             #
################################################################################
#
# Python program to clean files (structure-data file (SD file) and TSV
# (tab-separated values) file) downloaded from the BindingDB. It has two
# classes to handle different file formats: TSV (tab-separated values) and SDF
# (structure-data file). It outputs two files (CSV and SDF).
#
# Import section
import os
import warnings

import pandas as pd
import requests

# Ignore warnings
warnings.filterwarnings("ignore")

################################################################################
# Define TSV() class                                                           #
################################################################################
class TSV(object):
    """Class to filter a TSV (tab-separated values) file downloaded from the
    BindingDB (https://www.bindingdb.org/). It reads ligand IDs and binding
    affinity values, lists the ligands without a numeric binding affinity, and
    can write the remaining data to a CSV file.

    It has the following attributes:
    tsv_in (string):                Input TSV downloaded from the BindingDB
    tsv_id (string):                Google Drive identification for a TSV file
    bind_in (string):               Type of binding affinity (e.g., "Ki (nM)")
    columns (list):                 List of headers to read from the TSV file
    csv_out (string):               Output CSV (comma-separated values) file

    """
    # Define constructor method
    def __init__(self,tsv_in,tsv_id,bind_in):
        """Constructor method"""
        # Define attributes
        self.bind_in = bind_in
        self.tsv_in = tsv_in
        self.tsv_id = tsv_id
        self.columns = ["BindingDB Reactant_set_id",self.bind_in]
        self.csv_out = self.tsv_in.replace(".tsv","")+"_BindingDB_"+\
                                            self.bind_in.replace(" (nM)",".csv")

        # Define additional lists and strings
        self.mols2exclude = []
        self.clean_exp_bind = []
        self.data_out = "Ligand,"+self.bind_in+"\n"     # Header of the CSV file
        self.drive_string1 = "https://drive.usercontent.google.com/u/0/uc?id="
        self.drive_string2 = "&export=download"

    # Define read() method
    def read(self):
        """Method to download and read a TSV file."""
        # Download TSV
        msg_out = "\nDownloading "+self.tsv_in
        print(msg_out,end="...")
        tsv_url = self.drive_string1+self.tsv_id+self.drive_string2
        tsv = requests.get(tsv_url, allow_redirects=True)
        tsv.raise_for_status()      # Stop if the download failed
        with open("/content/"+self.tsv_in, "wb") as fo_tsv:
            fo_tsv.write(tsv.content)
        print("done!")

        # Read TSV file into DataFrame
        df = pd.read_table("/content/"+self.tsv_in,usecols = self.columns)

        # Get ligand IDs and binding affinity values from the DataFrame
        self.exp_ligs = df.iloc[:,0]
        self.exp_bind = df.iloc[:,1]

        # Looping through exp_ligs
        for i,lig in enumerate(self.exp_ligs):
            # Check whether exp_bind[i] converts to a float. Qualified values
            # (e.g., ">10000") raise a ValueError and exclude the ligand
            try:
                bind_value = float(self.exp_bind[i])
                self.data_out += str(lig)+","+str(self.exp_bind[i])+"\n"
                self.clean_exp_bind.append(bind_value)
            except (TypeError, ValueError):
                self.mols2exclude.append(lig)

    # Define write() method
    def write(self):
        """Method to write a CSV file."""
        # Write CSV
        fo_new = open("/content/"+self.csv_out,"w")
        fo_new.write(self.data_out)
        fo_new.close()

        # Show message
        msg_out = "\n\nBinding affinity written to CSV file: "+self.csv_out
        print(msg_out)

    # Define summarize() method
    def summarize(self):
        """Method to write a summary of the data."""

        # Show ligands with missing data
        lig_summary = "\n\n"+59*"#"
        lig_summary += "\n"+"#"+15*" "+" LIGANDS WITH MISSING DATA "+15*" "+"#"
        lig_summary += "\n"+59*"#"
        print(lig_summary)

        # Looping through mols2exclude
        for lig in self.mols2exclude:
            lig = "\""+str(lig)+"\","
            print(lig)
        print(59*"#")

        # Show summary
        summary = "\n\n"+59*"#"
        summary += "\n"+"#"+24*" "+" SUMMARY "+24*" "+"#"
        summary += "\n"+59*"#"
        summary += "\nInput TSV file: "+self.tsv_in+"\n"
        summary += "Type of binding affinity: "+self.bind_in+"\n"
        summary += "Number of ligands with NaN for binding affinity: "
        summary += str(len(self.mols2exclude))+"\n"
        summary += "Minimum "+self.bind_in+": "
        summary += str(min(self.clean_exp_bind))+"\n"
        summary += "Maximum "+self.bind_in+": "
        summary += str(max(self.clean_exp_bind))+"\n"
        summary += 59*"#"
        summary += "\n"+"#"+18*" "+" BindingDB REFERENCES "+17*" "+"#\n"
        summary += "#"+15*" "+" https://www.bindingdb.org "+15*" "+"#\n"
        summary += 59*"#"
        summary += "\n# DOI:10.1093/nar/gkl999"+34*" "+"#"
        summary += "\n# DOI:10.1093/nar/gkae1075"+32*" "+"#"
        summary += "\n"+59*"#"
        print(summary)

################################################################################
# Define SDF() class                                                           #
################################################################################
class SDF(object):
    """Class to edit some aspects of an SDF downloaded from the BindingDB. It
    outputs a clean SDF with unique ligand structures.

    It has the following attributes:
    sdf_in (string):                Input SDF downloaded from BindingDB
    sdf_id (string):                Google Drive identification for an SD file
    mols2exclude (list):            List of ligands to be excluded from the
                                    output files
    bind_string (string):           String to identify the binding affinity
    sdf_out (string):               Output SDF
    csv_out (string):               Output CSV file

    """
    # Define constructor method
    def __init__(self,sdf_in,sdf_id,mols2exclude,bind_string):
        """Constructor method"""
        # Define attributes
        self.sdf_in = sdf_in
        self.sdf_id = sdf_id
        self.mols2exclude = mols2exclude
        self.bind_string = bind_string
        self.sdf_out = self.sdf_in.replace(".sdf","_4Docking.sdf")
        self.csv_out = self.sdf_out.replace("_4Docking.sdf","_4Binding.csv")

        # Define additional strings
        self.drive_string1 = "https://drive.usercontent.google.com/u/0/uc?id="
        self.drive_string2 = "&export=download"

    # Define clean() method
    def clean(self):
        """Method to clean an SDF."""
        # Download SDF
        msg_out = "\nDownloading "+self.sdf_in
        print(msg_out,end="...")
        sdf_url = self.drive_string1+self.sdf_id+self.drive_string2
        sdf = requests.get(sdf_url, allow_redirects=True)
        sdf.raise_for_status()      # Stop if the download failed
        with open("/content/"+self.sdf_in, "wb") as fo_raw:
            fo_raw.write(sdf.content)
        print("done!")

        # Open SDF
        fo_sdf = open("/content/"+self.sdf_in,"r")

        # Set up lists and strings
        self.ligands = []
        self.binding = []
        self.list_found = []
        self.data_out = ""
        self.nan_data_out = ""
        lig_data = ""

        # Assign initial values to Boolean variables and counter
        MonomerID_flag = False  # Flag to indicate
                                # it found "> <BindingDB MonomerID>"
        bind_string_flag = False  # Flag to indicate it found bind_string
        get_line = False        # Boolean variable to omit first line
        self.count_mols = 0

        # Looping through fo_sdf
        print("\nCleaning BindingDB SD file: "+self.sdf_in,end = "...")
        for line in fo_sdf:
            # For BindingDB MonomerID
            if MonomerID_flag:
                lig_id = str(line)
                MonomerID_flag = False

            elif "> <BindingDB MonomerID>" in str(line):
                MonomerID_flag = True

            # For binding affinity value
            if bind_string_flag:
                binding_value = str(line)
                bind_string_flag = False

            elif self.bind_string in str(line):
                bind_string_flag = True

            # For other lines
            if get_line:
                lig_data += str(line)

            # For the end of a record ("$$$$")
            if "$$$$" in str(line):
                # Update count_mols
                self.count_mols += 1

                # Handle data to write
                lig = lig_id.replace("\n","")
                binding_value = binding_value.replace("\n","")

                # Include only ligands not yet written and not excluded
                flag2include = lig not in self.ligands and \
                                            lig not in self.mols2exclude

                # Looping through self.mols2exclude
                if flag2include:
                    for aux_lig in self.mols2exclude:
                        aux_lig = str(aux_lig).replace("\n","").replace(" ","")
                        if aux_lig in lig_data and aux_lig not in \
                                                                self.list_found:
                            self.list_found.append(aux_lig)
                            flag2include = False
                            break

                # Skip qualified binding values (e.g., ">10000" or "<1")
                if flag2include and "<" not in str(binding_value) and \
                                                ">" not in str(binding_value):
                    self.ligands.append(lig)
                    self.binding.append(binding_value)
                    self.data_out += lig_id
                    self.data_out += lig_data
                else:
                    # Update nan_data_out
                    aux_nan = "NaN or duplicated ligand for: "
                    aux_nan += lig
                    aux_nan += " ("+binding_value+")\n"
                    self.nan_data_out += aux_nan

                # Reset record data and omit the first line of the next record
                lig_data = ""
                get_line = False
            else:
                get_line = True

        print("done!")

        # Close SDF
        fo_sdf.close()

    # Define write_sdf() method
    def write_sdf(self):
        """Method to write a previously cleaned SDF."""
        # Open a new SDF
        fo_new = open("/content/"+self.sdf_out,"w")

        # Write data_out
        fo_new.write(self.data_out)

        # Close SDF
        fo_new.close()

    # Define write_binding_csv() method
    def write_binding_csv(self):
        """Method to write a CSV file with ligand identifications and binding
        affinity values."""
        # Define feature_bind and lig_binding_data
        feature_bind = self.bind_string.replace(" (nM)","").replace("<","").\
                                replace(">","")
        lig_binding_data = "Ligand,"+feature_bind+"\n"

        # Looping through self.ligands to get ligands and binding affinity
        for i,lig in enumerate(self.ligands):
            lig_binding_data += str(lig)+","+str(self.binding[i])+"\n"

        # Open a new CSV file
        fo_csv = open("/content/"+self.csv_out,"w")

        # Write lig_binding_data
        fo_csv.write(lig_binding_data)

        # Close CSV
        fo_csv.close()

    # Define summarize() method
    def summarize(self):
        """Method to show a summary of the results with input file,
        output files, and the number of unique ligands."""
        # Show summary
        summary = "\n\n"+59*"#"
        summary += "\n"+"#"+24*" "+" SUMMARY "+24*" "+"#"
        summary += "\n"+59*"#"
        summary += "\nInput SDF: "+self.sdf_in
        summary += "\nNumber of molecules found in the BindingDB SDF: "
        summary += str(self.count_mols)
        summary += "\nNumber of excluded molecules: "
        summary += str(self.count_mols - len(self.ligands))
        summary += "\nOutput SDF (for docking with MVD): "+self.sdf_out
        summary += "\nOutput CSV file: "+self.csv_out
        summary += "\nNumber of molecules written in the output files: "
        summary += str(len(self.ligands))+"\n"
        summary += 59*"#"
        print(summary)
        print()
        print(self.nan_data_out)

################################################################################
# Define main() function                                                       #
################################################################################
def main():
    # Define inputs for each dataset
    ############################################################################
    #  Cyclin-dependent kinase/G2/mitotic-specific cyclin- 1 [ 181 ]
    ############################################################################
    tsv_in = "CDK1.tsv"                            # TSV file from the BindingDB
    tsv_id = "1wViOlZt6o258GstiDxqj9AgQUJuBk1RB"   # Drive id for a TSV file
    bind_in = "Ki (nM)"                            # "IC50 (nM)" "Kd (nM)"
    sdf_in = "CDK1.sdf"                            # SD file from the BindingDB
    sdf_id = "13UcgkhkdmKXTm2VMsp1uW7oK624EmS1T"   # Drive id for a SD file
    bind_string = "> <Ki (nM)>"                    # String with binding
                                                   # affinity in the following
                                                   # line of an SD file

    ############################################################################
    # Cyclin-dependent kinase 2/G1/S-specific cyclin-E1 [ 1450 (2102) ]
    ############################################################################
    #tsv_in = "CDK2.tsv"                           # TSV file from the BindingDB
    #tsv_id = "1nEpxG7PRgZbEvreedU6CShSvE3PjtBXI"  # Drive id for a TSV file
    #bind_in = "Ki (nM)"                           # "IC50 (nM)" "Kd (nM)"
    #sdf_in = "CDK2.sdf"                           # SD file from the BindingDB
    #sdf_id = "1gnOFpGZpdylN21EW2_sPqYmrNHWd9bDE"  # Drive id for a SD file
    #bind_string = "> <Ki (nM)>"                   # String with binding
                                                   # affinity in the following
                                                   # line of an SD file

    ############################################################################
    # Cyclin-A2/Cyclin-dependent kinase 2 [ 164 ]
    ############################################################################
    #tsv_in = "CDK2-Cyclin_A2_Ki.tsv"              # TSV file from the BindingDB
    #tsv_id = "1oLT0ADlmD4GFE9hu9Iv4L7yQ3ipNsThb"  # Drive id for a TSV file
    #bind_in = "Ki (nM)"                           # "IC50 (nM)" "Kd (nM)"
    #sdf_in = "CDK2-Cyclin_A2_Ki.sdf"              # SD file from the BindingDB
    #sdf_id = "1A5VgZ3NXpoof0PMhknilqWCYqgOLeSx1"  # Drive id for a SD file
    #bind_string = "> <Ki (nM)>"                   # String with binding
                                                   # affinity in the following
                                                   # line of an SD file

    ############################################################################
    # Cyclin-dependent kinase 4/G1/S-specific cyclin-D1 [ 455 (460) ]
    ############################################################################
    #tsv_in = "CDK4.tsv"                           # TSV file from the BindingDB
    #tsv_id = "1GcsOhUBBsIQm0S9aQoxNW7GkHFYdO6cz"  # Drive id for a TSV file
    #bind_in = "Ki (nM)"                           # "Kd (nM)" "IC50 (nM)"
    #sdf_in = "CDK4.sdf"                           # SD file from the BindingDB
    #sdf_id = "11Xu9ip9YSjp9TF_ra7bSl1xYMp4JO5rM"  # Drive id for a SD file
    #bind_string = "> <Ki (nM)>"                   # String with binding
                                                   # affinity in the following
                                                   # line of an SD file

    ############################################################################
    # Cyclin-dependent kinase 6/G1/S-specific cyclin-D1 [ 415 ]
    ############################################################################
    #tsv_in = "CDK6.tsv"                           # TSV file from the BindingDB
    #tsv_id = "1sOFLG8Afq_fKh3rYLEA19vgbTkWpxxmr"  # Drive id for a TSV file
    #bind_in = "Ki (nM)"                           # "Kd (nM)" "IC50 (nM)"
    #sdf_in = "CDK6.sdf"                           # SD file from the BindingDB
    #sdf_id = "1khdWR1-Y8wyGRicr3KrQbrX4wAZ6j1fh"  # Drive id for a SD file
    #bind_string = "> <Ki (nM)>"                   # String with binding
                                                   # affinity in the following
                                                   # line of an SD file

    ############################################################################
    # Cyclin-H/Cyclin-dependent kinase 7 [ 123 ]
    ############################################################################
    #tsv_in = "CDK7.tsv"                           # TSV file from the BindingDB
    #tsv_id = "1xBdyhOAgl0k1JYI6V2HS4IshAtoKPwrQ"  # Drive id for a TSV file
    #bind_in = "Ki (nM)"                           # "Kd (nM)" "IC50 (nM)"
    #sdf_in = "CDK7.sdf"                           # SD file from the BindingDB
    #sdf_id = "1aSmBAVBgnKxmTqHIpwD9sbXpBV6FHDEA"  # Drive id for a SD file
    #bind_string = "> <Ki (nM)>"                   # String with binding
                                                   # affinity in the following
                                                   # line of an SD file

    ############################################################################
    # Cyclin-T1/Cyclin-dependent kinase 9 [ 201 ]
    ############################################################################
    #tsv_in = "CDK9.tsv"                           # TSV file from the BindingDB
    #tsv_id = "1kAc7sVO6kuPIsNQxITC09r2i2ZQj96rI"  # Drive id for a TSV file
    #bind_in = "Ki (nM)"                           # "Kd (nM)" "IC50 (nM)"
    #sdf_in = "CDK9.sdf"                           # SD file from the BindingDB
    #sdf_id = "1eR0Fryf_mPidIDGZVNiHTUIzlQNGG_34"  # Drive id for a SD file
    #bind_string = "> <Ki (nM)>"                   # String with binding
                                                   # affinity in the following
                                                   # line of an SD file

    ############################################################################
    # Cyclin-C/Cyclin-dependent kinase 19 [ 123 (124) ]
    ############################################################################
    #tsv_in = "CDK19.tsv"                          # TSV file from the BindingDB
    #tsv_id = "1j0Qd5UvI7SHZ6cgUUj1DiYFhF3xNlP5p"  # Drive id for a TSV file
    #bind_in = "IC50 (nM)"                         # "Kd (nM)" "Ki (nM)"
    #sdf_in = "CDK19.sdf"                          # SD file from the BindingDB
    #sdf_id = "15aJCKssO7oygb3JPz7uPd_-4-OPxuYmg"  # Drive id for a SD file
    #bind_string = "> <IC50 (nM)>"                 # String with binding
                                                   # affinity in the following
                                                   # line of an SD file

    # Instantiate an object of TSV class
    t1 = TSV(tsv_in,tsv_id,bind_in)

    # Invoke read() method
    t1.read()

    # Invoke write() method (not necessary here)
    #t1.write()

    # Invoke summarize() method
    t1.summarize()

    # Instantiate an object of SDF() class
    mol1 = SDF(sdf_in,sdf_id,t1.mols2exclude,bind_string)

    # Call clean() method
    mol1.clean()

    # Call write_sdf() method
    mol1.write_sdf()

    # Call write_binding_csv() method
    mol1.write_binding_csv()

    # Call summarize() method
    mol1.summarize()

# Call main() function
main()


Downloading CDK1.tsv...done!


###########################################################
#                LIGANDS WITH MISSING DATA                #
###########################################################
"51130055",
"51130258",
"51130259",
"50224950",
"50224960",
"50224961",
"51083834",
"51220708",
"51220773",
"51220775",
"51220795",
"51220800",
"51220828",
"51220831",
"51220833",
"51220835",
"51220837",
"51288071",
"51288081",
"51288082",
"51288083",
"51288113",
"51407540",
"51407542",
"51407545",
"51407546",
"51407547",
"51407548",
"51407549",
"51407550",
"51407551",
"51407554",
"51407555",
"51407556",
"22681",
"22682",
"22683",
"22684",
"22685",
"50115832",
"51036838",
"51036845",
"51036848",
###########################################################


###########################################################
#                         SUMMARY                         #
###########################################################
Input TSV file: CDK1.tsv
Type of binding affi