2 Prepare Databases

Sascha Jung edited this page May 10, 2021 · 14 revisions

Databases

VSFlow contains a tool to prepare compound libraries for virtual screening. It supports standardization of the molecules, generation of fingerprints and generation of multiple conformers. Supported input files are SD files (.sdf/.sdf.gz), CSV and Excel files (containing molecules as SMILES or InChI) and several text formats (.smi, .sma, .ich, .tsv, .txt containing SMILES/InChI). The output is a "virtual screening database" (.vsdb) file, a pickle file that contains all molecules in a special dictionary format ready to use with VSFlow.
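
Because a .vsdb file is a regular pickle file, it can in principle be written and read with Python's standard library. The sketch below is only an illustration and not VSFlow's actual code: the dictionary layout (integer keys mapping to records with `name` and `smiles` fields) and the helper names `save_db` and `load_db` are hypothetical, since the real internal structure is not described on this page. Note that unpickling can execute arbitrary code, so only load .vsdb files from sources you trust.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a prepared database; the real VSFlow layout
# is not documented on this page and will differ.
toy_db = {
    0: {"name": "ethanol", "smiles": "CCO"},
    1: {"name": "pyridine", "smiles": "C1=CN=CC=C1"},
}


def save_db(db, path):
    """Write a dictionary to disk as a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(db, handle)


def load_db(path):
    """Read a pickled dictionary back from disk (trusted files only)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip through a temporary file with the .vsdb extension.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "toy.vsdb")
    save_db(toy_db, db_path)
    restored = load_db(db_path)

print(len(restored), "entries; first record:", restored[0])
```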

General Usage

Always make sure the conda environment is activated.

vsflow preparedb {arguments}

The following command will display the help with all available arguments:

vsflow preparedb -h

Arguments:

Required:

  • -i, --input
    specify path of input file

OR

  • -d, --download
    specify shortcut for database that should be downloaded [chembl or pdb]

Optional:

  • -o, --output
    specify name (and path) of output file without file extension [default: prep_database]
  • -int, --integrate
    specify shortcut for database; saves database to $HOME/VSFlow_Databases
  • -intg, --int_global
    specify shortcut for database; stores prepared database within the repository folder
  • -s, --standardize
    standardizes molecules, removes salts and associated charges
  • -can, --canonicalize
    if specified, the canonical tautomer for every molecule is generated and stored in the output database file
  • -c, --conformers
    generates multiple 3D conformers for database molecules
  • -np, --nproc
    specify number of processors to run application in multiprocessing mode
  • -f, --fingerprint
    if specified, the selected fingerprint is generated for database molecules [rdkit, ecfp, fcfp, ap, tt, maccs]
  • -r, --radius
    specify radius for circular fingerprints ecfp and fcfp [default: 2]
  • -nb, --nbits
    specify bit size of fingerprints [default: 2048]
  • --no_chiral
    if specified, chirality of molecules will not be considered for fingerprint generation
  • --max_tauts
    maximum number of tautomers to be enumerated during standardization process [default: 100]
  • --nconfs
    maximum number of conformers generated [default: 20]
  • --rms_thresh
    if specified, only those conformations out of --nconfs that are at least this different are retained (RMSD calculated on heavy atoms)
  • --seed
    specify seed for random number generator, for reproducibility
  • --boost
    distributes conformer generation on all available threads of your cpu
  • --header
    Specify number of row in csv/xlsx file containing the column names, if not automatically recognized [e.g. 1 for first row]
  • --mol_column
    Specify name (or position) of mol column [SMILES/InChI] in csv/xlsx file, if not automatically recognized [e.g. 'SMILES' or '1' (for first column)]
  • --delimiter
    Specify delimiter of csv file, if not automatically recognized
  • -h, --help
    show this help message and exit

Detailed Usage

Navigate into the examples folder.
For the following examples, you can download all ligands from the PDB (with ideal geometries) or the ChEMBL database directly within VSFlow (a working internet connection is required), e.g.:

vsflow preparedb -d pdb -o pdb_ligs 

With the above command, the file containing the PDB ligands is automatically downloaded and written to the output file pdb_ligs.vsdb in the examples folder. You can perform all operations described below using this file as input.

However, to quickly demonstrate the usage, you can specify the SD file containing all FDA-approved drugs (approx. 1600 compounds, downloaded from the public ZINC database: https://zinc.docking.org/substances/subsets/fda/) in the examples folder as input:

vsflow preparedb -i fda.sdf -o fda_drugs 

In the above example, the input is simply converted to a vsdb file, without any standardization or conformer/fingerprint generation. It is generally recommended to convert frequently used compound libraries to vsdb files because they typically load much faster. If you do not provide an output file name, the output is written to prep_database.vsdb by default. It is not necessary to provide the file extension for the output file; databases are always saved as a pickle file with the file extension .vsdb. However, it is essential to provide the file extension for the input file, since VSFlow recognizes the file format from its extension. Besides plain SD files, supported input formats are text files (.smi, .sma, .ich, .txt/.txt.gz, .csv/.csv.gz, .tsv/.tsv.gz), gzipped SD files (.sdf.gz) and Excel files (.xlsx).
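
Recognizing the format from the file extension, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions, not VSFlow's implementation: the function name `detect_format` and the format labels "sdf", "text" and "excel" are hypothetical; only the extensions themselves are taken from this page.

```python
from pathlib import Path

# Extensions listed on this page; the grouping into labels is an assumption.
TEXT_EXTS = {".smi", ".sma", ".ich", ".txt", ".csv", ".tsv"}
# Extensions that this page lists with a gzipped variant.
GZIP_OK = {".sdf", ".txt", ".csv", ".tsv"}


def detect_format(filename):
    """Return (format_label, is_gzipped) for a supported input file name."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        suffixes = suffixes[:-1]  # look at the extension in front of .gz
    if not suffixes:
        raise ValueError(f"{filename!r} has no recognizable file extension")
    ext = suffixes[-1]
    if gzipped and ext not in GZIP_OK:
        raise ValueError(f"gzipped {ext} files are not listed as supported")
    if ext == ".sdf":
        return "sdf", gzipped
    if ext == ".xlsx":
        return "excel", gzipped
    if ext in TEXT_EXTS:
        return "text", gzipped
    raise ValueError(f"unsupported file extension {ext!r}")


for name in ("fda.sdf", "fda.sdf.gz", "library.csv.gz", "compounds.xlsx"):
    print(name, "->", detect_format(name))
```

A file name without an extension, such as `fda`, raises a `ValueError` in this sketch, which mirrors why the input extension must always be given.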

By specifying the -s/--standardize flag, all compounds in the database are additionally standardized:

vsflow preparedb -i fda.sdf -o fda_std -s

Standardization includes: standardization according to MolVS [1] rules, disconnecting metals and salts, and removing charges. Additionally, the canonical tautomer can be generated (-can/--canonicalize argument) and stored in the database file alongside the standardized molecule:

vsflow preparedb -i fda.sdf -o fda_std -s -can

Standardization and canonicalization are generally recommended, and are required to use some screening capabilities of VSFlow properly.

Since tautomer generation can be time-consuming for larger compound databases, parallelization is possible to speed things up via the -np/--nproc argument, e.g. by running on 6 cores/threads (probably available on most modern machines):

vsflow preparedb -i fda.sdf -o fda_std -s -can -np 6

Parallelization is done via Python's built-in multiprocessing module.

By specifying the -f/--fingerprint argument, the selected fingerprints are generated for all molecules and stored within the output database file:

vsflow preparedb -i fda.sdf -o fda_std_fps -s -can -f ecfp -np 6

The following fingerprints are supported:

Fingerprints

  • fcfp: FCFP-like Morgan fingerprint from the RDKit (extended connectivity fingerprint with pharmacophore feature definitions, circular fingerprint)
  • ecfp: ECFP-like Morgan fingerprint from the RDKit (extended connectivity fingerprint, circular fingerprint)
  • rdkit: RDKit fingerprint (Daylight-like fingerprint, substructure fingerprint)
  • ap: Atom Pairs fingerprint from the RDKit
  • tt: Topological Torsion fingerprint from the RDKit
  • maccs: SMARTS-based implementation of the 166 public MACCS keys from the RDKit

Via the -r/--radius argument, the radius (default: 2) for the circular fingerprints ecfp and fcfp can be changed. The -nb/--nbits argument changes the bit size of the fingerprint (default: 2048), and if the --no_chiral argument is specified, chirality is ignored, e.g.:

vsflow preparedb -i fda.sdf -o fda_fps -f ecfp -r 3 -nb 4096 --no_chiral

By specifying the -c/--conformers flag, 3D conformers for all database compounds are generated:

vsflow preparedb -i fda.sdf -o fda_confs -c

You can optionally use the --seed argument to specify a seed for the random number generator for reproducibility purposes:

vsflow preparedb -i fda.sdf -o fda_std_confs -c --seed 42

Caveat: 3D information of molecules read from an input SD file is overwritten when generating conformers!
By default, 20 conformers per molecule are generated. This can be changed using the --nconfs argument. By specifying the --rms_thresh argument, only those conformers out of --nconfs which have an RMSD (calculated on heavy atoms) greater than the specified value are retained, e.g.:

vsflow preparedb -i fda.sdf -o fda_std_confs -c --nconfs 10 --rms_thresh 0.3

With the above command, 10 conformers per compound are generated and only those with an RMSD greater than 0.3 are retained. The --rms_thresh argument is useful to keep only those conformers with significant differences.
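
One common way to apply such a threshold is greedy pruning: a conformer is kept only if its RMSD to every conformer already kept is greater than the threshold. The pure-Python sketch below illustrates the idea under simplifying assumptions: coordinates are heavy atoms only, all conformers share the same atom order, and they are assumed to be aligned already (no superposition is performed). The function names are hypothetical, and this is not the code VSFlow or RDKit actually runs.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def prune_conformers(conformers, rms_thresh):
    """Keep a conformer only if it differs from all kept ones by > rms_thresh."""
    kept = []
    for conf in conformers:
        if all(rmsd(conf, other) > rms_thresh for other in kept):
            kept.append(conf)
    return kept


# Toy two-atom "conformers" (x, y, z per heavy atom); values are made up.
conf_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.5, 0.1, 0.0)]  # nearly identical to conf_a
conf_c = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0)]  # clearly different

kept = prune_conformers([conf_a, conf_b, conf_c], rms_thresh=0.3)
print(len(kept), "of 3 conformers retained")
```

In this toy input, `conf_b` lies within the threshold of `conf_a` and is discarded, so fewer conformers than requested can remain after pruning.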

Since conformer generation can be time-consuming, it is reasonable to parallelize tasks using the -np/--nproc argument. Here you can specify the number of cores/threads of your system to be used for parallelization. VSFlow checks that you do not specify more threads than are available. With the following command, molecules are standardized, 30 conformers are generated and fingerprints are calculated using 6 cores/threads:

vsflow preparedb -i fda.sdf -o fda_s_confs_fps -s -c -f ecfp --nconfs 30 -np 6
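
How VSFlow handles a thread count that exceeds the available hardware is not spelled out here. One plausible approach, shown purely as an assumption, is to clamp the requested value to the number of CPUs reported by the operating system. The helper name `effective_nproc` is hypothetical.

```python
import os


def effective_nproc(requested, available=None):
    """Clamp a requested worker count to the range 1..available."""
    if available is None:
        # os.cpu_count() may return None if the count cannot be determined.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


# Requesting 6 workers on a (hypothetical) 4-thread machine yields 4.
print(effective_nproc(6, available=4))
# Without an explicit limit, the local CPU count is used as the upper bound.
print(effective_nproc(6))
```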

To speed up calculations further, conformer generation itself can be distributed to all available threads of the system via the C++ code of RDKit with the --boost flag:

vsflow preparedb -i fda.sdf -o fda_std_confs -s -c -f ecfp -np 6 --boost

This distributes the calculations to 6 threads and additionally uses all other available threads to generate the conformers. You can compare the run times of the above examples on your machine.
Caveat: Make sure you are not running other important tasks when parallelizing to all available threads, since this may slow down your machine!

Integration of Databases

Instead of specifying the -o/--output argument, it is also possible to "integrate" the database into VSFlow using the -int/--integrate and -intg/--int_global arguments. Do not provide a file extension, just specify a shortcut name:

vsflow preparedb -i fda.sdf -int fda -s -c

With the above command, the prepared database is saved as a file named fda.vsdb in the folder "VSFlow_Databases" in the user's HOME directory. VSFlow can now access the database from anywhere on the system; the user only needs to pass the shortcut name to the -d/--database argument, e.g. for a substructure search (see page Substructure Search for more information):

vsflow substructure -smi C1=CN=CC=C1 -d fda -o pyr_subs.sdf
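
Conceptually, a shortcut such as `fda` maps to a file in the database folder. The sketch below shows that mapping for the default location named above; the function name `shortcut_to_path` and the validation rules are assumptions made for illustration, not VSFlow's actual lookup code.

```python
from pathlib import Path

# Default per-user database folder as described on this page.
DEFAULT_DB_DIR = Path.home() / "VSFlow_Databases"


def shortcut_to_path(shortcut, db_dir=DEFAULT_DB_DIR):
    """Map a bare shortcut name to the corresponding .vsdb file path."""
    as_path = Path(shortcut)
    # Reject empty names, names with directories, and names with extensions.
    if not shortcut or as_path.name != shortcut or as_path.suffix:
        raise ValueError("shortcut must be a bare name without path or extension")
    return Path(db_dir) / f"{shortcut}.vsdb"


print(shortcut_to_path("fda"))
```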

When the -intg/--int_global argument is specified instead, the prepared database is also saved to the folder $HOME/VSFlow_Databases by default:

vsflow preparedb -i fda.sdf -intg fda -s

Both paths can be changed in the "managedb" mode (see page "Manage Databases" for more information).
It may be useful to change the global database path (-intg/--int_global) if VSFlow is run on a server with multiple users and some databases should be accessible for all users.