The DeepChem library is packaged alongside the MoleculeNet suite of datasets. One of the most important parts of machine learning applications is finding a suitable dataset. The MoleculeNet suite has curated a whole range of datasets and loaded them into DeepChem dc.data.Dataset objects for convenience.
If you are proposing a new dataset to be included in the MoleculeNet benchmarking suite, please follow the instructions below. Please review the datasets already available in MolNet before contributing.
- Read the Contribution guidelines.
- Open an issue to discuss the dataset you want to add to MolNet.
- Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader and implements a create_dataset method. See the _QM9Loader for a simple example.
- Write a load_dataset function that documents the dataset and add your load function to deepchem.molnet.__init__.py for easy importing.
- Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.
- Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.
- Add documentation for your loader to the MoleculeNet docs.
- Submit a [WIP] PR (Work in progress pull request) following the PR template.
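The packaging step in the list above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's required tooling: the file name my_dataset.csv, the column names, and the two rows are hypothetical placeholders.

```python
import csv
import os
import tarfile
import tempfile

# Hypothetical rows and column names, used only for illustration.
rows = [
    {"smiles": "CCO", "target": "0.12"},
    {"smiles": "c1ccccc1", "target": "1.05"},
]

work_dir = tempfile.mkdtemp()
csv_path = os.path.join(work_dir, "my_dataset.csv")

# Write the raw data as CSV, one of the accepted file types.
with open(csv_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["smiles", "target"])
    writer.writeheader()
    writer.writerows(rows)

# Bundle the CSV into a gzip-compressed tarball.
archive_path = os.path.join(work_dir, "my_dataset.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    # arcname keeps the archive flat, so it unpacks to my_dataset.csv.
    archive.add(csv_path, arcname=os.path.basename(csv_path))

# Confirm the archive contains the expected member.
with tarfile.open(archive_path, "r:gz") as archive:
    members = archive.getnames()
print(members)
```

Keeping the archive flat (a single file at the top level) makes it simpler for a create_dataset method to locate the file after extraction.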
Below is an example of how to load a MoleculeNet dataset and featurizer. This approach will work for any dataset in MoleculeNet by changing the load function and featurizer. For more details on the featurizers, see the Featurizers section.
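import deepchem as dc
from deepchem.feat.molecule_featurizers import MolGraphConvFeaturizer

# Featurize each molecule as a graph, including edge (bond) features.
featurizer = MolGraphConvFeaturizer(use_edges=True)
# The loader returns the task names, the dataset splits, and the transformers applied.
dataset_dc = dc.molnet.load_qm9(featurizer=featurizer)
tasks, dataset, transformers = dataset_dc
train, valid, test = dataset
# Features, labels, weights, and identifiers of the training split.
x, y, w, ids = train.X, train.y, train.w, train.ids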
Note that the "w" matrix represents the weight of each sample. Some assays may have missing values, in which case the weight is 0. Otherwise, the weight is 1.
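To make the role of the weights concrete, here is a small sketch of a weighted mean squared error in which entries with weight 0 do not contribute. It illustrates the general idea only and is not DeepChem's own loss implementation; the function name weighted_mse and the toy values are hypothetical.

```python
def weighted_mse(y_true, y_pred, w):
    """Mean squared error over labeled entries; weight-0 entries are ignored."""
    total = 0.0
    weight_sum = 0.0
    for row_true, row_pred, row_w in zip(y_true, y_pred, w):
        for t, p, wt in zip(row_true, row_pred, row_w):
            total += wt * (t - p) ** 2
            weight_sum += wt
    # Guard against a batch with no labeled entries at all.
    return total / weight_sum if weight_sum else 0.0


# Hypothetical labels for three samples and two tasks.
# Where w is 0, the label is a placeholder for a missing measurement.
y = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_hat = [[1.5, 9.0], [9.0, 2.0], [3.0, 3.0]]

loss = weighted_mse(y, y_hat, w)
print(loss)
```

The large errors on the two placeholder entries (predictions of 9.0) have no effect on the result, because their weights are 0.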
Additionally, the environment variable DEEPCHEM_DATA_DIR can be set, for example with os.environ['DEEPCHEM_DATA_DIR'] = 'path/to/store/featurized/dataset'. When DEEPCHEM_DATA_DIR is set, the MolNet loader stores the featurized dataset in the specified directory. The next time the dataset is loaded, it is fetched from that directory directly rather than featurized from the raw data again.
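A minimal sketch of setting the variable from Python follows. The directory name deepchem_cache is a hypothetical choice, and the sketch assumes the variable is set before any loader is called in the same process.

```python
import os
import tempfile

# Hypothetical cache location; any writable directory works.
data_dir = os.path.join(tempfile.gettempdir(), "deepchem_cache")
os.makedirs(data_dir, exist_ok=True)

# Environment values must be strings, so the path is built or quoted as a string.
os.environ["DEEPCHEM_DATA_DIR"] = data_dir

# Loader calls made after this point (for example dc.molnet.load_qm9)
# are expected to store and reuse featurized data under this directory.
print(os.environ["DEEPCHEM_DATA_DIR"])
```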
deepchem.molnet.load_bace_classification
deepchem.molnet.load_bace_regression
deepchem.molnet.load_bbbc001
deepchem.molnet.load_bbbc002
deepchem.molnet.load_bbbc003
deepchem.molnet.load_bbbc004
deepchem.molnet.load_bbbc005
BBBP stands for Blood-Brain-Barrier Penetration.
deepchem.molnet.load_bbbp
deepchem.molnet.load_cell_counting
deepchem.molnet.load_chembl
deepchem.molnet.load_chembl25
deepchem.molnet.load_clearance
deepchem.molnet.load_clintox
deepchem.molnet.load_delaney
deepchem.molnet.load_factors
deepchem.molnet.load_freesolv
deepchem.molnet.load_hiv
HOPV stands for the Harvard Organic Photovoltaic Dataset.
deepchem.molnet.load_hopv
deepchem.molnet.load_hppb
deepchem.molnet.load_kaggle
deepchem.molnet.load_kinase
deepchem.molnet.load_lipo
Materials datasets include inorganic crystal structures, chemical compositions, and target properties like formation energies and band gaps. Machine learning problems in materials science commonly include predicting the value of a continuous (regression) or categorical (classification) property of a material based on its chemical composition or crystal structure. "Inverse design" is also of great interest, in which ML methods generate crystal structures that have a desired property. Other areas where ML is applicable in materials include discovering new or modified phenomenological models that describe material behavior.
deepchem.molnet.load_bandgap
deepchem.molnet.load_perovskite
deepchem.molnet.load_mp_formation_energy
deepchem.molnet.load_mp_metallicity
deepchem.molnet.load_muv
deepchem.molnet.load_nci
deepchem.molnet.load_pcba
deepchem.molnet.load_pdbbind
deepchem.molnet.load_ppb
deepchem.molnet.load_qm7
deepchem.molnet.load_qm8
deepchem.molnet.load_qm9
deepchem.molnet.load_sampl
deepchem.molnet.load_sider
deepchem.molnet.load_thermosol
deepchem.molnet.load_tox21
deepchem.molnet.load_toxcast
deepchem.molnet.load_uspto
deepchem.molnet.load_uv
deepchem.molnet.load_zinc15
deepchem.molnet.load_Platinum_Adsorption