MoleculeNet

The DeepChem library is packaged alongside the MoleculeNet suite of datasets. Finding a suitable dataset is one of the most important parts of a machine learning application. MoleculeNet curates a wide range of datasets and loads them into DeepChem dc.data.Dataset objects for convenience.

Contributing a new dataset to MoleculeNet

If you are proposing a new dataset to be included in the MoleculeNet benchmarking suite, please follow the instructions below. Please review the datasets already available in MolNet before contributing.

  1. Read the Contribution guidelines.
  2. Open an issue to discuss the dataset you want to add to MolNet.
  3. Write a DatasetLoader class that inherits from deepchem.molnet.load_function.molnet_loader._MolnetLoader and implements a create_dataset method. See the _QM9Loader for a simple example.
  4. Write a load_dataset function that documents the dataset and add your load function to deepchem.molnet.__init__.py for easy importing.
  5. Prepare your dataset as a .tar.gz or .zip file. Accepted filetypes include CSV, JSON, and SDF.
  6. Ask a member of the technical steering committee to add your .tar.gz or .zip file to the DeepChem AWS bucket. Modify your load function to pull down the dataset from AWS.
  7. Add documentation for your loader to the MoleculeNet docs.
  8. Submit a [WIP] PR (Work in progress pull request) following the PR template.
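
Step 5 can be sketched with the Python standard library alone. The example below writes a small CSV file and packs it into a .tar.gz archive. The helper name package_csv_dataset, the file name my_dataset, the column names, and the two rows are all hypothetical placeholders for illustration, not part of DeepChem or of any real dataset.

```python
import csv
import os
import tarfile
import tempfile


def package_csv_dataset(rows, header, out_dir, name):
    """Write rows to <name>.csv and pack that file into <name>.tar.gz."""
    csv_path = os.path.join(out_dir, name + ".csv")
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

    archive_path = os.path.join(out_dir, name + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps only the file name, so the archive has no local paths.
        archive.add(csv_path, arcname=os.path.basename(csv_path))
    return archive_path


# Placeholder rows (made-up values, for illustration only).
out_dir = tempfile.mkdtemp()
archive_path = package_csv_dataset(
    rows=[("CCO", 0.12), ("c1ccccc1", 1.30)],
    header=("smiles", "example_task"),
    out_dir=out_dir,
    name="my_dataset",
)
```

Storing the file under its base name means the archive unpacks to a single CSV in whatever directory the loader extracts it into.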

Example Usage

Below is an example of how to load a MoleculeNet dataset with a featurizer. The same approach works for any dataset in MoleculeNet by changing the load function and featurizer. For more details on the featurizers, see the Featurizers section.

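import deepchem as dc
from deepchem.feat.molecule_featurizers import MolGraphConvFeaturizer

# Featurize each molecule as a graph, including edge (bond) features.
featurizer = MolGraphConvFeaturizer(use_edges=True)

# The loader returns the task names, the dataset splits, and the transformers applied.
tasks, datasets, transformers = dc.molnet.load_qm9(featurizer=featurizer)
train, valid, test = datasets

# Features, labels, sample weights, and identifiers of the training split.
X, y, w, ids = train.X, train.y, train.w, train.ids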

Note that the "w" matrix represents the weight of each sample. Some assays may have missing values, in which case the weight is 0. Otherwise, the weight is 1.
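
To illustrate how a weight of 0 can exclude a missing value, here is a minimal sketch of a weighted mean squared error written in plain Python. The function masked_mse and the numbers are hypothetical and only illustrate the idea; this is not a description of how DeepChem computes its losses internally.

```python
def masked_mse(y_true, y_pred, w):
    """Weighted mean squared error; entries with weight 0 do not contribute."""
    total_weight = sum(w)
    if total_weight == 0:
        return 0.0
    weighted_errors = [
        wi * (yt - yp) ** 2 for yt, yp, wi in zip(y_true, y_pred, w)
    ]
    return sum(weighted_errors) / total_weight


# The second label is a placeholder for a missing measurement, so its weight is 0.
y_true = [1.0, 0.0, 3.0]
y_pred = [1.5, 9.0, 2.0]
w = [1.0, 0.0, 1.0]

loss = masked_mse(y_true, y_pred, w)  # (0.25 + 1.0) / 2 = 0.625
```

The large error on the second entry has no effect on the result because its weight is 0.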

Additionally, the environment variable DEEPCHEM_DATA_DIR can be set, for example with os.environ['DEEPCHEM_DATA_DIR'] = 'path/to/store/featurized/dataset'. When it is set, the MolNet loader stores the featurized dataset in that directory. The next time the dataset is loaded, it is read from that directory directly rather than featurized again from the raw data.
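
A minimal sketch of setting the variable from Python follows. The helper configure_data_dir and the directory name deepchem_cache are hypothetical choices for illustration. The sketch assumes the variable should be set before the loader is called.

```python
import os
import tempfile


def configure_data_dir(path):
    """Create path if needed and point DEEPCHEM_DATA_DIR at it."""
    os.makedirs(path, exist_ok=True)
    os.environ["DEEPCHEM_DATA_DIR"] = path
    return path


# Hypothetical cache location; any writable directory would do.
cache_dir = configure_data_dir(
    os.path.join(tempfile.gettempdir(), "deepchem_cache")
)

# A MoleculeNet loader called after this point, such as dc.molnet.load_qm9,
# is expected to store and reload its featurized data under cache_dir.
```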

BACE Dataset

deepchem.molnet.load_bace_classification

deepchem.molnet.load_bace_regression

BBBC Datasets

deepchem.molnet.load_bbbc001

deepchem.molnet.load_bbbc002

deepchem.molnet.load_bbbc003

deepchem.molnet.load_bbbc004

deepchem.molnet.load_bbbc005

BBBP Datasets

BBBP stands for Blood-Brain-Barrier Penetration.

deepchem.molnet.load_bbbp

Cell Counting Datasets

deepchem.molnet.load_cell_counting

Chembl Datasets

deepchem.molnet.load_chembl

Chembl25 Datasets

deepchem.molnet.load_chembl25

Clearance Datasets

deepchem.molnet.load_clearance

Clintox Datasets

deepchem.molnet.load_clintox

Delaney Datasets

deepchem.molnet.load_delaney

Factors Datasets

deepchem.molnet.load_factors

Freesolv Dataset

deepchem.molnet.load_freesolv

HIV Datasets

deepchem.molnet.load_hiv

HOPV Datasets

HOPV stands for the Harvard Organic Photovoltaic Dataset.

deepchem.molnet.load_hopv

HPPB Datasets

deepchem.molnet.load_hppb

KAGGLE Datasets

deepchem.molnet.load_kaggle

Kinase Datasets

deepchem.molnet.load_kinase

Lipo Datasets

deepchem.molnet.load_lipo

Materials Datasets

Materials datasets include inorganic crystal structures, chemical compositions, and target properties such as formation energies and band gaps. Machine learning problems in materials science commonly involve predicting a continuous (regression) or categorical (classification) property of a material from its chemical composition or crystal structure. "Inverse design", in which ML methods generate crystal structures that have a desired property, is also of great interest. Another area where ML is applicable in materials is the discovery of new or modified phenomenological models that describe material behavior.

deepchem.molnet.load_bandgap

deepchem.molnet.load_perovskite

deepchem.molnet.load_mp_formation_energy

deepchem.molnet.load_mp_metallicity

MUV Datasets

deepchem.molnet.load_muv

NCI Datasets

deepchem.molnet.load_nci

PCBA Datasets

deepchem.molnet.load_pcba

PDBBIND Datasets

deepchem.molnet.load_pdbbind

PPB Datasets

deepchem.molnet.load_ppb

QM7 Datasets

deepchem.molnet.load_qm7

QM8 Datasets

deepchem.molnet.load_qm8

QM9 Datasets

deepchem.molnet.load_qm9

SAMPL Datasets

deepchem.molnet.load_sampl

SIDER Datasets

deepchem.molnet.load_sider

Thermosol Datasets

deepchem.molnet.load_thermosol

Tox21 Datasets

deepchem.molnet.load_tox21

Toxcast Datasets

deepchem.molnet.load_toxcast

USPTO Datasets

deepchem.molnet.load_uspto

UV Datasets

deepchem.molnet.load_uv

ZINC15 Datasets

deepchem.molnet.load_zinc15

Platinum Adsorption Dataset

deepchem.molnet.load_Platinum_Adsorption