## Datasets

In [1]:
from chemprop.data.datasets import MoleculeDataset, ReactionDataset, MulticomponentDataset

To make a dataset, you first need a list of [datapoints](./datapoints.ipynb).

In [2]:
import numpy as np
from chemprop.data import MoleculeDatapoint, ReactionDatapoint

ys = np.random.rand(2, 1)

smis = ["C", "CC"]
mol_datapoints = [MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(smis, ys)]

rxn_smis = ["[H:2][O:1][H:3]>>[H:2][O:1].[H:3]", "[H:2][S:1][H:3]>>[H:2][S:1].[H:3]"]
rxn_datapoints = [
    ReactionDatapoint.from_smi(rxn_smi, y, keep_h=True) for rxn_smi, y in zip(rxn_smis, ys)
]

### Molecule Datasets

`MoleculeDataset`s are made from a list of `MoleculeDatapoint`s.

In [3]:
MoleculeDataset(mol_datapoints)

MoleculeDataset(data=[MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7fe613efd1c0>, y=array([0.1898137]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='C', V_f=None, E_f=None, V_d=None), MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7fe613efd2a0>, y=array([0.66960134]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='CC', V_f=None, E_f=None, V_d=None)], featurizer=SimpleMoleculeMolGraphFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7fe613f20fd0>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7fe613f211d0>))

### Dataset properties

The properties of datapoints are collated in a dataset.

In [4]:
dataset = MoleculeDataset(mol_datapoints)
print(dataset.Y)
print(dataset.names)

[[0.1898137 ]
 [0.66960134]]
['C', 'CC']


Datasets return a `Datum` when indexed. A `Datum` contains a `MolGraph` (see the [molgraph featurizer notebook](../featurizers/molgraph_molecule_featurizer.ipynb)), the extra atom-level and datapoint-level descriptors, the target(s), the weight, and the masks for bounded loss functions.

In [5]:
dataset[0]

Datum(mg=MolGraph(V=array([[0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        1.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        1.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 1.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
        0.     , 0.12011]], dtype=float32), E=array([], shape=(0, 14), dtype=float64), edge_index=array([], shape=(2, 0), dtype=int64), rev_edge_index=array([], dtype=int64)), V_d=None, x_d=None, y=array([0.1898137]), weight=1.0, lt_mask=None, gt_mask=None)
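To make the field list concrete, the sketch below builds a stand-in record with the same field names that appear in the printed `Datum` above. `DatumSketch` is a hypothetical class written for illustration, not Chemprop's `Datum`; it only assumes that fields are read by name, as the repr suggests.

```python
from typing import Any, NamedTuple, Optional


class DatumSketch(NamedTuple):
    """Hypothetical stand-in using the field names from the Datum repr."""

    mg: Any                          # featurized graph (a MolGraph in Chemprop)
    V_d: Optional[Any] = None        # extra atom-level descriptors
    x_d: Optional[Any] = None        # extra datapoint-level descriptors
    y: Optional[list] = None         # target value(s)
    weight: float = 1.0              # datapoint weight
    lt_mask: Optional[Any] = None    # mask for "less than" bounded targets
    gt_mask: Optional[Any] = None    # mask for "greater than" bounded targets


# A placeholder string stands in for the real graph object.
datum = DatumSketch(mg="<graph placeholder>", y=[0.1898137])

# Fields are read by name; the masks are None unless bounded targets are used.
has_bounds = datum.lt_mask is not None or datum.gt_mask is not None
```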

### Caching

By default, the `MolGraph`s are generated on demand. For small to medium datasets (exact sizes not yet benchmarked), it is more efficient to generate the `MolGraph`s once and cache them when the dataset is created.

To recreate the cache, set `dataset.cache = True` again. To clear the cache, set `dataset.cache = False`.

We recommend [scaling](../scaling.ipynb) additional atom and bond features before enabling the cache. Scaling them afterwards forces the cache to be recreated; this happens automatically.

In [6]:
dataset.cache = True  # Generate the molgraphs and cache them
dataset.cache = True  # Recreate the cache
dataset.cache = False  # Clear the cache

dataset.cache = True  # Cache created with unscaled extra bond features
dataset.normalize_inputs(key="E_f")  # Cache recreated automatically with scaled extra bond features

### Datasets with custom featurizers

Datasets use a molgraph featurizer to create the `MolGraph`s from the `rdkit.Chem.Mol` objects in datapoints. A basic `SimpleMoleculeMolGraphFeaturizer` is the default featurizer for `MoleculeDataset`s. If you are using a [custom molgraph featurizer](../featurizers/molgraph_molecule_featurizer.ipynb), pass it as an argument when creating the dataset.

In [7]:
from chemprop.featurizers import SimpleMoleculeMolGraphFeaturizer, MultiHotAtomFeaturizer

mol_featurizer = SimpleMoleculeMolGraphFeaturizer(atom_featurizer=MultiHotAtomFeaturizer.v1())
MoleculeDataset(mol_datapoints, featurizer=mol_featurizer)

MoleculeDataset(data=[MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7fe613efd1c0>, y=array([0.1898137]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='C', V_f=None, E_f=None, V_d=None), MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7fe613efd2a0>, y=array([0.66960134]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='CC', V_f=None, E_f=None, V_d=None)], featurizer=SimpleMoleculeMolGraphFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7fe613f31610>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7fe613f31650>))

### Reaction Datasets

Reaction datasets are the same as molecule datasets, except they are made from a list of `ReactionDatapoint`s and `CondensedGraphOfReactionFeaturizer` is the default featurizer. [CGRs](../featurizers/molgraph_reaction_featurizer.ipynb) are also `MolGraph`s.

In [8]:
ReactionDataset(rxn_datapoints).featurizer

CondensedGraphOfReactionFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7fe613f31250>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7fe613f31990>)

### Multicomponent datasets

`MulticomponentDataset` is for datasets whose target values depend on multiple components. It is composed of parallel `MoleculeDataset`s and `ReactionDataset`s.

In [9]:
mol_dataset = MoleculeDataset(mol_datapoints)
rxn_dataset = ReactionDataset(rxn_datapoints)

# e.g. reaction in solvent
multi_dataset = MulticomponentDataset(datasets=[mol_dataset, rxn_dataset])

# e.g. solubility
MulticomponentDataset(datasets=[mol_dataset, mol_dataset])

<chemprop.data.datasets.MulticomponentDataset at 0x7fe613f32d10>

A `MulticomponentDataset` collates dataset properties (e.g. SMILES) of each dataset. It does not collate datapoint-level properties such as target values and extra datapoint descriptors. Chemprop models automatically take those from **the first dataset** in `datasets`.

In [10]:
multi_dataset.smiles

[('C', ('[O:1]([H:2])[H:3]', '[H:3].[O:1][H:2]')),
 ('CC', ('[S:1]([H:2])[H:3]', '[H:3].[S:1][H:2]'))]

In [11]:
multi_dataset.datasets[0].Y

array([[0.1898137 ],
       [0.66960134]])
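The "parallel datasets, targets from the first one" convention can be sketched with plain Python. The classes below are hypothetical stand-ins, not Chemprop's `MulticomponentDataset`; the equal-length check and the helper `first_component_targets` are assumptions of this sketch.

```python
from typing import NamedTuple, Sequence


class ToyDatapoint(NamedTuple):
    """Hypothetical datapoint: an identifier plus a target value."""

    name: str
    y: float


class ToyMulticomponentDataset:
    """Toy container of parallel component datasets."""

    def __init__(self, datasets: Sequence[Sequence[ToyDatapoint]]):
        sizes = [len(d) for d in datasets]
        if len(set(sizes)) > 1:
            raise ValueError(f"Component datasets must have equal lengths, got {sizes}")
        self.datasets = list(datasets)

    def __len__(self) -> int:
        return len(self.datasets[0])

    def __getitem__(self, idx: int) -> list:
        # One entry per component, all describing the same sample.
        return [d[idx] for d in self.datasets]

    @property
    def names(self) -> list:
        # Identifiers are collated across components, one tuple per sample.
        return [tuple(dp.name for dp in self[i]) for i in range(len(self))]

    def first_component_targets(self) -> list:
        # Targets are read from the first component only.
        return [dp.y for dp in self.datasets[0]]


solutes = [ToyDatapoint("C", 0.19), ToyDatapoint("CC", 0.67)]
solvents = [ToyDatapoint("O", 0.0), ToyDatapoint("CO", 0.0)]
toy_multi = ToyMulticomponentDataset([solutes, solvents])
```

Here `toy_multi.names` pairs each solute with its solvent, while `toy_multi.first_component_targets()` ignores the targets stored on the solvents. The practical consequence is that the component carrying the real targets should come first.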