## Datasets

In [None]:
from chemprop.data.datasets import MoleculeDataset, PolymerDataset, ReactionDataset, MulticomponentDataset

To make a dataset you first need a list of [datapoints](./datapoints.ipynb).

In [None]:
import numpy as np
from chemprop.data import MoleculeDatapoint, LazyMoleculeDatapoint, PolymerDatapoint, ReactionDatapoint

ys = np.random.rand(2, 1)

smis = ["C", "CC"]
mol_datapoints = [MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(smis, ys)]

rxn_smis = ["[H:2][O:1][H:3]>>[H:2][O:1].[H:3]", "[H:2][S:1][H:3]>>[H:2][S:1].[H:3]"]
rxn_datapoints = [
    ReactionDatapoint.from_smi(rxn_smi, y, keep_h=True) for rxn_smi, y in zip(rxn_smis, ys)
]

polymer_smis = ["[*:1]c1cc(F)c([*:2])cc1F.[*:3]c1c(O)cc(O)c([*:4])c1O|0.5|0.5|<1-3:0.5:0.5<1-4:0.5:0.5<2-3:0.5:0.5<2-4:0.5:0.5",
                "[*:1]c1cc(F)c([*:2])cc1F.[*:3]c1c(O)cc(O)c([*:4])c1O|0.5|0.5|<1-2:0.375:0.375<1-1:0.375:0.375<2-2:0.375:0.375<3-4:0.375:0.375<3-3:0.375:0.375<4-4:0.125:0.125<1-3:0.125:0.125<1-4:0.125:0.125<2-3:0.125:0.125<2-4:0.125:0.125"
                ]
polymer_datapoints = [PolymerDatapoint.from_smi(poly_smi, y) for poly_smi, y in zip(polymer_smis, ys)]

### Molecule Datasets

`MoleculeDataset`s are made from a list of `MoleculeDatapoint`s.

In [None]:
MoleculeDataset(mol_datapoints)

### Dataset properties

The properties of datapoints are collated in a dataset.

In [None]:
dataset = MoleculeDataset(mol_datapoints)
print(dataset.Y)
print(dataset.names)

Datasets return a `Datum` when indexed. A `Datum` contains a `MolGraph` (see the [molgraph featurizer notebook](../featurizers/molgraph_molecule_featurizer.ipynb)), the extra atom-level and datapoint-level descriptors, the target(s), the weights, and masks for bounded loss functions.

In [None]:
dataset[0]

### Caching

By default, the `MolGraph`s are generated as needed. For small to medium datasets (exact sizes not yet benchmarked), it is more efficient to generate and cache the `MolGraph`s once, up front.

To recreate the cache, set `cache` to `True` again. To clear the cache, set `cache` to `False`.

We recommend [scaling](../scaling.ipynb) additional atom and bond features before enabling the cache. Scaling them after caching forces the cache to be recreated. The recreation happens automatically, but it repeats the featurization.

In [None]:
dataset.cache = True  # Generate the molgraphs and cache them
dataset.cache = True  # Recreate the cache
dataset.cache = False  # Clear the cache

dataset.cache = True  # Cache created with unscaled extra bond features
dataset.normalize_inputs(key="E_f")  # Cache recreated automatically with scaled extra bond features
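
The cache semantics described in this section can be pictured with a small toy model. This is only a sketch and not Chemprop's implementation: `ToyCachedDataset`, its `calls` counter, and the use of `len` as a stand-in featurizer are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, Optional


class ToyCachedDataset:
    """Toy stand-in (hypothetical) for a dataset that featurizes lazily or from a cache."""

    def __init__(self, items: List[str], featurize: Callable[[str], int]):
        self.items = items
        self.featurize = featurize
        self.calls = 0  # counts how often the featurizer actually runs
        self._cache: Optional[List[int]] = None

    def _featurize(self, item: str) -> int:
        self.calls += 1
        return self.featurize(item)

    @property
    def cache(self) -> bool:
        return self._cache is not None

    @cache.setter
    def cache(self, enabled: bool) -> None:
        # Setting True always rebuilds the cache from scratch; False discards it.
        self._cache = [self._featurize(x) for x in self.items] if enabled else None

    def __getitem__(self, idx: int) -> int:
        if self._cache is not None:
            return self._cache[idx]
        return self._featurize(self.items[idx])


toy = ToyCachedDataset(["C", "CC", "CCC"], featurize=len)

toy[0]
toy[0]  # lazy access: the featurizer runs on every lookup
lazy_calls = toy.calls

toy.cache = True  # featurizes all three items once
toy[0]
toy[0]  # served from the cache, so no new featurizer calls
cached_calls = toy.calls

toy.cache = False  # clear the cache; access is lazy again
```

In this toy model, repeated lazy access pays the featurization cost each time, while the cached path pays it once per item when the cache is built. That is the trade-off that makes caching attractive when the whole dataset fits comfortably in memory.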

### CuikmolmakerDataset (available with `cuik-molmaker` only)

This dataset uses `cuik-molmaker` to construct and featurize a whole batch of molecules at once instead of one molecule at a time.

In [None]:
from chemprop.utils.utils import is_cuikmolmaker_available
print(f"cuik-molmaker available: {is_cuikmolmaker_available()}")

In [None]:
if is_cuikmolmaker_available():
    from chemprop.data.datasets import CuikmolmakerDataset
    import pandas as pd

    smi_df = pd.read_csv("../../../../../tests/data/smis.csv")

    lazy_mol_datapoints = [LazyMoleculeDatapoint(smi) for smi in smi_df["smiles"]]
    cuik_dataset = CuikmolmakerDataset(lazy_mol_datapoints)
    print(len(cuik_dataset))  # a bare expression inside an `if` block is not displayed

In [None]:
# CuikmolmakerDataset implements `__getitems__` instead of `__getitem__`, which enables batched featurization and access.
if is_cuikmolmaker_available():
    cuik_dataset.__getitems__([1, 2, 12, 34])
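
To illustrate why a batched `__getitems__` method is useful, here is a minimal pure-Python sketch. It does not use `cuik-molmaker` or Chemprop: `ToyBatchedDataset`, `fetch`, and the string-length "featurizer" are hypothetical stand-ins. The `fetch` helper mimics, in simplified form, how a batch loader can prefer `__getitems__` when a dataset provides it (to our understanding, recent PyTorch `DataLoader` versions perform a similar check).

```python
from typing import Any, List, Sequence


class ToyBatchedDataset:
    """Toy dataset (hypothetical) that featurizes a whole batch in one call."""

    def __init__(self, smiles: Sequence[str]):
        self.smiles = list(smiles)
        self.featurizer_calls = 0  # number of (batched) featurizer invocations

    def __len__(self) -> int:
        return len(self.smiles)

    def _featurize_batch(self, batch: List[str]) -> List[int]:
        # Stand-in for a batched featurizer: one call handles every molecule.
        self.featurizer_calls += 1
        return [len(smi) for smi in batch]

    def __getitems__(self, indices: Sequence[int]) -> List[int]:
        return self._featurize_batch([self.smiles[i] for i in indices])


def fetch(dataset: Any, indices: Sequence[int]) -> List[Any]:
    """Use batched access when available, else fall back to per-item access."""
    if hasattr(dataset, "__getitems__"):
        return dataset.__getitems__(indices)
    return [dataset[i] for i in indices]


toy = ToyBatchedDataset(["C", "CC", "CCO", "CCCC"])
batch = fetch(toy, [0, 2, 3])  # a single featurizer call for three items

# A plain list has no `__getitems__`, so `fetch` indexes it item by item.
fallback = fetch(["a", "b", "c"], [2, 0])
```

The design point is that the dataset sees all requested indices at once, so any per-call overhead in the featurizer is paid once per batch rather than once per molecule.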

### Datasets with custom featurizers

Datasets use a molgraph featurizer to create the `MolGraph`s from the `rdkit.Chem.Mol` objects in datapoints. The default featurizer for `MoleculeDataset`s is a basic `SimpleMoleculeMolGraphFeaturizer`. If you are using a [custom molgraph featurizer](../featurizers/molgraph_molecule_featurizer.ipynb), pass it as an argument when creating the dataset.

In [None]:
from chemprop.featurizers import SimpleMoleculeMolGraphFeaturizer, MultiHotAtomFeaturizer

mol_featurizer = SimpleMoleculeMolGraphFeaturizer(atom_featurizer=MultiHotAtomFeaturizer.v1())
MoleculeDataset(mol_datapoints, featurizer=mol_featurizer)

### Reaction Datasets

Reaction datasets are the same as molecule datasets, except they are made from a list of `ReactionDatapoint`s and `CondensedGraphOfReactionFeaturizer` is the default featurizer. [CGRs](../featurizers/molgraph_reaction_featurizer.ipynb) are also `MolGraph`s.

In [None]:
ReactionDataset(rxn_datapoints).featurizer

### Polymer Datasets

Polymer datasets are the same as molecule datasets, except they are made from a list of `PolymerDatapoint`s and `PolymerMolGraphFeaturizer` is the default featurizer. Polymers are `WeightedMolGraph`s, which include additional atom and bond weight information.

In [None]:
PolymerDataset(polymer_datapoints).featurizer

### Multicomponent datasets

`MulticomponentDataset` is for datasets whose target values depend on multiple components. It is composed of parallel `MoleculeDataset`s and `ReactionDataset`s.

In [None]:
mol_dataset = MoleculeDataset(mol_datapoints)
rxn_dataset = ReactionDataset(rxn_datapoints)

# e.g. reaction in solvent
multi_dataset = MulticomponentDataset(datasets=[mol_dataset, rxn_dataset])

# e.g. solubility
MulticomponentDataset(datasets=[mol_dataset, mol_dataset])

A `MulticomponentDataset` collates the dataset properties (e.g. SMILES) of each component dataset. It does not collate datapoint-level properties such as target values and extra datapoint descriptors. Chemprop models automatically take those from **the first dataset** in `datasets`.

In [None]:
multi_dataset.smiles

In [None]:
multi_dataset.datasets[0].Y
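
The "parallel datasets" idea can be sketched without Chemprop. The classes below are hypothetical toys, not the library's implementation: `ToyDataset`, `ToyMulticomponentDataset`, and `training_targets` are illustrative names, and the SMILES strings and target values are made up. The sketch assumes that component datasets are aligned by index and that targets are read from the first component only, as described above.

```python
from typing import List, Sequence, Tuple


class ToyDataset:
    """Toy component dataset (hypothetical): SMILES strings plus targets."""

    def __init__(self, smiles: Sequence[str], targets: Sequence[float]):
        self.smiles = list(smiles)
        self.targets = list(targets)

    def __len__(self) -> int:
        return len(self.smiles)


class ToyMulticomponentDataset:
    """Toy container (hypothetical) of index-aligned component datasets."""

    def __init__(self, datasets: Sequence[ToyDataset]):
        if len({len(d) for d in datasets}) != 1:
            raise ValueError("all component datasets must have the same length")
        self.datasets = list(datasets)

    def __len__(self) -> int:
        return len(self.datasets[0])

    @property
    def smiles(self) -> List[Tuple[str, ...]]:
        # Collate per-component SMILES into one tuple per datapoint.
        return list(zip(*(d.smiles for d in self.datasets)))


def training_targets(multi: ToyMulticomponentDataset) -> List[float]:
    # Targets are taken from the first component only.
    return multi.datasets[0].targets


solvents = ToyDataset(["O", "CCO"], targets=[0.1, 0.2])
reactions = ToyDataset(["CO>>C=O", "CC>>C=C"], targets=[9.0, 9.0])
multi = ToyMulticomponentDataset([solvents, reactions])

collated_smiles = multi.smiles
targets = training_targets(multi)  # the targets of `reactions` are ignored
```

A practical consequence of this layout is that the order of `datasets` matters: the component whose targets should be used for training has to come first.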