# modelforge.curate : Basic Usage

This notebook demonstrates basic usage of the `curate` module in modelforge, which was developed to make it easier to create datasets with a uniform structure that is compatible with modelforge.

In [3]:
from modelforge.curate.curate import SourceDataset, AtomicNumbers, Positions, Energies, Forces, MetaData

from openff.units import unit

import numpy as np

### Set up a new dataset
To start, we will create a new instance of the `SourceDataset` class to store the dataset.

In [4]:
new_dataset = SourceDataset("test_dataset")

### Add a record
Add a new record to the dataset, giving it a unique name (as a string).  The name provided will be used for adding/fetching properties to/from the record.

In [5]:
new_dataset.add_record('mol1')

### Define properties
Each record must include a few basic elements to be considered complete, namely:
- atomic numbers
- positions
- energies
  
Records may of course contain other properties/metadata, but this is the minimal set of information used in modelforge during training. The curate package provides pydantic models for these and other common properties that appear in datasets.

#### Defining atomic numbers
Let us first start by considering how to initialize atomic numbers, in this case for an example CH molecule:

In [6]:
atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))

The array should have shape (n_atoms, 1).  An error will be raised if `len(value.shape) != 2` or `value.shape[1] != 1`.

`AtomicNumbers` can accept either a numpy array or a python list as input (it will be converted to a numpy array internally). The following syntax will produce an equivalent instance:

In [7]:
atomic_numbers = AtomicNumbers(value=[[1], [6]])
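Source data often stores atomic numbers as a flat sequence such as `[1, 6]`. The sketch below is a dependency-free illustration of reshaping such a sequence to `(n_atoms, 1)` and of the shape rule described above. The helper names `nested_shape`, `as_column`, and `has_atomic_number_shape` are hypothetical and not part of modelforge; with NumPy, the same reshape is `np.asarray(flat).reshape(-1, 1)`.

```python
def nested_shape(value):
    """Return the shape of a regular nested list, e.g. [[1], [6]] -> (2, 1)."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def as_column(flat):
    """Reshape a flat sequence [z1, z2, ...] into [[z1], [z2], ...]."""
    return [[z] for z in flat]


def has_atomic_number_shape(value):
    """Mirror the documented rule: two dimensions, the second of length 1."""
    shape = nested_shape(value)
    return len(shape) == 2 and shape[1] == 1


column = as_column([1, 6])
print(column, nested_shape(column))      # [[1], [6]] (2, 1)
print(has_atomic_number_shape([1, 6]))   # False: the shape is (2,)
print(has_atomic_number_shape(column))   # True
```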

#### Defining positions

To define positions, we will use the `Positions` pydantic model.  Since positions have units associated with them, the units must also be set at the time of initialization.

Units can be passed as an openff.units `Unit` or a string that can be understood by openff.units. An error will be raised if units are not defined. 

Positions are a per_atom property and thus must be a 3d array with shape (n_configs, n_atoms, 3).
An error will be raised if `len(value.shape) != 3` or `value.shape[2] != 3`.



In [8]:
positions = Positions(
    value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]), 
    units="nanometer"
)

#### Defining energies 
To define energies, we will use the `Energies` pydantic model; as with positions, units must also be set.  

Note that energy is a per_system property, so the input array must have shape (n_configs, 1); an error will be raised if `len(value.shape) != 2` or `value.shape[1] != 1`.

In [9]:
energies = Energies(
    value=np.array([[0.1]]), 
    units=unit.hartree
)
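When raw data arrives as one entry per configuration, the arrays need to be assembled into the shapes described above. The following is a minimal sketch using plain nested lists; the `conformers` data and variable names are made up for illustration. The resulting lists could then be wrapped with `np.array` and passed as `value`.

```python
# Hypothetical raw data: one entry per configuration of the same 2-atom molecule.
conformers = [
    {"coords": [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]], "energy": 0.10},
    {"coords": [[1.1, 1.0, 1.0], [2.1, 2.0, 2.0]], "energy": 0.12},
]

# Positions: (n_configs, n_atoms, 3), one (n_atoms, 3) block per configuration.
positions_value = [c["coords"] for c in conformers]

# Energies: (n_configs, 1), one single-element row per configuration.
energies_value = [[c["energy"]] for c in conformers]

n_configs = len(positions_value)
n_atoms = len(positions_value[0])
print((n_configs, n_atoms, len(positions_value[0][0])))  # (2, 2, 3)
print((len(energies_value), len(energies_value[0])))     # (2, 1)
```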

#### Other properties

Pydantic models have also been defined for other common properties:
- `Forces`
- `PartialCharges`
- `TotalCharge`
- `DipoleMoment`
- `QuadrupoleMoment`
- `Polarizability`
- `MetaData`

Note that each of these models inherits from a more general `RecordProperty` pydantic model; this model can be used to define any additional properties, but requires the user to provide the classification (e.g., per_atom, per_system) and the type (for the purposes of unit conversion, e.g., length, energy, force, charge, etc.). This will be discussed separately.

### Add properties to a record

Having defined properties, we can now add them to the record. Properties can be added individually:

In [10]:
new_dataset.add_property(
    record_name="mol1", 
    property=atomic_numbers
)

Properties may also be added to the record as a list:

In [11]:
new_dataset.add_properties(
    record_name="mol1", 
    properties=[positions, energies]
)

By default when instantiating a new `SourceDataset` instance, `append_property = False`.
If `append_property == False`, an error will be raised if you try to add a property with the same name more than once to the same record. This ensures we do not accidentally overwrite data in a record.

The following will produce a `ValueError` because atomic numbers have already been set for the record:

In [12]:
new_dataset.add_property(
    record_name="mol1", 
    property=atomic_numbers
)

ValueError: Atomic numbers already set for record mol1

### Viewing a record
We can easily view an individual record using the `get_record` function in the `SourceDataset` class.

In [13]:
mol1_record = new_dataset.get_record(record_name="mol1")
print(mol1_record)

name: mol1
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties (['energies']):
 -  name='energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: ([])



The record can also be exported to a dict. 

In [14]:
mol1_record.to_dict()

{'name': 'mol1',
 'n_atoms': 2,
 'n_configs': 1,
 'atomic_numbers': AtomicNumbers(name='atomic_numbers', value=array([[1],
        [6]]), units=<Unit('dimensionless')>, classification='atomic_numbers', property_type='atomic_numbers', n_configs=None, n_atoms=2),
 'per_atom': {'positions': Positions(name='positions', value=array([[[1., 1., 1.],
          [2., 2., 2.]]]), units=<Unit('nanometer')>, classification='per_atom', property_type='length', n_configs=1, n_atoms=2)},
 'per_system': {'energies': Energies(name='energies', value=array([[0.1]]), units=<Unit('hartree')>, classification='per_system', property_type='energy', n_configs=1, n_atoms=None)},
 'meta_data': {}}

### Validating a record
Within these readouts, we see `n_atoms` and `n_configs` reported.  `n_atoms` is calculated from the dimensions of the atomic numbers; validation is then performed for all per_atom properties to ensure that all properties in the record have the same number of atoms.  Similarly, we validate that all per_system and per_atom properties have the same value of `n_configs` (determined by the first index of the shape of their arrays).  If the values were inconsistent, a descriptive message reporting the shape of each of the arrays would be provided. 

This validation can be triggered manually:

In [15]:
mol1_record.validate()

True
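The consistency rules described above can be sketched without any dependencies. This is only an illustration of the idea using nested lists, not the modelforge implementation; `nested_shape` and `find_inconsistencies` are hypothetical names.

```python
def nested_shape(value):
    """Return the shape of a regular nested list, e.g. [[0.1]] -> (1, 1)."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def find_inconsistencies(atomic_numbers, per_atom, per_system):
    """Return a list of messages; an empty list means the record is consistent."""
    problems = []
    n_atoms = nested_shape(atomic_numbers)[0]
    shapes = {name: nested_shape(v) for name, v in {**per_atom, **per_system}.items()}

    # Every per_atom array has shape (n_configs, n_atoms, ...).
    for name in per_atom:
        if shapes[name][1] != n_atoms:
            problems.append(f"{name} has shape {shapes[name]}, expected {n_atoms} atoms")

    # The first axis of every array is n_configs and must agree across properties.
    if len({shape[0] for shape in shapes.values()}) > 1:
        problems.append(f"n_configs differs between properties: {shapes}")
    return problems


atomic_numbers = [[1], [6]]
positions = [[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]  # shape (1, 2, 3)

ok = find_inconsistencies(atomic_numbers, {"positions": positions}, {"energies": [[0.1]]})
print(ok)  # []

# Two energies but only one set of positions: n_configs is inconsistent.
bad = find_inconsistencies(atomic_numbers, {"positions": positions}, {"energies": [[0.1], [0.2]]})
print(bad)
```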

More complete validation can be performed at the dataset level. This validation includes checking for each record that:

- the number of atoms is consistent
- the number of configurations is consistent
- the units are valid (e.g., the unit provided for Positions is a length)
- at a minimum, atomic numbers, positions, and energies have been defined

This can be done for individual records or on the entire dataset:

In [16]:
new_dataset.validate_record("mol1")

True

In [17]:
new_dataset.validate_records()

100%|██████████| 1/1 [00:00<00:00, 4211.15it/s]
2025-01-13 10:17:32.144 | INFO     | modelforge.curate.curate:validate_records:890 - All records validated successfully.


True

### Saving to an HDF5 file

To save this dataset to an HDF5 file, we call the `to_hdf5` function of the `SourceDataset` class, passing the output path and filename. This will automatically perform the validation discussed above before we write to the file.

Additionally, when writing the file, it will convert records to a consistent unit system (by default, kilojoules_per_mole and nanometers are the base unit system for energy and distance).

In [18]:
new_dataset.to_hdf5(file_path="./", file_name="test_dataset.hdf5")

2025-01-13 10:17:33.000 | INFO     | modelforge.curate.curate:to_hdf5:929 - Validating records
100%|██████████| 1/1 [00:00<00:00, 1855.07it/s]
2025-01-13 10:17:33.003 | INFO     | modelforge.curate.curate:validate_records:890 - All records validated successfully.
2025-01-13 10:17:33.003 | INFO     | modelforge.curate.curate:to_hdf5:932 - Writing records to HDF5 file
100%|██████████

## Defining multiple properties of the same type

In the examples above, we did not provide a name for any of the properties we defined, instead using the default name in the pydantic model.  In some cases, we may wish to change this name, e.g., to match the name used in the original dataset or to be able to define multiple different energies (e.g., different contributions to the total energy, or energies calculated with different levels of theory).

For example, let us consider that our dataset also includes a separate entry for the dispersion contribution and we wish to store this with the name 'energies_disp'.  We will add this to record "mol1" in the first dataset we defined.

In [19]:
disp_energy = Energies(name='energies_disp', value=np.array([[0.03]]), units=unit.hartree)

new_dataset.add_property(
    record_name="mol1", 
    property=disp_energy
)

In [20]:
new_dataset.get_record('mol1')

name: mol1
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties (['energies', 'energies_disp']):
 -  name='energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
 -  name='energies_disp' value=array([[0.03]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: ([])

Note that the validation that checks whether energies are included looks only at the type of the property, not at its name: it checks that an instance of the `Energies` pydantic model has been added, rather than searching for a specific string in the name that has been provided.
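The difference between checking by type and checking by name can be illustrated with a small stand-in. The classes `ToyProperty` and `ToyEnergies` and the function `has_energies` below are hypothetical and only mimic the idea; they are not the modelforge models.

```python
from dataclasses import dataclass


@dataclass
class ToyProperty:
    """Stand-in for a generic property model; only the name is kept."""
    name: str


@dataclass
class ToyEnergies(ToyProperty):
    """Stand-in for an energies model whose default name can be overridden."""
    name: str = "energies"


def has_energies(properties):
    """True if any property is an energies instance, whatever it is called."""
    return any(isinstance(p, ToyEnergies) for p in properties)


# A renamed energies property still satisfies the check ...
print(has_energies([ToyEnergies(name="energies_disp")]))  # True
# ... while a generic property that merely uses the name does not.
print(has_energies([ToyProperty(name="energies")]))       # False
```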

## Creating an appendable dataset

In some cases, we may not have data for all configurations saved within the same array (e.g., when fetching data from qcarchive). To support these cases, we can initialize our instance of `SourceDataset` with `append_property=True`.  Then, rather than raising an error if the same property is added twice to a record, the data will instead be appended to the existing array.

In [21]:
appendable_dataset = SourceDataset(dataset_name="appendable", append_property=True)

For simplicity, let us reuse the properties already defined above and add these to the new dataset.  Note that if we do not call `add_record` first, the record will be created automatically if it does not exist.

In [22]:
appendable_dataset.add_properties("mol2", [energies, atomic_numbers, positions])

2025-01-13 10:17:35.979 | INFO     | modelforge.curate.curate:add_property:582 - Record with name mol2 does not exist in the dataset. Creating it now.


Let us now examine the contents of the record, where we see that we only have a single configuration. 

In [23]:
appendable_dataset.get_record('mol2')

name: mol2
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties (['energies']):
 -  name='energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: ([])

If we add the energies and positions a second time, we should see that we now have 2 configurations.

In [24]:
appendable_dataset.add_properties("mol2", [energies, positions])

In [25]:
appendable_dataset.get_record('mol2')

name: mol2
* n_atoms: 2
* n_configs: 2
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]],

       [[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
* per-system properties (['energies']):
 -  name='energies' value=array([[0.1],
       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: ([])

If we try to append a property that does not have the same number of atoms, an error will be raised. For example, below we try to append positions for a configuration with 3 atoms, not 2. 

In [26]:
positions2 = Positions(value=[[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]], units="angstrom")
appendable_dataset.add_property("mol2",  positions2)

AssertionError: 
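The append behavior can be thought of as concatenation along the configuration axis, guarded by a check on the number of atoms. The sketch below illustrates this with nested lists; `append_configs` is a hypothetical helper, not the modelforge implementation, and it raises a `ValueError` with a message where the notebook output above shows an `AssertionError`.

```python
def append_configs(existing, new, append_property=True, name="property"):
    """Append the configurations in `new` to `existing` (both nested lists).

    Both arguments have a leading n_configs axis; for per_atom data such as
    positions, with shape (n_configs, n_atoms, 3), the second axis is n_atoms.
    """
    if not append_property:
        # Mimics the default mode, where adding a property twice is an error.
        raise ValueError(f"{name} already set for record")
    if len(new[0]) != len(existing[0]):
        raise ValueError(
            f"{name}: cannot append {len(new[0])} atoms to {len(existing[0])} atoms"
        )
    return existing + new  # concatenate along the n_configs axis


positions = [[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]  # shape (1, 2, 3)
positions = append_configs(positions, positions, name="positions")
print(len(positions))  # 2

three_atoms = [[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]]
try:
    append_configs(positions, three_atoms, name="positions")
except ValueError as err:
    print(err)  # positions: cannot append 3 atoms to 2 atoms
```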

We previously defined energies in hartree and positions in nanometers. If we now add properties defined in different, yet compatible, units, the values will be automatically converted to the existing units before appending.

In [27]:
energies2 = Energies(value=np.array([[0.1]]), units=unit.kilojoules_per_mole)
positions2 = Positions(value=[[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]], units="angstrom")
appendable_dataset.add_properties("mol2", [energies2, positions2])

If we now print the contents, we can see that we have 3 configurations. The final values in the positions array are a factor of 0.1 smaller, since the existing unit was nanometers and the new values were defined in angstrom; similarly, the final value in energies has been converted from kilojoules per mole to hartree to match the previously defined unit.

In [28]:
appendable_dataset.get_record('mol2')

name: mol2
* n_atoms: 2
* n_configs: 3
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties (['positions']):
 -  name='positions' value=array([[[1. , 1. , 1. ],
        [2. , 2. , 2. ]],

       [[1. , 1. , 1. ],
        [2. , 2. , 2. ]],

       [[0.1, 0.1, 0.1],
        [0.2, 0.2, 0.2]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=3 n_atoms=2
* per-system properties (['energies']):
 -  name='energies' value=array([[1.00000000e-01],
       [1.00000000e-01],
       [3.80879885e-05]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=3 n_atoms=None
* meta_data: ([])

## Unit system
When setting properties for a record, users can specify any unit that is:
- (1) supported by openff.units
- (2) compatible with the parameter type (i.e., Positions expect a unit of length).

Bullet 2 is assessed by comparing to the default values in the `UnitSystem` class (note we are not making any unit conversions at the point of initializing a record, just checking for compatibility). 

The following will fail validation because we expect, e.g., positions to be defined in distance units. 

In [29]:
pos = Positions(value=[[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]], units=unit.angstrom*unit.angstrom)


ValidationError: 1 validation error for Positions
  Value error, Unit angstrom ** 2 of positions are not compatible with the property type length.
 [type=value_error, input_value={'value': [[[1.0, 1.0, 1....<Unit('angstrom ** 2')>}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/value_error
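The idea of a compatibility check without conversion can be sketched by comparing dimensions only. The table of units, the expected dimensions, and the function `is_compatible` below are toy assumptions for illustration; in modelforge the check is done against the `UnitSystem` defaults using openff.units.

```python
# Toy table: unit name -> exponents of base dimensions (assumed for this sketch).
DIMENSIONS = {
    "nanometer": {"length": 1},
    "angstrom": {"length": 1},
    "angstrom ** 2": {"length": 2},
}

# Expected dimensions for each property type (only length is shown here).
EXPECTED = {"length": {"length": 1}}


def is_compatible(unit_name, property_type):
    """Compare dimensions only; no values are converted at this point."""
    return DIMENSIONS[unit_name] == EXPECTED[property_type]


print(is_compatible("angstrom", "length"))       # True
print(is_compatible("angstrom ** 2", "length"))  # False: an area, not a length
```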

When defining a `SourceDataset` instance, the default unit system is used unless otherwise specified. This unit system is defined in the class `UnitSystem`, which provides default units for the predefined property pydantic models.  Values will be converted to the specified unit system when writing to an HDF5 file.

In [30]:
from modelforge.curate.curate import UnitSystem
print(UnitSystem())

unit_system_name : default
length : nanometer
force : kilojoule_per_mole / nanometer
energy : kilojoule_per_mole
charge : elementary_charge
dipole_moment : elementary_charge * nanometer
quadrupole_moment : elementary_charge * nanometer ** 2
polarizability : nanometer ** 3
atomic_numbers : dimensionless
dimensionless : dimensionless


A user can override any of the values within the unit system, as well as add new property types.  For example:

In [32]:
units = UnitSystem()
units.length = unit.angstrom
units.add_property_type('pressure', unit.atmosphere)
print(units)

unit_system_name : default
length : angstrom
force : kilojoule_per_mole / nanometer
energy : kilojoule_per_mole
charge : elementary_charge
dipole_moment : elementary_charge * nanometer
quadrupole_moment : elementary_charge * nanometer ** 2
polarizability : nanometer ** 3
atomic_numbers : dimensionless
dimensionless : dimensionless
pressure : standard_atmosphere
