# modelforge.curate : properties

This notebook takes a closer look at how properties are defined.

In [1]:
from modelforge.curate import Record, SourceDataset
from modelforge.curate.units import GlobalUnitSystem
from modelforge.curate.properties import AtomicNumbers, Positions, Energies, Forces, MetaData

from openff.units import unit

import numpy as np

## Properties

Each property inherits from the `PropertyBaseClass` pydantic model and has the following fields:

- `name` : str : unique identifier for the property
- `value` : ndarray : array containing the values (note: the `MetaData` property also allows this to be set to a str, int, float, or list in addition to a numpy array)
- `units` : unit.Unit : unit of `value`, defined with `openff.units`
- `classification` : PropertyClassification enum : specifies if the property is "atomic_numbers", "per_atom", "per_system", or "meta_data"
- `property_type` : PropertyType enum: specifies the type of property (e.g., length, energy, force, etc.) used for validating the specified `units`

`classification` and `property_type` are inherent to the property and do not need to be modified when a property is instantiated.  

While a default value is set for the `name` field of each property (e.g., "energies" for the `Energies` property), this value should typically be set at instantiation to a unique and appropriate key. Setting the `name` field is essential for records that contain multiple entries of the same property type, such as several energies (e.g., total_energy, dispersion_energy, electronic_energy, etc.).

The following demonstrates defining a record with the properties "atomic_numbers", "positions", "total_energies", "dispersion_energies", and "smiles":

In [2]:
atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))

positions = Positions(
    value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]), 
    units="nanometer"
)

total_energies = Energies(
    name="total_energies",
    value=np.array([[1]]), 
    units=unit.hartree
)

dispersion_energies = Energies(
    name="dispersion_energies",
    value=np.array([[0.1]]), 
    units=unit.hartree
)   

smiles = MetaData(name='smiles', value='[CH]')

record_mol1 = Record(name='mol1')
record_mol1.add_properties([total_energies, dispersion_energies, atomic_numbers, positions, smiles])

print(record_mol1)

name: mol1
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
 -  name='total_energies' value=array([[1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
 -  name='dispersion_energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None



As noted in the "basic_usage.ipynb" notebook, the `name` field is used as a unique key. An error will be raised if we try to add a property with the same key twice. For example, the following will raise an error because "total_energies" has already been added to the record.

In [3]:
record_mol1.add_property(total_energies)

ValueError: Property with name total_energies already exists in the record mol1.Set append_property=True to append to the existing property.
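
The unique-key behavior can be pictured with a small standalone sketch. `ToyRecord` below is a hypothetical stand-in written only for illustration; it is not the modelforge `Record` class, and it assumes nothing about how modelforge stores properties internally beyond "one entry per name".

```python
class ToyRecord:
    """Hypothetical stand-in for a record; NOT the modelforge implementation."""

    def __init__(self, name):
        self.name = name
        self._properties = {}  # maps property name -> value

    def add_property(self, prop_name, value):
        # Reject a second property that reuses an existing key.
        if prop_name in self._properties:
            raise ValueError(
                f"Property with name {prop_name} already exists "
                f"in the record {self.name}."
            )
        self._properties[prop_name] = value


toy = ToyRecord("mol1")
toy.add_property("total_energies", [[1.0]])

# Adding the same key a second time is rejected; the caller can catch the error.
try:
    toy.add_property("total_energies", [[1.0]])
    duplicate_rejected = False
except ValueError as err:
    duplicate_rejected = True
    message = str(err)

print(duplicate_rejected)
```

In a curation script, catching `ValueError` in this way is one option for skipping duplicate entries in the source data rather than stopping the whole run.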

## Appending properties

In some cases, the data for all configurations may not be available when a property is instantiated.  For example, the positions for different configurations may be stored in different .xyz files.  To handle these cases, the `Record` class can be instantiated with `append_property` set to `True`.  Adding a property a second time will then append the new data to the existing array.

For example, the following will initialize the same `Record` as above, but allow properties to be appended:


In [4]:
# append_property is a boolean flag, so pass True rather than the string "True"
record_mol1_append = Record(name='mol1', append_property=True)
record_mol1_append.add_properties([total_energies, dispersion_energies, atomic_numbers, positions, smiles])

Now, if we add "total_energies" a second time, this will not raise an error, rather it will append the energy to the existing array.

In [5]:
record_mol1_append.add_property(total_energies)

If we print the record, we will see that the "total_energies" property now contains `value` = `[[1], [1]]` and reports n_configs = 2.

In [6]:
print(record_mol1_append)



name: mol1
* n_atoms: 2
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
 -  name='total_energies' value=array([[1],
       [1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
 -  name='dispersion_energies' value=array([[0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=None



Note that this produces several warnings, because the number of configurations is no longer consistent within the record (printing the record calls the validate function in the class). Calling `validate` directly confirms this:

In [7]:
record_mol1_append.validate()



False
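
The kind of check that fails here can be sketched with plain numpy. This is an illustration only, assuming (as in the arrays printed above) that the first axis of each per-atom and per-system array indexes configurations. The helper `consistent_n_configs` is a hypothetical name, not part of modelforge.

```python
import numpy as np


def consistent_n_configs(arrays):
    """Return True if every array has the same length along axis 0."""
    counts = {name: arr.shape[0] for name, arr in arrays.items()}
    return len(set(counts.values())) <= 1


# Shapes mirror the record above: total_energies has two configurations,
# while positions and dispersion_energies still have one.
mismatched = {
    "positions": np.ones((1, 2, 3)),
    "total_energies": np.ones((2, 1)),
    "dispersion_energies": np.ones((1, 1)),
}

# The same properties once every array holds two configurations.
matched = {
    "positions": np.ones((2, 2, 3)),
    "total_energies": np.ones((2, 1)),
    "dispersion_energies": np.ones((2, 1)),
}

mismatched_ok = consistent_n_configs(mismatched)
matched_ok = consistent_n_configs(matched)
print(mismatched_ok, matched_ok)
```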

To resolve this, we can simply add "positions" and "dispersion_energies" a second time as well:

In [8]:
record_mol1_append.add_properties([dispersion_energies, positions])

In [9]:
print(record_mol1_append)

name: mol1
* n_atoms: 2
* n_configs: 2
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]],

       [[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
 -  name='total_energies' value=array([[1],
       [1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
 -  name='dispersion_energies' value=array([[0.1],
       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' prope

When appending to an existing property, the code first checks whether the shapes of the arrays are compatible.  For example, trying to add positions for a molecule with a different number of atoms will produce an error, because the shapes of the arrays are not compatible.

In [10]:
positions2 = Positions(value= [[[1,1,1], [2,2,2], [3,3,3]]], units=unit.nanometer)

record_mol1_append.add_property(positions2)

AssertionError: mol1: n_atoms of positions does not: 3 != 2.
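
A shape check of this kind can be sketched in a few lines of numpy. The function `append_configs` below is a hypothetical helper, not the modelforge implementation; it assumes that configurations are stacked along the first axis and that all remaining axes must match before concatenating.

```python
import numpy as np


def append_configs(existing, new):
    """Append new configurations along axis 0 after checking the other axes."""
    # Every axis except the first (configurations) must agree.
    if existing.shape[1:] != new.shape[1:]:
        raise ValueError(f"incompatible shapes: {existing.shape} vs {new.shape}")
    return np.concatenate([existing, new], axis=0)


# One configuration of a 2-atom system, shape (1, 2, 3).
two_atoms = np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]])
# One configuration of a 3-atom system, shape (1, 3, 3).
three_atoms = np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]])

# Compatible: the result holds two configurations of the 2-atom system.
combined = append_configs(two_atoms, two_atoms)
print(combined.shape)

# Incompatible: the number of atoms differs, so the append is rejected.
try:
    append_configs(two_atoms, three_atoms)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```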

The units are also compared and converted if necessary before appending.  For example, we defined energy in units of hartree above;  if we define energy in a different unit and append, it will automatically be converted to hartrees. 

In [11]:
total_energies2 = Energies(
    name="total_energies",
    value=np.array([[1]]), 
    units=unit.kilocalories_per_mole
)
record_mol1_append.add_property(total_energies2)

print(record_mol1_append)



name: mol1
* n_atoms: 2
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions']):
 -  name='positions' value=array([[[1., 1., 1.],
        [2., 2., 2.]],

       [[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
* per-system properties: (['total_energies', 'dispersion_energies']):
 -  name='total_energies' value=array([[1.       ],
       [1.       ],
       [0.0015936]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=3 n_atoms=None
 -  name='dispersion_energies' value=array([[0.1],
       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classificat
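
The third entry in "total_energies" can be checked by hand. The sketch below repeats the conversion with plain arithmetic, assuming the approximate factor of 627.5095 kcal/mol per hartree; in the notebook itself the conversion is handled automatically by the unit machinery, so this is only a sanity check.

```python
# Approximate conversion factor (assumption): 1 hartree ~ 627.5095 kcal/mol.
KCAL_PER_MOL_PER_HARTREE = 627.5095


def kcal_per_mol_to_hartree(value):
    """Convert an energy from kcal/mol to hartree."""
    return value / KCAL_PER_MOL_PER_HARTREE


converted = kcal_per_mol_to_hartree(1.0)
print(f"{converted:.7f}")  # prints 0.0015936
```

This agrees with the value `0.0015936` shown in the appended array.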

## Adding properties directly to a dataset

Rather than creating an instance of the `Record` class and adding this to the dataset, we can use the `SourceDataset` class directly. The functions in `SourceDataset` effectively just provide wrappers to the functions that exist within the `Record` class. As such, both approaches are equivalent but one may be more convenient depending on the structure of the original dataset that is being curated. 

The following code performs the same task in two ways. First we will define the common elements (i.e., the dataset and the properties):

In [12]:
# define the dataset
new_dataset = SourceDataset('test_dataset')

# define the properties
atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))
positions = Positions(
    value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]]), 
    units="nanometer"
)

total_energies = Energies(
    name="total_energies",
    value=np.array([[1]]), 
    units=unit.hartree
)



Approach 1: Create a Record, add properties to the Record, add Record to the dataset

In [13]:
record_mol1 = Record("mol1")
record_mol1.add_properties([atomic_numbers, positions, total_energies])

new_dataset.add_record(record_mol1)

Approach 2: Create a Record within the dataset, add properties to this record within the dataset

In [14]:
new_dataset.create_record('mol2')
new_dataset.add_properties("mol2", [atomic_numbers, positions, total_energies])
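
The idea that the dataset functions are thin wrappers around the record functions can be illustrated with a toy example. `ToyRecord` and `ToyDataset` below are hypothetical classes written for this sketch; they only mimic the method names used above and make no claim about how `Record` or `SourceDataset` are actually implemented.

```python
class ToyRecord:
    """Hypothetical record: stores properties by name."""

    def __init__(self, name):
        self.name = name
        self.properties = {}

    def add_properties(self, props):
        for prop_name, value in props:
            self.properties[prop_name] = value


class ToyDataset:
    """Hypothetical dataset: holds records and forwards calls to them."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def add_record(self, record):
        self.records[record.name] = record

    def create_record(self, record_name):
        self.records[record_name] = ToyRecord(record_name)

    def add_properties(self, record_name, props):
        # Thin wrapper: look up the record and call its own method.
        self.records[record_name].add_properties(props)


props = [("atomic_numbers", [[1], [6]]), ("total_energies", [[1.0]])]
dataset = ToyDataset("test_dataset")

# Approach 1: build the record first, then add it to the dataset.
rec = ToyRecord("mol1")
rec.add_properties(props)
dataset.add_record(rec)

# Approach 2: create the record inside the dataset and add properties there.
dataset.create_record("mol2")
dataset.add_properties("mol2", props)

same_contents = (
    dataset.records["mol1"].properties == dataset.records["mol2"].properties
)
print(same_contents)
```

Both routes end with records holding identical properties, which is why the choice between them can be made purely on convenience.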

The dataset can also be instantiated with `append_property` set to `True`; the wrapper function within the dataset provides the same functionality as when interacting directly with a record. 

In [16]:
appendable_dataset = SourceDataset(name="appendable", append_property=True)