# modelforge.curate : Record and SourceDataset

This notebook focuses on functionality within the `Record` and `SourceDataset` classes.

In [1]:
from modelforge.curate import Record, SourceDataset
from modelforge.utils.units import GlobalUnitSystem
from modelforge.curate import AtomicNumbers, Positions, Energies, Forces, MetaData

from openff.units import unit

import numpy as np

## Initializing records and datasets
To start, we will create a new instance of the `SourceDataset` class to store the dataset. We will populate this with 10 records, each with 3 configurations.

In [2]:
new_dataset = SourceDataset(name="test_dataset")

for i in range(0,10):
    record = Record(f"mol_{i}")
    
    atomic_numbers = AtomicNumbers(value=np.array([[1], [6]]))

    positions = Positions(
        value=np.array([[[i, 1.0, 1.0], [2.0, 2.0, 2.0]],
                        [[i, 2.0, 1.0], [2.0, 2.0, 2.0]],
                        [[i, 3.0, 1.0], [2.0, 2.0, 2.0]]]), 
        units="nanometer"
    )
    
    total_energies = Energies(
        name="total_energies",
        value=np.array([[i], 
                        [i+0.1], 
                        [i+0.2]]), 
        units=unit.hartree
    )
    forces = Forces(
        name="forces",
        value=np.array([[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
                        [[10.0, 2.0, 1.0], [2.0, 2.0, 2.0]],
                        [[20.0, 3.0, 1.0], [2.0, 2.0, 2.0]]]), 
        units=unit.kilocalorie_per_mole / unit.nanometer,
    )
    record.add_properties([atomic_numbers, positions, total_energies, forces])
    new_dataset.add_record(record)
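
The arrays passed to each property follow a consistent shape convention, which can be read off the values used in the loop above. The sketch below only checks those shapes with NumPy; it does not call `modelforge`, and the convention is inferred from this example rather than taken from the library documentation.

```python
import numpy as np

n_configs, n_atoms = 3, 2

# Placeholder arrays with the same shapes as those built in the loop above.
atomic_numbers = np.array([[1], [6]])          # (n_atoms, 1)
positions = np.zeros((n_configs, n_atoms, 3))  # (n_configs, n_atoms, 3)
total_energies = np.zeros((n_configs, 1))      # (n_configs, 1)
forces = np.zeros((n_configs, n_atoms, 3))     # (n_configs, n_atoms, 3)

# Per-atom properties share their first two axes; per-system properties
# carry one row per configuration.
shapes_consistent = (
    atomic_numbers.shape == (n_atoms, 1)
    and positions.shape == forces.shape == (n_configs, n_atoms, 3)
    and total_energies.shape == (n_configs, 1)
)
print(shapes_consistent)
```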



### Examining the dataset
Let us examine the dataset:

In [3]:
print("total configurations: ", new_dataset.total_configs())
print("total records: ", new_dataset.total_records())

import pprint
print("dataset summary:")
pprint.pprint(new_dataset.generate_dataset_summary())


total configurations:  30
total records:  10
dataset summary:
{'name': 'test_dataset',
 'properties': {'atomic_numbers': {'classification': 'atomic_numbers'},
                'forces': {'classification': 'per_atom',
                           'units': 'kilojoule_per_mole / nanometer'},
                'positions': {'classification': 'per_atom',
                              'units': 'nanometer'},
                'total_energies': {'classification': 'per_system',
                                   'units': 'kilojoule_per_mole'}},
 'total_configurations': 30,
 'total_records': 10}


### Extracting/Updating records

#### Print a record
We can print a summary of any individual record using the `print_record` function.

In [4]:
new_dataset.print_record("mol_0")

name: mol_0
* n_atoms: 2
* n_configs: 3
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]],

       [[0., 2., 1.],
        [2., 2., 2.]],

       [[0., 3., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=3 n_atoms=2
 -  name='forces' value=array([[[ 1.,  1.,  1.],
        [ 2.,  2.,  2.]],

       [[10.,  2.,  1.],
        [ 2.,  2.,  2.]],

       [[20.,  3.,  1.],
        [ 2.,  2.,  2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=3 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0. ],
       [0.1],
       [0.2]]) units=<Unit('hartree')> classifica

#### Extract a copy of a record
We can extract a copy of any record using the `get_record` function. 

In [5]:
record_temp = new_dataset.get_record("mol_0")
print(record_temp)

name: mol_0
* n_atoms: 2
* n_configs: 3
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]],

       [[0., 2., 1.],
        [2., 2., 2.]],

       [[0., 3., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=3 n_atoms=2
 -  name='forces' value=array([[[ 1.,  1.,  1.],
        [ 2.,  2.,  2.]],

       [[10.,  2.,  1.],
        [ 2.,  2.,  2.]],

       [[20.,  3.,  1.],
        [ 2.,  2.,  2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=3 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0. ],
       [0.1],
       [0.2]]) units=<Unit('hartree')> classifica

#### Update a record in the dataset
Since `get_record` returns a copy, changes made to that copy do not affect the dataset; the `update_record` function must be used to write the modified record back. Here we add metadata to the record and then update it in the dataset.

In [6]:
smiles = MetaData(name='smiles', value='[CH]')

record_temp.add_property(smiles)

new_dataset.update_record(record_temp)

new_dataset.print_record("mol_0")

name: mol_0
* n_atoms: 2
* n_configs: 3
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]],

       [[0., 2., 1.],
        [2., 2., 2.]],

       [[0., 3., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=3 n_atoms=2
 -  name='forces' value=array([[[ 1.,  1.,  1.],
        [ 2.,  2.,  2.]],

       [[10.,  2.,  1.],
        [ 2.,  2.,  2.]],

       [[20.,  3.,  1.],
        [ 2.,  2.,  2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=3 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0. ],
       [0.1],
       [0.2]]) units=<Unit('hartree')> classifica
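
The copy-then-update pattern can be illustrated without `modelforge`. The sketch below uses a hypothetical `ToyStore` class (not part of the library) that hands out deep copies, so edits to a retrieved record stay local until they are written back explicitly.

```python
import copy


class ToyStore:
    """Hypothetical stand-in for a dataset that returns copies of its records."""

    def __init__(self):
        self._records = {}

    def add_record(self, name, record):
        self._records[name] = copy.deepcopy(record)

    def get_record(self, name):
        # Return a copy so callers cannot mutate the stored record in place.
        return copy.deepcopy(self._records[name])

    def update_record(self, name, record):
        # Overwrite the stored record with the caller's modified copy.
        self._records[name] = copy.deepcopy(record)


store = ToyStore()
store.add_record("mol_0", {"n_configs": 3})

local = store.get_record("mol_0")
local["smiles"] = "[CH]"

# The stored record is unchanged until update_record is called.
unchanged = "smiles" not in store.get_record("mol_0")

store.update_record("mol_0", local)
updated = store.get_record("mol_0")["smiles"]
print(unchanged, updated)
```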

#### Removing a record from a dataset

We can remove a record using the `remove_record` function in the `SourceDataset` class. 

In [7]:
print("total_records: ", new_dataset.total_records())
new_dataset.remove_record("mol_9")
print("total_records: ", new_dataset.total_records())

total_records:  10
total_records:  9


#### Slicing a record

We can slice a record, returning a copy of the record that includes only a subset of its configurations. The slice is applied to all properties within the record.

This can be done at the level of a record or via a wrapper function in the dataset.

The code below returns only the first of the 3 configurations; as the output shows, `min=0, max=1` yields a record with `n_configs: 1`.

In [8]:
record_sliced = record_temp.slice_record(min=0, max=1)

print(record_sliced)

name: mol_0
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
 -  name='forces' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=1 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0.]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=No

In [9]:
record_sliced = new_dataset.slice_record("mol_0", min=0, max=1)
print(record_sliced)

name: mol_0
* n_atoms: 2
* n_configs: 1
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=1 n_atoms=2
 -  name='forces' value=array([[[1., 1., 1.],
        [2., 2., 2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=1 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0.]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=1 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' value='[CH]' units=<Unit('dimensionless')> classification='meta_data' property_type='meta_data' n_configs=None n_atoms=No
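
The output above suggests that `min` is inclusive and `max` is exclusive, like a Python slice. The sketch below shows that interpretation on a plain NumPy array shaped like the `total_energies` of `mol_0`; treating `slice_record` as equivalent to `array[min:max]` along the configuration axis is an assumption based on this output, not a statement about the library's implementation.

```python
import numpy as np

# Energies for mol_0, shape (n_configs, 1).
energies = np.array([[0.0], [0.1], [0.2]])

min_index, max_index = 0, 1

# Half-open slice along the configuration axis: includes min, excludes max.
sliced = energies[min_index:max_index]

# Under the same assumption, the first two configurations need max=2.
first_two = energies[0:2]

print(sliced.shape, first_two.shape)
```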

#### Limiting to a subset of atomic numbers

We can query whether a record contains only atomic numbers within a specified set using the `contains_atomic_numbers` function in the `Record` class. This returns `True` if every atomic number in the record is present in the provided array, and `False` if any atomic number in the record is missing from it.

Note that this function does not typically need to be called directly, as the `subset_dataset` function in the `SourceDataset` class provides a wrapper for this functionality on the entire dataset (discussed later).

In [10]:
record_temp.contains_atomic_numbers(np.array([1,6]))

True

In [11]:
record_temp.contains_atomic_numbers(np.array([1,8]))

False
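
The check described above is a subset test: the record passes when each of its atomic numbers appears in the allowed array. A minimal sketch with NumPy follows; the helper name `all_in_allowed` is hypothetical and is not the library's implementation.

```python
import numpy as np


def all_in_allowed(record_numbers, allowed):
    """Return True if every atomic number in the record is in `allowed`."""
    return bool(np.isin(record_numbers, allowed).all())


record_numbers = np.array([[1], [6]])  # H and C, as in mol_0

in_h_c = all_in_allowed(record_numbers, np.array([1, 6]))
in_h_o = all_in_allowed(record_numbers, np.array([1, 8]))
in_h_c_o = all_in_allowed(record_numbers, np.array([1, 6, 8]))

print(in_h_c, in_h_o, in_h_c_o)
```

Note that the allowed array may contain extra elements: a record with only H and C still passes when the allowed set is H, C, and O.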

#### Removing high force configurations

Often, we wish to remove configurations with very high forces. The `remove_high_force_configs` function in the `Record` class returns a copy of the record that excludes those configurations where the magnitude of the force exceeds the specified threshold. By default, this filters on the property named "forces" (i.e., it looks for a property with the name "forces" within the record); a different name can be passed if the force property is named differently.

Note that this function does not typically need to be called directly, as the `subset_dataset` function in the `SourceDataset` class provides a wrapper for this functionality on the entire dataset (discussed later).

For example, below we filter out any configurations with a force greater than 15 kcal/mol/nm, which eliminates the last configuration of the record (see the initialization above).

In [12]:
record_max_force = record_temp.remove_high_force_configs(unit.Quantity(15, unit.kilocalorie_per_mole/unit.nanometer), "forces")

print(record_max_force)

name: mol_0
* n_atoms: 2
* n_configs: 2
* atomic_numbers:
 -  name='atomic_numbers' value=array([[1],
       [6]]) units=<Unit('dimensionless')> classification='atomic_numbers' property_type='atomic_numbers' n_configs=None n_atoms=2
* per-atom properties: (['positions', 'forces']):
 -  name='positions' value=array([[[0., 1., 1.],
        [2., 2., 2.]],

       [[0., 2., 1.],
        [2., 2., 2.]]]) units=<Unit('nanometer')> classification='per_atom' property_type='length' n_configs=2 n_atoms=2
 -  name='forces' value=array([[[ 1.,  1.,  1.],
        [ 2.,  2.,  2.]],

       [[10.,  2.,  1.],
        [ 2.,  2.,  2.]]]) units=<Unit('kilocalorie_per_mole / nanometer')> classification='per_atom' property_type='force' n_configs=2 n_atoms=2
* per-system properties: (['total_energies']):
 -  name='total_energies' value=array([[0. ],
       [0.1]]) units=<Unit('hartree')> classification='per_system' property_type='energy' n_configs=2 n_atoms=None
* meta_data: (['smiles'])
 -  name='smiles' va
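
The filtering logic can be sketched with NumPy alone. How the library defines the "magnitude" of the force is not stated here, so the sketch computes two plausible criteria, the largest absolute component and the largest per-atom vector norm; both are assumptions, and for this data they select the same configurations.

```python
import numpy as np

# Forces for mol_0, shape (n_configs, n_atoms, 3), in kcal/mol/nm.
forces = np.array([
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
    [[10.0, 2.0, 1.0], [2.0, 2.0, 2.0]],
    [[20.0, 3.0, 1.0], [2.0, 2.0, 2.0]],
])
threshold = 15.0  # same units as `forces`

# Criterion A (assumption): largest absolute component per configuration.
max_component = np.abs(forces).max(axis=(1, 2))

# Criterion B (assumption): largest per-atom vector norm per configuration.
max_norm = np.linalg.norm(forces, axis=2).max(axis=1)

keep_by_component = max_component <= threshold
keep_by_norm = max_norm <= threshold

# Boolean indexing along the configuration axis drops the rejected rows.
filtered = forces[keep_by_component]
print(keep_by_component, keep_by_norm, filtered.shape)
```

The same boolean mask would need to be applied to every per-configuration property (positions, energies) so that the record stays consistent.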

### Subsetting a dataset

`SourceDataset` includes a function called `subset_dataset` which returns a copy of the dataset with various filters applied. The filters that can be applied include:

- `total_records`: Maximum number of records to include in the subset.
- `total_configurations`: Total number of configurations to include in the subset.
- `max_configurations_per_record`: Maximum number of configurations to include per record. If `None`, all configurations in a record will be included.
- `atomic_numbers_to_limit`: An array of atomic species to limit the dataset to. Any molecules that contain elements outside of this list will be ignored.
- `max_force`: If set, configurations with forces greater than this value will be removed.
- `final_configuration_only`: If `True`, only the final configuration of each record will be included in the subset.

Note that `total_records` and `total_configurations` cannot be used together.

Below, we create a new dataset limited to at most 2 configurations per record and 10 configurations in total.

In [13]:
dataset_subset = new_dataset.subset_dataset(new_dataset_name="dataset_subset", total_configurations=10, max_configurations_per_record=2)

print(dataset_subset.total_records())
print(dataset_subset.total_configs())



5
10
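
The counts above follow from simple arithmetic: with at most 2 configurations taken per record, 10 configurations are reached after 5 records. The sketch below reproduces this with plain Python, assuming records are visited in order and filled greedily until the total is reached; that ordering is an assumption, not a description of the library's internals.

```python
# Nine records remain (mol_9 was removed earlier), each with 3 configurations.
configs_per_record = {f"mol_{i}": 3 for i in range(9)}

total_configurations = 10
max_configurations_per_record = 2

selected = {}
remaining = total_configurations
for name, n_configs in configs_per_record.items():
    if remaining == 0:
        break
    # Take as many configurations as both limits allow.
    take = min(n_configs, max_configurations_per_record, remaining)
    selected[name] = take
    remaining -= take

n_records = len(selected)
n_configs_total = sum(selected.values())
print(n_records, n_configs_total)
```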


## SourceDataset backend sqlite database

The `SourceDataset` class stores records within a sqlite database rather than in memory. The name and location of this database can be set when the dataset is instantiated. If these are not set, the default location will be "./" and the database will be named after the dataset (replacing any spaces with underscores). The code below produces the same dataset as the defaults would if no values were provided.

In [14]:
new_dataset2 = SourceDataset(name="new dataset2", local_db_dir="./", local_db_name="new_dataset2.sqlite")
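
The default naming rule can be sketched as a small helper. The function name `default_db_path` is hypothetical, and the `.sqlite` extension is inferred from the file name used in the cell above rather than from the library's source.

```python
import os


def default_db_path(dataset_name, local_db_dir="./", local_db_name=None):
    """Build the database path, falling back to a name derived from the dataset."""
    if local_db_name is None:
        # Replace spaces with underscores; the extension is an assumption.
        local_db_name = dataset_name.replace(" ", "_") + ".sqlite"
    return os.path.join(local_db_dir, local_db_name)


path = default_db_path("new dataset2")
print(path)
```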

The use of a sqlite backend not only reduces the memory footprint, but also allows a dataset to be loaded from an existing database.  Being able to load from the database allows us to avoid having to go through the processing of a dataset (i.e., setting up individual properties, Records, etc.). 

The following code will load up the subsetted dataset generated in the prior cells:

In [15]:
new_dataset2  = SourceDataset(name="new dataset2", local_db_dir="./", local_db_name="dataset_subset.sqlite", read_from_local_db=True)

print(new_dataset2.total_records())
print(new_dataset2.total_configs())

5
10


When subsetting a dataset, we can also specify the name and location of the database that will be generated. Otherwise, the same default behavior is used (i.e., based on dataset name).  This function will return an error if the new and old datasets have the same name. 