# BuRNN tutorial: Preparing the training dataset and training the NN model

In this tutorial, methanol in water is used as the model system. In BuRNN terms, methanol forms the inner region, the water molecules within a 0.5 nm radius of the methanol form the buffer region, and the rest of the box is the outer region. The tutorial focuses on generating the training data set and training the neural network (NN) model.
## Training dataset preparation
The database of QM structures for NN model training is generated from snapshots of a GROMOS MD simulation, as described in the original paper. The QM region of the system (inner + buffer region) is extracted from each saved snapshot of the initial MD simulation by the GROMOS program filter. The filter program produces a single output file containing the coordinates of all extracted QM regions (for all snapshots). This file is the starting point for the tutorial. Look at the example of the filter [output file](https://github.com/LierB/gromos_tutorial_livecoms/blob/burnn_tutorial_rc/tutorial_files/t_06/train_dataset_tutorial/filter_output_example.cnf). The first step is to extract the individual QM regions into separate cnf files (see examples [here](https://github.com/LierB/gromos_tutorial_livecoms/tree/burnn_tutorial_rc/tutorial_files/t_06/train_dataset_tutorial/separated_cnf_files), stored in the buffer_pls_inner_region_cnf_files_example directory). The next step is to calculate QM energies and gradients for all the snapshots.
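The splitting step can be sketched in a few lines of Python. This is only an illustration, not the tool used to prepare the tutorial files: it assumes that the filter output opens with a single TITLE block and that every frame starts with a TIMESTEP block. The function names, the output file naming and the toy input are hypothetical.

```python
from pathlib import Path


def split_cnf_frames(text):
    """Split a multi-frame GROMOS file into one string per frame.

    Assumptions: the file opens with a single TITLE block, and every
    frame starts with a TIMESTEP block.
    """
    title, frames, current = [], [], None
    in_title = False
    for line in text.splitlines():
        keyword = line.strip()
        if keyword == "TITLE" and current is None:
            in_title = True
        if in_title:
            title.append(line)
            in_title = keyword != "END"
            continue
        if keyword == "TIMESTEP":
            if current is not None:
                frames.append(current)  # close the previous frame
            current = list(title)       # each frame gets its own TITLE copy
        if current is not None:
            current.append(line)
    if current is not None:
        frames.append(current)
    return ["\n".join(frame) + "\n" for frame in frames]


def write_frames(frames, out_dir, prefix="qm_region"):
    """Write each frame to <out_dir>/<prefix>_00001.cnf, _00002.cnf, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, frame in enumerate(frames, start=1):
        path = out_dir / f"{prefix}_{number:05d}.cnf"
        path.write_text(frame)
        paths.append(path)
    return paths


# Made-up two-frame example (one atom per frame) to exercise the splitter.
example = """TITLE
toy filter output
END
TIMESTEP
        0    0.000000000
END
POSITION
    1 MEOH  C          1    0.100000000    0.200000000    0.300000000
END
TIMESTEP
      100    0.200000000
END
POSITION
    1 MEOH  C          1    0.110000000    0.210000000    0.310000000
END
"""
frames = split_cnf_frames(example)
print(f"Extracted {len(frames)} frames")
```

Calling write_frames(frames, 'separated_cnf_files') would then store one cnf file per QM region. If the real filter output uses a different block layout, the frame delimiter has to be adapted.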

In this tutorial, an example QM training data set has been generated using the semi-empirical program [MOPAC](http://openmopac.net/manual/index.html). In practice, the choice of the QM software to be used is entirely up to the user. However, in order to prepare the training data set in a reasonable time, it is necessary to automate the QM calculations. In addition, it takes a certain amount of time to perform the QM calculations for all the snapshots generated by MD. Therefore, a small [QM data set](https://github.com/LierB/gromos_tutorial_livecoms/tree/burnn_tutorial_rc/tutorial_files/t_06/train_dataset_tutorial/QM_dataset_example) has been prepared in advance for you. The entire process is described in the following section. 

### Generating the QM dataset using MOPAC
QM energies and gradients were calculated for all GROMOS MD snapshots using the [MOPAC software](http://openmopac.net/manual/index.html). To obtain force-field-independent data (real QM data), the data set was extended with snapshots from MOPAC energy minimizations. The first step was therefore to run an energy minimization in MOPAC for each GROMOS MD snapshot. The snapshots were converted from cnf to the MOPAC input format (.mop), and the energy minimization was run with the following keywords: PM7 GRAD AUX(PRECISION = 9, XP, XS, XW) CHARGE=0. Next, the individual minimization steps were extracted from the [MOPAC minimization output files](https://github.com/LierB/gromos_tutorial_livecoms/tree/burnn_tutorial_rc/tutorial_files/t_06/train_dataset_tutorial/MOPAC_minimzation_files) (.aux). Before these structures could be used, the correct geometry of the SPC water molecules in the buffer region had to be restored. This was done by running a single-step GROMOS MD simulation with the integrator turned off, which (re)applied SHAKE to the water molecules within the buffer region. The buffer region was then extracted from the individual minimization snapshots (with SHAKEn waters). The final step was to perform single-point QM calculations (keywords: PM7 GRAD AUX(PRECISION = 9, XP, XS, XW) PRECISE 1SCF CHARGE=0) separately for each inner plus buffer region and for the buffer region alone. A self-written Python module automated the whole process.
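As an illustration of the cnf to .mop conversion, the sketch below writes a MOPAC input in Cartesian format: a keyword line, two description lines, and one line per atom with an optimization flag after each coordinate. It is not the authors' module. The function name and the toy water geometry are hypothetical, and the element symbols are assumed to be known already (cnf files store atom names, which would have to be mapped to elements first). GROMOS coordinates are in nm, so they are converted to Angstrom.

```python
NM_TO_ANGSTROM = 10.0

# Keywords quoted in the text above for the energy minimization.
MINIMIZATION_KEYWORDS = "PM7 GRAD AUX(PRECISION = 9, XP, XS, XW) CHARGE=0"


def build_mop_input(keywords, symbols, coords_nm, title="QM region", optimize=True):
    """Return the text of a MOPAC input file with Cartesian coordinates.

    symbols   : element symbols, one per atom
    coords_nm : (x, y, z) tuples in nm, as stored in GROMOS cnf files
    optimize  : flag 1 lets MOPAC optimize a coordinate, flag 0 freezes it
    """
    if len(symbols) != len(coords_nm):
        raise ValueError("symbols and coords_nm must have the same length")
    flag = 1 if optimize else 0
    # Line 1: keywords, line 2: title, line 3: second description line.
    lines = [keywords, title, ""]
    for symbol, xyz in zip(symbols, coords_nm):
        x, y, z = (value * NM_TO_ANGSTROM for value in xyz)
        lines.append(
            f"{symbol:<2s} {x:12.6f} {flag} {y:12.6f} {flag} {z:12.6f} {flag}"
        )
    return "\n".join(lines) + "\n"


# Toy water molecule with made-up coordinates in nm.
mop_text = build_mop_input(
    MINIMIZATION_KEYWORDS,
    ["O", "H", "H"],
    [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (-0.033, 0.094, 0.0)],
)
print(mop_text)
```

For the single-point calculations, the same function can be called with the second keyword string from the text.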

The final output was stored in two directories. [The first one](https://github.com/LierB/gromos_tutorial_livecoms/tree/burnn_tutorial_rc/tutorial_files/t_06/train_dataset_tutorial/QM_dataset_example/inner_plus_buffer_region) contains the MOPAC output files (.aux) for the inner plus buffer region, whereas [the second one](https://github.com/LierB/gromos_tutorial_livecoms/tree/burnn_tutorial_rc/tutorial_files/t_06/train_dataset_tutorial/QM_dataset_example/buffer_region) contains those for the buffer region. Note that by using MOPAC energy minimization, we have increased the size of the dataset from 2 ([the original number of MD snapshots](https://github.com/LierB/gromos_tutorial_livecoms/tree/burnn_tutorial_rc/tutorial_files/t_06/train_dataset_tutorial/separated_cnf_files)) to 860. This demonstrates the potential of energy minimization to significantly increase the size of the training data. On the other hand, energy minimization can produce many very similar structures, especially towards the end of the minimization process. Clustering of the training structures may therefore be advisable prior to model training.
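One simple way to thin out near-identical minimization steps is a greedy filter that keeps a structure only if it differs from every structure already kept by more than a chosen RMSD. The sketch below is a stand-in for a proper clustering method, not the procedure used for the tutorial data. It assumes identical atom order in all structures and does no structural alignment, which is only reasonable for steps from the same minimization run. The threshold value is hypothetical.

```python
import math


def rmsd(structure_a, structure_b):
    """RMSD between two structures with identical atom order (no alignment)."""
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(structure_a, structure_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(structure_a))


def select_diverse(structures, threshold):
    """Return indices of structures that differ from all kept ones by > threshold."""
    kept = []
    for index, structure in enumerate(structures):
        if all(rmsd(structure, structures[k]) > threshold for k in kept):
            kept.append(index)
    return kept


# Three made-up two-atom structures (Angstrom): the second is almost
# identical to the first, the third is clearly different.
structures = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.001, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
]
kept_indices = select_diverse(structures, threshold=0.05)
print(kept_indices)
```

The cost grows with the square of the number of kept structures, which is acceptable for a few hundred snapshots but not for very large datasets.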

### Building ASE database
Our QM data needs to be stored in an [ASE database](https://wiki.fysik.dtu.dk/ase/ase/db/db.html) so that it can be read by SchNetPack (the software we are going to use to train the NN model). In the following part of the tutorial we will build an ASE database from our training snapshots; this database will then be used for NN model training. For that purpose we will use the in-house module additional_spk_utils. First, we need the additional_spk_utils.Build_AseDb_From_Mopac_Aux class to create an ASE database. The class takes the following arguments:
- complex_path = path to the MOPAC output files for the whole QM regions (inner + buffer regions, in our case MOPAC_results/minimization/buffer_pls_inner/aux_out)
- buffer_path = path to the MOPAC output files for the buffer regions (in our case MOPAC_results/minimization/buffer/aux_out)
- inner_region_size = number of atoms in the inner region
- db_name = name of the resulting ASE database
- db_properties = the properties which will be stored in the database (in our case: complex energy, buffer energy, energy (to be predicted), forces). Energy and forces are mandatory for the BuRNN approach
- metadata = dictionary with a free-form description of the database
- reference_energies = reference energies (energies in vacuum) for all the components in the QM region.

### Energy Normalization
This is a good place to describe the last argument, reference_energies, in more detail. As mentioned above, in the BuRNN approach the NN model predicts the interaction energy between the inner and buffer regions instead of the absolute energy (see the [original BuRNN paper](https://pubs.acs.org/doi/full/10.1021/acs.jpclett.2c00654)). The reference energies are used to calculate this interaction energy. Consider an example in which the inner + buffer region is composed of methanol (inner region) and 15 water molecules (buffer region). The interaction energy is calculated in the following way:
- The absolute energies of the inner + buffer region and of the buffer region were calculated by MOPAC.
- The first number in the reference_energies argument corresponds to the energy of methanol in vacuum, whereas the second is the energy of one water molecule in vacuum (both were calculated by MOPAC).
- First, we sum the vacuum energies of the individual components of each region (inner + buffer and buffer).
- For our example the summation is done in the following way:
    - inner + buffer region = reference energy of methanol + 15 * reference energy of a water molecule
    - buffer region = 15 * reference energy of a water molecule
- The sums calculated in the previous step are then subtracted from the corresponding absolute energies calculated by MOPAC.

What did we get in the last step? We got interaction energies. For the inner + buffer region we got the interaction energy between methanol and the water molecules plus that among the water molecules themselves. For the buffer region we got the interaction energy among the water molecules only. By subtracting the second number from the first we obtain the interaction energy between methanol and the water molecules (i.e. between the inner and buffer regions). This number is stored under the name energy in the db_properties argument. A similar procedure is applied to the forces. The difference is that for forces we are interested in atomic contributions. Therefore, for each buffer atom the force from the buffer-region calculation is subtracted from the force from the inner + buffer calculation. The forces on the inner-region atoms are passed on without subtraction (these atoms are not present in the buffer-region calculation, so there is nothing to subtract). The resulting values are stored under the name forces in the db_properties argument.
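The normalization described above can be written down compactly. The following sketch is not the code of additional_spk_utils. The function names and all numbers are made up for illustration, and it assumes that the inner-region atoms come first in the inner + buffer structure and that the buffer atoms appear in the same order in both calculations.

```python
def interaction_energy(e_complex, e_buffer, ref_inner, ref_water, n_water):
    """Interaction energy between the inner and buffer regions."""
    # Interactions within the whole QM region (methanol-water + water-water).
    complex_interaction = e_complex - (ref_inner + n_water * ref_water)
    # Interactions within the buffer region (water-water only).
    buffer_interaction = e_buffer - n_water * ref_water
    return complex_interaction - buffer_interaction


def interaction_forces(f_complex, f_buffer, inner_region_size):
    """Per-atom forces: inner atoms unchanged, buffer atoms complex - buffer."""
    inner = [tuple(force) for force in f_complex[:inner_region_size]]
    buffer = [
        tuple(c - b for c, b in zip(force_c, force_b))
        for force_c, force_b in zip(f_complex[inner_region_size:], f_buffer)
    ]
    return inner + buffer


# Made-up energies (arbitrary units) for methanol + 15 water molecules.
energy = interaction_energy(
    e_complex=-1000.0, e_buffer=-900.0,
    ref_inner=-49.0, ref_water=-58.0, n_water=15,
)
print(energy)

# Made-up forces: one inner atom followed by two buffer atoms.
forces = interaction_forces(
    f_complex=[(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.2, 0.0, -0.1)],
    f_buffer=[(0.4, 0.5, 0.0), (0.1, 0.0, -0.1)],
    inner_region_size=1,
)
print(forces)
```

Note that the water reference energies cancel in the final difference, so the interaction energy reduces to the energy of the inner + buffer region minus the energy of the buffer region minus the reference energy of methanol.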

In [1]:
import additional_spk_utils
import os

The class additional_spk_utils.Build_AseDb_From_Mopac_Aux also takes the argument metadata, in which you can describe the database. It is good practice to record the important information about the database there, so that it stays clear to other users and to yourself in the future.

In [2]:
# Example of dataset description
metadata = {'System' : 'MeOH in water',
            'num. of structures' : 860,
            'QM software' : 'MOPAC',
            'Energy minimization': 'Yes',
            'clustering' : 'No',
            'Energy units' : 'kcal/mol',
            'Force units' : 'kcal/mol/Angstrom',
            'distance units' : 'Angstrom'}

In [3]:
# building of ASE database
# NOTE: the program will remove the previous ASE database of the same name specified in db_name argument
additional_spk_utils.Build_AseDb_From_Mopac_Aux(complex_path=os.path.join('.', 'QM_dataset_example', 'inner_plus_buffer_region'),
                                                buffer_path=os.path.join('.', 'QM_dataset_example', 'buffer_region'),
                                                db_name='meoh_trial.db', 
                                                db_properties=['complex_energy', 'buffer_energy', 'energy', 'forces'], 
                                                inner_region_size=6, 
                                                reference_energies=[-48.9383958635117, -57.7996759075731],
                                                metadata=metadata).build_db()

In [4]:
# to check that the database was created
db = additional_spk_utils.Build_AseDb(load_existing_database=True, db_name='meoh_trial.db').create_db()
print(f'Number of structures: {len(db)}')
print(f'Interaction energy for the first structure: {db[0]["energy"][0]:.2f} kcal/mol')

Number of structures: 667
Interaction energy for the first structure: -19.37 kcal/mol


In [7]:
# show metadata
db.get_metadata()

{'System': 'MeOH in water',
 'num. of structures': 860,
 'QM software': 'MOPAC',
 'Energy minimization': 'Yes',
 'clustering': 'No',
 'Energy units': 'kcal/mol',
 'Force units': 'kcal/mol/Angstrom',
 'distance units': 'Angstrom'}

## Model training

The dataset of training structures was generated in the previous sections. We can now proceed to train the NN model (a machine-learned potential). Our model will be based on the [SchNet](https://pubs.aip.org/aip/jcp/article-abstract/148/24/241722/962591/SchNet-A-deep-learning-architecture-for-molecules?redirectedFrom=fulltext) architecture. To train the model we will use the [SchNetPack](https://pubs.acs.org/doi/10.1021/acs.jctc.8b00908) package.

The [SchNet](https://pubs.aip.org/aip/jcp/article-abstract/148/24/241722/962591/SchNet-A-deep-learning-architecture-for-molecules?redirectedFrom=fulltext) model is a convolutional neural network (CNN) with continuous filters. It is very similar to the common CNNs used, for instance, in image recognition. In contrast to images, molecules cannot be described by a discrete matrix of pixels, and thus a continuous filter is applied instead of a discrete one. The SchNet model is composed of two NNs. The main one is responsible for the prediction of the given property itself (its input is the vector of atomic numbers of the given structure). The second one generates the filter for the convolution (its input is the positions of the individual atoms of the given structure). The main NN is divided into three main blocks. The first is the embedding layer, which creates the feature vectors for the individual atoms within the structure (the whole structure is therefore represented by a 2D matrix of shape (number of atom-wise features, number of atoms)). The second part of the model is a series of interaction blocks. Each interaction block contains one convolutional layer. These blocks are responsible for creating the representation of the system. The last part is the output module, which predicts the given property of the structure (in our case the energy). If you are interested in the SchNet architecture in more detail, see the original [paper](https://pubs.aip.org/aip/jcp/article-abstract/148/24/241722/962591/SchNet-A-deep-learning-architecture-for-molecules?redirectedFrom=fulltext).
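To make the idea of a continuous filter more concrete, the following pure-Python sketch performs one continuous-filter convolution. It is a strongly simplified illustration and not SchNetPack code: the filter-generating network is reduced to a single linear map (the hypothetical filter_weights) acting on a Gaussian expansion of the interatomic distance, and a cosine cutoff smoothly switches the filter off at the cutoff radius. All names and numbers are assumptions made for this example.

```python
import math


def gaussian_expansion(distance, cutoff, num_gaussians):
    """Expand a distance on Gaussians evenly spaced between 0 and cutoff."""
    spacing = cutoff / (num_gaussians - 1)
    return [
        math.exp(-0.5 * ((distance - k * spacing) / spacing) ** 2)
        for k in range(num_gaussians)
    ]


def cosine_cutoff(distance, cutoff):
    """Smoothly decay from 1 at distance 0 to 0 at the cutoff."""
    if distance >= cutoff:
        return 0.0
    return 0.5 * (math.cos(math.pi * distance / cutoff) + 1.0)


def continuous_filter_convolution(features, positions, filter_weights, cutoff):
    """Update x_i as the sum over neighbours j of x_j * W(r_ij), feature-wise.

    filter_weights has shape (num_gaussians, num_features) and stands in
    for the filter-generating network.
    """
    num_gaussians = len(filter_weights)
    updated = []
    for i, pos_i in enumerate(positions):
        new_features = [0.0] * len(features[i])
        for j, pos_j in enumerate(positions):
            if i == j:
                continue
            distance = math.dist(pos_i, pos_j)
            expansion = gaussian_expansion(distance, cutoff, num_gaussians)
            scale = cosine_cutoff(distance, cutoff)
            for f in range(len(new_features)):
                # The filter value depends continuously on the distance.
                weight = scale * sum(
                    g * filter_weights[k][f] for k, g in enumerate(expansion)
                )
                new_features[f] += features[j][f] * weight
        updated.append(new_features)
    return updated


# Toy system: three atoms with two features each (all values made up).
features = [[1.0, 0.5], [0.2, 0.1], [0.3, 0.9]]
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 20.0, 0.0)]
filter_weights = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.1, 0.1]]
updated = continuous_filter_convolution(features, positions, filter_weights, cutoff=5.0)
print(updated)
```

In this toy system the third atom lies outside the cutoff of the other two, so it neither contributes to nor receives anything from the convolution. In the real architecture the filter values are produced by a trained network rather than a fixed matrix.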

In practice, the SchNet model can be trained using the Python script spk_run.py provided with the SchNetPack package. Here we will use an additional bash script, train.sh (already prepared for you), to do so. The script runs spk_run.py with the specified arguments. spk_run.py takes the following arguments:
- positional arguments:
    - mode: str=train
    - architecture: str=schnet
    - dataset: str=custom 
    - datapath: str, path to the ASE database
    - modelpath: str, path to the model to be created
- optional arguments:
    - --help
    - --cuda = use Nvidia GPU for training
    - --parallel = parallel training on more GPUs
    - --seed = random seed for torch and numpy
    - --overwrite = remove previous model directory
    - --split_path = path to your own npz file with data split
    - --split = train, validation split; the rest of the dataset is used for testing
    - --max_epochs = maximum number of training epochs (default 5000)
    - --max_steps = maximum number of training steps (default None)
    - --lr = learning rate (default 0.0001)
    - --lr_decay = learning rate decay (default 0.8)
    - --lr_min = minimal learning rate (default 1e-06)
    - --lr_patience = epochs without improvement before reducing the learning rate (default 25)
    - --logger = logger (default csv)
    - --log_every_n_epochs = log metrics every given number of epochs (default 1)
    - --check_point_interval = store checkpoint every n epochs (default 1)
    - --keep_n_checkpoints = number of checkpoints that will be stored (default 3)
    - --environment_provider_device = It is recommended to use CPU (default cpu)
    - --features = number of atomwise features (default 128)
    - --interactions = number of interaction blocks (default 3)
    - --cutoff_function = default cosine
    - --num_gaussians = number of gaussians to expand distances (default 50)
    - --normalize_filter = normalize convolution filters by number of neighbors
    - --property = property to be predicted (default energy)
    - --cutoff = cutoff (default 10.0)
    - --batch_size = batch size (default 100)
    - --environment_provider = environment provider for the dataset (default simple)
    - --derivative = derivative of the property to be predicted (default None)
    - --negative_dr = multiply derivatives by -1 for training (when forces are provided instead of gradients, default False)
    - --force = name of force property in the dataset (alias for the derivative + negative_dr, default None)
    - --contributions = contributions of dataset property to be predicted (default None)
    - --stress = train on stress tensor if not None (default None)
    - --aggregation_mode = mode for aggregating atomic properties (default sum)
    - --output_module = select output module for the selected property (default atomwise)
    - --rho = tradeoff weights between property and derivative (default {})

In our case, we will train only a small model as an example of the model training process in SchNetPack. The complete models for the BuRNN simulation are provided in the directory "models". In our example, we will use our database (argument datapath). We need to specify the directory in which the resulting model will be stored (argument modelpath). The dataset has to be split into training, validation and testing parts. In our case, we will use a random split (80 % training, 10 % validation and 10 % testing data). The sizes of the training and validation sets are specified, in this order, in the argument split. The rest of the dataset is used for the final testing of the model. The model will be trained for 2 training epochs. We will use 32 atom-wise features and 1 interaction block to obtain a very small model which can be trained very quickly. Specify the necessary arguments in the train.sh script and run it from the command line. The resulting model will be stored in the specified directory. In addition, the directory will contain a log file (log.csv) and a file with the arguments used for the model training (args.json). The script will also run spk_run.py in evaluation mode. The evaluation is done on the test data, and its result is stored in the model directory (evaluation.txt file).
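The contents of train.sh are not reproduced here, so the following Python sketch only shows how a training command matching the description above could be assembled from the arguments listed earlier. It is an illustration under assumptions: the helper name and the model directory are hypothetical, the positional arguments are taken in the order in which they are listed above, and --split is assumed to expect the absolute numbers of training and validation structures rather than fractions.

```python
import shlex


def build_train_command(db_path, model_dir, n_structures,
                        train_percent=80, val_percent=10,
                        max_epochs=2, features=32, interactions=1):
    """Assemble an spk_run.py training command as a list of arguments."""
    # Assumption: --split takes absolute counts, so convert the percentages.
    n_train = n_structures * train_percent // 100
    n_val = n_structures * val_percent // 100
    return [
        "spk_run.py", "train", "schnet", "custom", db_path, model_dir,
        "--split", str(n_train), str(n_val),
        "--max_epochs", str(max_epochs),
        "--features", str(features),
        "--interactions", str(interactions),
        "--property", "energy",
        "--force", "forces",  # name of the force property in our database
        "--overwrite",
    ]


# n_structures should be the number of structures reported by len(db).
command = build_train_command("meoh_trial.db", "meoh_model", n_structures=667)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True), or the printed line can be copied into a shell script.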

The hardest part of the model training process is the proper selection of the training hyperparameters. In practice, this is usually done, at least partially, by trial and error. The general recommendation is to start with the default values of the SchNet architecture. If the resulting model does not fulfill your requirements, you can try to modify the individual hyperparameters. The number of atom-wise features (features) and the number of interaction blocks (interactions) are the most important ones. These hyperparameters determine the number of parameters of the model and thus its complexity. Therefore, one should start by modifying those two hyperparameters.
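A trial-and-error search over the two most important hyperparameters can be organized as a small grid around the default values (128 features, 3 interaction blocks). The sketch below only enumerates the combinations and assigns each one its own model directory. The helper name, the candidate values and the directory naming are hypothetical, and each entry would still have to be trained and evaluated separately.

```python
from itertools import product


def hyperparameter_grid(features_options, interactions_options):
    """Return one settings dictionary per (features, interactions) pair."""
    return [
        {
            "features": features,
            "interactions": interactions,
            # Separate directory per run so that models are not overwritten.
            "modelpath": f"model_f{features}_i{interactions}",
        }
        for features, interactions in product(features_options, interactions_options)
    ]


# Hypothetical candidate values centred on the SchNetPack defaults.
grid = hyperparameter_grid([64, 128, 256], [2, 3, 4])
for settings in grid:
    print(settings)
```

Comparing the evaluation results of the individual runs then indicates whether a larger or a smaller model is needed.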