## Part 4 — Advanced Training

In this tutorial we will cover some more advanced parameters for a realistic training of a potential.

In particular, we will cover:
* Training with forces
* Regularization
* Decaying learning rate
* Validating during training
* Computing Gvectors during training


In [1]:
# We start with the standard setup
import os

# Specify the absolute path to PANNA (or leave this relative path)
panna_dir = os.path.abspath('../..')

# In case you need to mount the drive
# from google.colab import drive
# drive.mount('/content/drive')
# panna_dir = '/content/drive/MyDrive/your_path_to_panna'

# Cleaning up path for command line
# Escape spaces so the path can be used in shell commands
panna_cmdir = panna_dir.replace(' ', '\\ ')

# Check if PANNA is installed, otherwise install it
try:
  import panna
  print("PANNA is installed correctly")
except ModuleNotFoundError:
  print("PANNA not found, attempting to install")
  # Curly braces pass the Python variable to the shell command
  !pip install {panna_cmdir}


PANNA is installed correctly


### 4.1 — Training with forces

In Part 2 we trained the network to learn the energy of the reference configurations.
However, we can also compute the force on each atom by differentiating the energy with respect to its position, and if we have a ground truth for these forces, we can use this information to improve our training.

In order to compute the derivative with respect to position, we need to know the derivative with respect to each component of each descriptor $G_j$:
$$F_i=-\frac{\partial E}{\partial x_i}=-\sum_j \frac{\partial E}{\partial G_j}\frac{\partial G_j}{\partial x_i}.$$
These terms will be added to the ``tfr`` in the data creation pipeline by including the flag ``include_derivatives = True`` in the ``[SYMMETRY_FUNCTION]`` card of the descriptor calculator and in the ``[CONTENT_INFORMATION]`` card of the packer.
Additionally, we can store either all the possible derivatives or only the nonzero elements (in a large cell, each atom only affects the descriptors of the few atoms inside a cutoff-sized sphere around itself). To switch from storing all data to the sparse format, the flag ``sparse_derivatives = True`` can be included along with the previous one.
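The chain rule for the forces can be checked on a toy model. The sketch below is not PANNA code: the two descriptors and the energy function are hypothetical stand-ins chosen only because their derivatives are easy to write by hand, and the force is taken as minus the energy gradient. It compares the chain-rule forces with a finite-difference estimate.

```python
import math

def descriptors(x):
    """Toy descriptors G_j(x); hypothetical stand-ins for real symmetry functions."""
    return [sum(xi * xi for xi in x), sum(math.sin(xi) for xi in x)]

def descriptor_derivatives(x):
    """Return dG_dx[j][i] = dG_j/dx_i for the toy descriptors above."""
    return [[2.0 * xi for xi in x], [math.cos(xi) for xi in x]]

def energy(G):
    """Toy stand-in for the network: maps descriptors to an energy."""
    return 0.5 * G[0] ** 2 + 3.0 * G[1]

def energy_gradient(G):
    """Return dE/dG_j for the toy energy above."""
    return [G[0], 3.0]

def forces_chain_rule(x):
    """F_i = -sum_j dE/dG_j * dG_j/dx_i."""
    G = descriptors(x)
    dE_dG = energy_gradient(G)
    dG_dx = descriptor_derivatives(x)
    return [-sum(dE_dG[j] * dG_dx[j][i] for j in range(len(G)))
            for i in range(len(x))]

def forces_finite_difference(x, h=1e-6):
    """Central-difference estimate of -dE/dx_i, used only as a check."""
    result = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        xm = list(x)
        xm[i] -= h
        result.append(-(energy(descriptors(xp)) - energy(descriptors(xm))) / (2 * h))
    return result

positions = [0.3, -1.2, 0.8]  # arbitrary 1D coordinates
print(forces_chain_rule(positions))
print(forces_finite_difference(positions))
```

The two printed lists should agree to within the finite-difference error. In a real calculation the $\partial G_j/\partial x_i$ terms are the derivatives stored in the ``tfr`` files, while $\partial E/\partial G_j$ comes from backpropagation through the network.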

For this tutorial, we have already created ``tfr`` files of a few configurations for you, with derivatives stored in a dense format. Please note that information about the derivative can take up a lot of space; for this reason we have limited this dataset to only a few water configurations. While this is not enough for any meaningful training, it is sufficient to showcase this training option.
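To get a feeling for why derivatives take up so much space, we can count entries. This is a rough counting argument, not a description of PANNA's actual file layout: the function names, the number of atoms and the number of neighbors are hypothetical, and the index arrays a sparse format also needs are ignored.

```python
def dense_derivative_entries(n_atoms, g_size):
    """Each of the n_atoms * g_size descriptor components is differentiated
    with respect to all 3 * n_atoms Cartesian coordinates."""
    return n_atoms * g_size * 3 * n_atoms

def sparse_derivative_entries(n_atoms, g_size, n_neighbors):
    """Only the atom itself and its neighbors inside the cutoff contribute."""
    return n_atoms * g_size * 3 * (n_neighbors + 1)

n_atoms = 192      # hypothetical cell size
g_size = 128       # descriptor size used for the water example in this tutorial
n_neighbors = 40   # hypothetical number of atoms inside the cutoff

dense = dense_derivative_entries(n_atoms, g_size)
sparse = sparse_derivative_entries(n_atoms, g_size, n_neighbors)
print("dense entries: ", dense)
print("sparse entries:", sparse)
print("ratio dense/sparse: %.1f" % (dense / sparse))
```

The dense count grows quadratically with the number of atoms, while the sparse count grows linearly, which is why the sparse format pays off for large cells.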

Once we have the derivatives in the data, to use them (and reference forces) in training, it is sufficient to add the keyword ``forces_cost`` to the training parameters, with a value greater than zero.
Let's look at a sample input file, then run this short training.

In [12]:
!cat {panna_cmdir+'/doc/tutorial/input_files/mytrain_force.ini'}

[IO_INFORMATION]
data_dir = ./tutorial_data/train_force
train_dir = ./my_train_force
log_frequency = 10
save_checkpoint_steps = 50

[DATA_INFORMATION]
atomic_sequence = H, O
output_offset = -13.49, -562.1

[TRAINING_PARAMETERS]
batch_size = 10
learning_rate = 0.001
steps_per_epoch = 50
max_epochs = 2
forces_cost = 0.1

[DEFAULT_NETWORK]
g_size = 128
architecture = 128:32:1
trainable = 1:1:1


In [13]:
!cd {panna_cmdir+'/doc/tutorial/'}; python {panna_cmdir+'/src/panna/train.py'} --config ./input_files/mytrain_force.ini

2023-04-25 11:02:00.173646: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-04-25 11:02:00.173705: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
INFO - 
    ____   _    _   _ _   _    _           
   |  _ \ / \  | \ | | \ | |  / \     
   | |_) / _ \ |  \| |  \| | / _ \     
   |  __/ ___ \| |\  | |\  |/ ___ \    
   |_| /_/   \_\_| \_|_| \_/_/   \_\ 

 Properties from Artificial Neural Network Architectures

INFO - reading ./input_files/mytrain_force.ini
INFO - Found a default network!
INFO - This network size will be used as default for all species unless

As we can see, the MAE over the forces is also reported during training (as well as in the metrics file and in TensorBoard).

Please note that training with forces can be considerably slower than training with energy only. However, it allows us to produce considerably more accurate models when forces are important, e.g. for use as an interatomic potential.

### 4.2 — Other training options

We summarize here other options that can be useful during a real training.

#### Regularization
As in many neural network applications, it can be beneficial to add a small contribution to the loss function that keeps weights from growing. This is typically called an L1 regularization (if it employs the absolute value of the weights) or an L2 regularization (if it employs their squares).

In PANNA, we can introduce L1 regularization with the keywords ``wscale_l1`` for weights and ``bscale_l1`` for biases (and similarly for L2) in the training parameters, specifying the small coefficient that multiplies this correction.
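As a minimal sketch of what these penalties look like, the snippet below computes L1 and L2 terms for a handful of made-up weights. The function names are hypothetical, and the exact normalization PANNA applies to these terms is not shown in this tutorial, so this only illustrates the general idea.

```python
def l1_penalty(values, scale):
    """scale * sum of absolute values."""
    return scale * sum(abs(v) for v in values)

def l2_penalty(values, scale):
    """scale * sum of squares."""
    return scale * sum(v * v for v in values)

weights = [0.5, -1.5, 2.0]  # made-up network weights
scale = 1e-4                # plays the role of a keyword like wscale_l1

print("L1 term:", l1_penalty(weights, scale))
print("L2 term:", l2_penalty(weights, scale))
```

The penalty is added to the data loss, so larger weights cost more. L2 penalizes large weights more strongly than L1, while L1 tends to push small weights toward zero.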

#### Decaying learning rate
When training a model, and especially when finalizing it, it can be useful to gradually reduce the learning rate. In PANNA, we can employ an exponentially decreasing learning rate following the equation:
$$\alpha(t)=\alpha(0) r^{t/\tau}$$
where $t$ represents the training step.
To use this, we need to set ``learning_rate_constant`` to ``False`` in the training parameters. We can then set the value of $\alpha(0)$ as we would normally set the ``learning_rate``, $r$ with the keyword ``learning_rate_decay_factor`` and $\tau$ with the keyword ``learning_rate_decay_step``.
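The decay equation is easy to evaluate by hand to see what a given choice of parameters implies. The helper below is a hypothetical illustration of the formula, not a PANNA function, and the parameter values are examples only.

```python
def decayed_learning_rate(step, alpha0, r, tau):
    """alpha(t) = alpha(0) * r ** (t / tau)."""
    return alpha0 * r ** (step / tau)

alpha0 = 0.01  # initial learning rate (illustrative)
r = 0.1        # decay factor (illustrative)
tau = 200      # decay step (illustrative)

for step in (0, 200, 400, 1000):
    print(step, decayed_learning_rate(step, alpha0, r, tau))
```

With these values the learning rate drops by one order of magnitude every 200 steps.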

#### Metrics
If we want to track a metric different from the MAE (or in addition to it) from the command line during training, we can use the keyword ``metrics`` in the ``[IO_INFORMATION]`` card, followed by a comma-separated list of the following values: ``MAE``, ``RMSE`` or ``loss``. The ``loss`` option reports all components that contribute to the loss driving the training.
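For reference, the two error metrics differ only in how deviations are averaged. The sketch below uses made-up predicted and reference energies, and the function names are hypothetical. How PANNA normalizes the reported values (for example per atom) is not covered here.

```python
import math

def mae(predicted, reference):
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

def rmse(predicted, reference):
    """Root mean squared error."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference))

predicted = [1.0, 2.0, 3.0]  # made-up model outputs
reference = [1.0, 2.0, 6.0]  # made-up ground truth

print("MAE: ", mae(predicted, reference))
print("RMSE:", rmse(predicted, reference))
```

Because the errors are squared before averaging, the RMSE is more sensitive to a few large outliers than the MAE, so tracking both can reveal whether the error is dominated by a handful of badly predicted configurations.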

#### Validation during training
At the end of each epoch (as defined in the training input, not necessarily a full pass over the dataset), we can ask PANNA to automatically evaluate the model on a small set of validation examples. While this set is typically smaller than the full training set, it can give us an idea of whether our model is overfitting to the training set or whether the error is also decreasing on unseen examples. To enable this feature, a new card called ``[VALIDATION_OPTIONS]`` should be added, including at least the keyword ``data_dir`` indicating the location of the validation set.

---

We can now look at a more complete training input file to see all of these options in place. To keep the computational cost at a minimum, we go back to a simple training without forces and modify the original input file with the new options.

In [9]:
!cat {panna_cmdir+'/doc/tutorial/input_files/my_adv_train.ini'}

[IO_INFORMATION]
data_dir = ./tutorial_data/train
train_dir = ./my_adv_train
log_frequency = 100
save_checkpoint_steps = 500
metrics = RMSE,loss

[DATA_INFORMATION]
atomic_sequence = H, C, O, N
output_offset = -13.62, -1029.41, -2041.84, -1484.87

[TRAINING_PARAMETERS]
batch_size = 20
steps_per_epoch = 100
max_epochs = 10
wscale_l1 = 1e-4
bscale_l1 = 1e-4
learning_rate_constant = False
learning_rate = 0.01
learning_rate_decay_factor = 0.1
learning_rate_decay_step = 200

[DEFAULT_NETWORK]
g_size = 384
architecture = 128:32:1
trainable = 1:1:1

[VALIDATION_OPTIONS]
data_dir = ./tutorial_data/validate


We have set an L1 regularization coefficient of 0.0001 for both weights and biases. The learning rate decays by a factor of 0.1 every 200 steps, i.e. from 0.01 to 1e-7 over the 1000 steps of this training. We track the RMSE and all loss components, and we validate on the set used for validation in the previous tutorial.

We can now run this small training.

In [11]:
!cd {panna_cmdir+'/doc/tutorial/'}; python {panna_cmdir+'/src/panna/train.py'} --config ./input_files/my_adv_train.ini

2023-04-24 17:58:12.628080: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-04-24 17:58:12.628126: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
INFO - 
    ____   _    _   _ _   _    _           
   |  _ \ / \  | \ | | \ | |  / \     
   | |_) / _ \ |  \| |  \| | / _ \     
   |  __/ ___ \| |\  | |\  |/ ___ \    
   |_| /_/   \_\_| \_|_| \_/_/   \_\ 

 Properties from Artificial Neural Network Architectures

INFO - reading ./input_files/my_adv_train.ini
INFO - Found a default network!
INFO - This network size will be used as default for all species unless 

We can see that now the RMSE is reported, and the loss including the regularization loss. In addition, we see that at the end of each epoch the same values are reported for the validation set. All this is very important to monitor complex training cases and find the best hyperparameters.

### 4.3 — Computing Gvectors during training

Finally, we want to show a training option that can be very useful to iterate parameters quickly, or in cases of very large datasets.
As we have mentioned, precomputing the derivatives of the descriptors can take up a large amount of space and create a large dataset that needs to be loaded into memory while training, possibly a number of times. While this is often worth the savings in computation time (since examples are used many times and the descriptors are always the same), it can be a problem in a few cases: when the training set is very large, when the machine used for training is I/O limited, or when some quick training needs to be done to test descriptor parameters and we do not want to create multiple large copies of the descriptors for single use.

For all these cases, it is now possible in PANNA to compute the descriptors from the example files while we are training the network. This is considerably more computationally expensive, but feasible on latest-generation GPUs. Please note that the first training steps can be especially slow, because the code needs to be optimized for different inputs. As the training progresses, you will typically see a speedup.

To enable this option, we need to set ``input_format`` to ``example`` in the ``[IO_INFORMATION]`` card (the default is ``tfr``). We also need to specify the parameters of the descriptors, so we use the keyword ``gvect_ini`` and pass the same input file that we prepared for the precomputation. Now we can simply indicate the ``data_dir`` where the ``.example`` files are located, and we can start the training.

Let us look at a sample training file that reuses the data and parameters we used in the rest of the tutorial (to keep the training light, we will not use forces in this example, although this is not the most common use case), then run a short training.

In [19]:
!cat {panna_cmdir+'/doc/tutorial/input_files/mytrain_fromex.ini'}

[IO_INFORMATION]
data_dir = ./tutorial_data/simulations
train_dir = ./my_train_fromex
input_format = example
gvect_ini = ./input_files/mygvect_sample.ini
log_frequency = 10
save_checkpoint_steps = 100

[DATA_INFORMATION]
atomic_sequence = H, C, O, N
output_offset = -13.62, -1029.41, -2041.84, -1484.87

[TRAINING_PARAMETERS]
batch_size = 5
learning_rate = 0.01
steps_per_epoch = 20
max_epochs = 5

[DEFAULT_NETWORK]
g_size = 384
architecture = 128:32:1
trainable = 1:1:1


In [20]:
!cd {panna_cmdir+'/doc/tutorial/'}; python {panna_cmdir+'/src/panna/train.py'} --config ./input_files/mytrain_fromex.ini

2023-04-25 11:11:42.626440: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-04-25 11:11:42.626521: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
INFO - 
    ____   _    _   _ _   _    _           
   |  _ \ / \  | \ | | \ | |  / \     
   | |_) / _ \ |  \| |  \| | / _ \     
   |  __/ ___ \| |\  | |\  |/ ___ \    
   |_| /_/   \_\_| \_|_| \_/_/   \_\ 

 Properties from Artificial Neural Network Architectures

INFO - reading ./input_files/mytrain_fromex.ini
INFO - Found a default network!
INFO - This network size will be used as default for all species unles

In [21]:
# Run this to cleanup the tutorial directory
!cd {panna_cmdir+'/doc/tutorial'}; rm -rf my_train_force my_adv_train tf.log my_train_fromex