# How to use the ASE database to manage calculations on an HPC system

In this hands-on session, you will use the ASE database to manage calculations in an HPC environment. The material is based on the [ASE tutorial for surface adsorption](https://wiki.fysik.dtu.dk/ase/tutorials/db/db.html).

Further information can be found on the [ASE database main page](https://wiki.fysik.dtu.dk/ase/ase/db/db.html).

Prerequisites:

* ase > 3.19
* numpy
* bash/powershell

For this tutorial, you only need to run the cells in this notebook. On an HPC system, run the commands from the cells that start with `!` or `%` in a bash shell (terminal), without the leading `!` or `%`.

# Creating surfaces and storing in the database

As a test example, you will create a database with three low-Miller-index surfaces of a metal (or of several metals). To keep things simple, the `EMT` calculator implemented in ASE will be used. However, the workflow works for any calculator implemented in ASE.

Combining ASE/Python with bash lets you automate much more than what is shown here.

In [None]:
# load the necessary modules
from ase.build import fcc100,fcc110,fcc111
from ase.calculators.emt import EMT
from ase.db import connect

In [None]:
# give a name to your database
dbname = 'surfaces.db'
db = connect(dbname,append=False) 
# append=False overwrites an existing database. To avoid that, use append=True

In [None]:
# make some key-value pairs to help you keep track of the calculations
# (real booleans, so they match the True values written later by worker.sh and run.py)
kvp_hpc = {
    'queued': False,
    'started': False,
    'converged': False
}

In [None]:
# build your calculator
calc = EMT()

In [None]:
# create a dictionary to automate the creation of the surfaces
surfaces = {
    'fcc100':fcc100,
    'fcc110':fcc110,
    'fcc111':fcc111,    
}

In [None]:
# choose one (or more) of the following elements, and build the surfaces:
#for symb in ['Al', 'Ni', 'Cu', 'Pd', 'Ag', 'Pt', 'Au']:
for symb in ['Au']:
    for surf in surfaces.keys():
        for n in [1, 2, 3]: # make surfaces with 1, 2 and 3 layers
            atoms = surfaces[surf](symb,(1,1,n))
            atoms.calc = calc # attach the calculator
            db.write(atoms, symbol=symb, surface=surf, layers=n, **kvp_hpc) # write into the database

Now let's look at the database using the `ase db` command line interface. The help command is your best friend here:

In [None]:
! ase db --help

In [None]:
# show all columns (-c ++) of all rows (-L 0 removes the default row limit)
!ase db {dbname} -c ++ -L 0

The idea of using a database is to keep your structures organized. It is good practice to keep your folders organized in the same way. Here is one suggestion of three main folders for a project:

* **The run folder:** Here you submit your calculations. Making subfolders will help you find structures when your database grows. And trust me, it will grow.

* **The scripts folder:** Keep all scripts that you wrote for the project here. It can be useful later when you share your workflow with someone.

* **The backup folder:** Once in a while, copy your database to the backup folder. If the database file becomes corrupted for whatever reason, you still have a backup.
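
The backup step can be scripted. The following is a minimal sketch using only the Python standard library. The helper name `backup_database` and the time-stamped file name are assumptions made for illustration, and the demonstration runs on a dummy file in a temporary folder:

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_database(db_path, backup_dir):
    """Copy db_path into backup_dir under a time-stamped name; return the new path."""
    db_path = Path(db_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    target = backup_dir / f'{db_path.stem}-{stamp}{db_path.suffix}'
    shutil.copy2(db_path, target)  # copy2 also keeps the modification time
    return target


# Demonstration on a dummy file in a temporary directory.
workdir = Path(tempfile.mkdtemp())
dummy_db = workdir / 'surfaces.db'
dummy_db.write_bytes(b'placeholder content')
copy_path = backup_database(dummy_db, workdir / 'backup')
print(copy_path.name)
```

In a real project you would probably call something like `backup_database('surfaces.db', 'backup')` from the project folder, preferably while no job is writing to the database.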

Let us make the run and backup folders. The scripts folder was created beforehand and contains the same code as this notebook.

In [None]:
!mkdir run backup

In [None]:
%cd run

Now let's make the folders that will be used in our calculations. Each folder is named after the row `id` that is assigned automatically when a new row is written to the database. The cell below contains the same content as the file `prepare_folders.py` from the `scripts` folder. Run the next cell to create the subfolders:

In [None]:
# %load ../scripts/prepare_folders.py
import os
from ase.db import connect

db = connect('../' + dbname)

prevdir = os.getcwd()

# an empty query selects every row; any ase db selection string works here
query = ''

for row in db.select(query):
    folder = str(row.id)
    # create the subfolder
    try:
        os.mkdir(folder)
    except FileExistsError:
        print(f'Keeping folder {folder}')
    else:
        print(f'Creating folder {folder}')
    os.chdir(folder)
    # make a symbolic link to your sbatch submission file
    try:
        os.symlink('../run.sh', 'run.sh')
    except OSError:
        pass  # the link already exists, or symlinks are not permitted
    # write a file with the row id
    with open('db_id', 'w') as out:
        out.write(folder)
    os.chdir(prevdir)

Copy the files needed to submit the calculations to a queue:

In [None]:
%cp ../scripts/worker.sh ../scripts/run.sh ../scripts/run.py .

Let us look at these files:

`worker.sh`

This script walks through your subfolders and performs some actions in each of them. The action can be submitting a calculation or running a script that analyzes the results, for example. For this tutorial, we will enter each subfolder and run the `run.py` script:

```bash
#!/bin/bash

# different ways to walk through the subfolders
# (only first-level folders, sorted numerically by name)
calc_id="$(find . -mindepth 1 -maxdepth 1 -type d | awk -F "/" '{print $2}' | sort -n)"
#calc_id="$(cat list_id.txt)"
#calc_id="$(seq 1 9)"
#calc_id="3"

home_dir=$(pwd)

# Be careful here. Double-check that this is the database you want to connect to.
# The relative path is resolved from inside each subfolder; a full path is preferred.
db="../../surfaces.db"

for i in $calc_id ; do
    work_id=${i}
    cd "$work_id" || continue
    echo ========
    echo $work_id
    echo ========
    # submit the calculation
    #sbatch --job-name=$work_id.surfaces run.sh > JobID ; more JobID
    # update the status in the database
    ase db $db id=$work_id -k queued=True
    # run a script in the subfolder
    python ../run.py $db
    cd "$home_dir"
done
```
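
The folder walk at the top of `worker.sh` can also be written in Python, which makes the numerical ordering explicit (so that `10` comes after `2`). This is a minimal sketch using only the standard library. The helper name `numeric_subfolders` is an assumption for illustration, and the demonstration uses a temporary folder instead of the real `run` folder:

```python
import tempfile
from pathlib import Path


def numeric_subfolders(run_dir):
    """Return the names of all-digit subfolders of run_dir in numerical order."""
    names = [p.name for p in Path(run_dir).iterdir()
             if p.is_dir() and p.name.isdigit()]
    return sorted(names, key=int)


# Demonstration with a temporary run folder.
run_dir = Path(tempfile.mkdtemp())
for name in ['10', '2', '1', 'notes']:
    (run_dir / name).mkdir()
(run_dir / '3').write_text('a file, not a folder')

folders = numeric_subfolders(run_dir)
print(folders)
```

Folders whose names are not row ids (such as `notes` above) and plain files are skipped.
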
***
`run.sh`

This is an example of an sbatch submission file for [Kebnekaise@HPC2N](https://www.hpc2n.umu.se/resources/hardware/kebnekaise). Usually, the `worker.sh` script submits this file to the queue from each subfolder.

```bash
#!/bin/bash -l
# The -l above is required to get the full environment with modules

#SBATCH -A snic2020-1-41
#SBATCH --mail-type=ALL
#SBATCH -t 1-00:00:00

#SBATCH -J id_xx
#SBATCH -o dat.out

# number of tasks (cores)
#SBATCH -n 28

# NEEDED MODULES FOR THE CALCULATION
module purge
source ~/calcs/env/load_vasp_ase

# RUN THE PYTHON CODE (run.py expects the database path as its first argument)
python ../run.py ../../surfaces.db
```

***
`run.py`

This is an example of how to use ASE and its database module to run calculations.

```python
import numpy as np
from ase.db import connect
from sys import argv

# path to the database, passed as the first command-line argument
dbname = argv[1]
# row id stored in the db_id file of this subfolder
db_id = int(np.loadtxt('db_id'))

db = connect(dbname)

# get the atoms with the calculator attached
atoms = db.get_atoms(db_id, attach_calculator=True)

# update started key-value pair in the db
db.update(db_id, started=True)

# run the calculation using ase (a single-point energy in this example)
atoms.get_potential_energy()

print('===========================================')
print('Calculation completed')
print('===========================================')

# update the database with the results and mark the row as converged
db.update(db_id, atoms=atoms, converged=True)
```

Now let's pretend that you will submit your calculations to a queue. You will be on a login node, and the only thing you need to do is execute `worker.sh` in the run folder. The events in chronological order would be:

1. Change directory to a subfolder.
1. Submit the calculation to the queuing system.
1. Update the status in the database (queued=True).
1. Change to the parent directory.
1. Change directory to another subfolder.
1. Repeat steps 2, 3, and 4.
1. ...

In the `run.py` script, two other key-value pairs will be updated in the database: `started` right before running the calculations, and `converged` once the calculation is finished.
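
Taken together, the three flags describe where a calculation stands. The helper below is a small sketch of that logic in plain Python. The function name `job_status` and the status labels are made up for illustration; they are not part of ASE.

```python
def job_status(flags):
    """Map the queued/started/converged flags to a single status label."""
    if flags.get('converged'):
        return 'finished'
    if flags.get('started'):
        return 'started'  # still running, or stopped before finishing
    if flags.get('queued'):
        return 'queued'
    return 'not submitted'


# The four stages a row passes through in this workflow.
examples = [
    {'queued': False, 'started': False, 'converged': False},
    {'queued': True, 'started': False, 'converged': False},
    {'queued': True, 'started': True, 'converged': False},
    {'queued': True, 'started': True, 'converged': True},
]
for flags in examples:
    print(flags, '->', job_status(flags))
```

For a row returned by `db.select()`, passing `row.key_value_pairs` to such a helper should work, since that attribute holds the key-value pairs as a dictionary.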

For this tutorial, we skip the submission to the queue. Instead, we run the `run.py` script in each subfolder and update the results in the database. When you run the next cell, you will see in chronological order:
1. The name of the folder that was entered.
1. A message saying that one key-value pair was updated in the database.
1. A message after the calculation has been completed.

Let's run!

In [None]:
!bash worker.sh

And that's it. Your database now holds the calculated results for further analysis. To finish the tutorial, let's look at the database once again:

In [None]:
# leave the run folder
%cd ..

# quick overview of the database
!ase db surfaces.db -c ++ -L 0