## Connect to the Advanced Research Computing (ARC)

Open a terminal and connect remotely to ARCUS with ssh (secure shell), replacing `username` with your own username:

`% ssh -X username@oscgate.arc.ox.ac.uk`

When prompted, enter your password and hit Enter.
Now, you need to connect to the arcus-htc cluster, so in your command line type:

`% ssh -X arcus-htc`

If you get a message like this one:

The authenticity of host 'arcus-htc (10.137.128.21)' can't be established.\
RSA key fingerprint is d1:83:21:b1:f2:bd:8f:e7:5d:cd:74:d1:73:b9:70:7a.\
Are you sure you want to continue connecting (yes/no)?

Type `yes` at that prompt and press Enter.

You can now check the path of the directory you are in by typing:

`% pwd`

It should be your home directory (/home/username). This is where you will work from, and you will need to create some subdirectories to organise your files. First create a directory for your setup and move into it, as it will be your first working directory:

`% mkdir setup`\
`% cd setup`

Now, you need to load GROMACS, which is the simulation software package that you will utilise:

`% module load gpu/gromacs/2020.1`

Check that it was successfully loaded, by typing:

`% gmx help commands`

This will print basic information for every built-in GROMACS command. If you need more detailed information on a command in particular, you can type:

`% gmx [command] -h`

## Specifics to GROMACS

GROMACS (http://www.gromacs.org/) is a versatile package to perform molecular dynamics, i.e. to integrate Newton's equations of motion for systems with hundreds to millions of particles.

- Coordinate files have the extension .gro and the default name is conf.gro.
- The topology file (default name topol.top) contains all the information about which atoms are bonded to which, what force-field parameters are applied, etc.
- The trajectory files have the extensions .xtc and .trr. The former holds coordinates at reduced precision and contains no velocity information, so it occupies less disk space. However, you will need velocities if you want to continue a simulation.
- The .edr file contains the energy information from the trajectory.
- The .mdp file contains the parameters that were used to set up the actual simulation: settings related to temperature, pressure, how the electrostatics are calculated, etc. Although we will not use this today, the file is provided so you can see how this trajectory was made.
- The .ndx file allows you to specify atoms or groups of atoms for use in analysis, restraints, etc. It is optional.
- The .tpr file is a binary file that contains all the information needed to perform the actual run (this allows GROMACS to do lots of self-consistency checks to minimize user errors).

## Molecular Dynamics Simulation of Protein

### Protein System Setup

In this section, we will obtain our protein coordinates and perform some routine Molecular Dynamics calculations on them.

For this tutorial, we will use the HIV-1 protease structure (1HSG). It is a homodimer with two chains of 99 residues each. The .pdb file of the protein can be obtained from the Protein Data Bank (https://www.rcsb.org/), but it is probably easier for now to just copy it from the shared directory to your working directory. In fact, most of the files you need can be found in this shared directory, so if you get confused, stuck or lost you can always look there to check you did things correctly.

`% cp /home/shared/data/1hsg.pdb .` (path to be verified)

If you look at this file (using a text editor such as vi or nedit, for example), you should immediately see that it has two chains, A and B:

`% vi 1hsg.pdb`\
or\
`% nedit 1hsg.pdb`

Now we need to prepare our protein for simulation. First of all we will extract only the protein coordinates from the pdb file into a new file called `protein.pdb`. To do this enter the following command:

`% grep ATOM 1hsg.pdb > protein.pdb`
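A plain `grep ATOM` keeps every line that contains the string anywhere, which can include header lines that merely mention "ATOMS". This is usually harmless, but anchoring the pattern to the start of the line gives a cleaner file. The sketch below illustrates the difference on a tiny made-up PDB fragment (the coordinates are placeholders, not real 1HSG data) and also lists the chain identifiers, which sit in column 22 of each ATOM record:

```shell
# Demo on a tiny made-up PDB fragment (placeholder coordinates, not real 1HSG data).
workdir=$(mktemp -d)
cat > "$workdir/sample.pdb" <<'EOF'
REMARK   3   PROTEIN ATOMS            : 3
ATOM      1  N   PRO A   1       0.000   0.000   0.000  1.00  0.00           N
ATOM      2  CA  PRO A   1       1.000   0.000   0.000  1.00  0.00           C
ATOM      3  N   PRO B   1       5.000   0.000   0.000  1.00  0.00           N
HETATM    4  O   HOH A 301       9.000   0.000   0.000  1.00  0.00           O
END
EOF

# Unanchored pattern: also counts the REMARK line, because it contains "ATOMS".
loose_count=$(grep -c ATOM "$workdir/sample.pdb")

# Anchored pattern: keeps only records whose line starts with ATOM.
grep '^ATOM' "$workdir/sample.pdb" > "$workdir/protein.pdb"
strict_count=$(grep -c '' "$workdir/protein.pdb")

# The chain identifier is column 22 of a PDB ATOM record.
chains=$(cut -c22 "$workdir/protein.pdb" | sort -u | tr -d '\n')

echo "unanchored: $loose_count lines, anchored: $strict_count lines, chains: $chains"
```

On the real file, the same `cut -c22 protein.pdb | sort -u` check should list the two chains A and B.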

Use your preferred text editor to open the protein.pdb file and see how it differs from the 1hsg.pdb file.

We now need to make sure that all the hydrogens are added to our protein. This process will also generate the
parameter/topology file we need.

`% gmx pdb2gmx -f protein.pdb -ignh -o protein.gro`

The program should run and present a list of force fields from which to select. Select the AMBER99SB-ILDN force field, which should be option 6 in the list, followed by 1 to select the recommended TIP3P water. If all goes well this should generate several files:

1. topol.top
2. topol_Protein_chain_A.itp
3. topol_Protein_chain_B.itp
4. posre_Protein_chain_A.itp
5. posre_Protein_chain_B.itp
6. protein.gro

Type:

`% ls`

to verify that all the above files have been created and are in your directory.\
Note that the protein has a net charge of +4e. You should see a line that says "Total charge
in system 4.000 e".
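If you prefer not to check the listing by eye, a short loop can report which of the expected files are missing. This is only a sketch: it runs in a scratch directory filled with empty placeholder files (one is left out on purpose to show the failure case). In practice you would point `dir` at your own setup directory instead.

```shell
# Stand-in for the setup directory: a scratch folder with empty placeholder files.
dir=$(mktemp -d)
touch "$dir/topol.top" \
      "$dir/topol_Protein_chain_A.itp" "$dir/topol_Protein_chain_B.itp" \
      "$dir/posre_Protein_chain_A.itp" \
      "$dir/protein.gro"   # posre_Protein_chain_B.itp is deliberately not created

# Collect the names of any expected files that do not exist.
missing=""
for f in topol.top \
         topol_Protein_chain_A.itp topol_Protein_chain_B.itp \
         posre_Protein_chain_A.itp posre_Protein_chain_B.itp \
         protein.gro; do
    [ -e "$dir/$f" ] || missing="$missing $f"
done

if [ -z "$missing" ]; then
    echo "all expected files are present"
else
    echo "missing:$missing"
fi
```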

Before we can add water we need to define a box in which to put the protein and the water:

`% gmx editconf -f protein.gro -box 7 7 7 -c -o boxed.gro`

This puts the protein in the centre of a box that is 7 nm x 7 nm x 7 nm and writes the result to the file boxed.gro.\
Next we need to add water to the system. We will also add ions: enough to neutralize the system and to reach a
concentration that is representative of the cell. We can add the water by repeatedly overlaying a small box of water (216 molecules) into the system.

`% gmx solvate -cp boxed.gro -cs -o solvated.gro -p topol.top`

You may have noticed in some of the output generated that the total system charge is +4. In order for us to use an Ewald
method to calculate the electrostatic interactions we need to have a neutral system overall. Therefore we will add
counterions (chloride ions, in this case) using the option -neutral, plus enough ions to bring the solution up to 150 mM (-conc 0.15). This is done by replacing random water molecules (SOL) with Na+ or Cl- ions (named NA and CL in the command below).

`% gmx grompp -c solvated.gro -p topol.top -f /home/shared/data/genion.mdp -o genion.tpr`\
`% gmx genion -s genion.tpr -conc 0.15 -neutral -pname NA -nname CL -o system.gro -p topol.top`

When prompted, enter the group that corresponds to SOL (should be 13 or thereabouts).
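As a rough sanity check on the ion counts that `genion` reports, you can estimate how many ion pairs 150 mM corresponds to in a 7 nm cubic box. The sketch below assumes the concentration is taken over the full box volume and that the +4 charge is balanced by extra chloride ions. The numbers `genion` prints should be close to this estimate, but treat it as an approximation.

```shell
# Estimate the number of salt ion pairs for a given concentration and cubic box.
conc=0.15      # mol/L, the value passed to -conc
box=7          # nm, the box edge passed to editconf -box
charge=4       # net protein charge in e, as reported by pdb2gmx

# 1 nm^3 = 1e-24 L, so pairs = conc * N_A * box^3 * 1e-24, rounded to an integer.
pairs=$(awk -v c="$conc" -v b="$box" \
        'BEGIN { printf "%.0f", c * 6.02214076e23 * b^3 * 1e-24 }')

# A positive protein charge is balanced by the same number of extra anions.
n_na=$pairs
n_cl=$((pairs + charge))
echo "approx. $n_na NA and $n_cl CL ions"
```

For the values above this gives roughly 31 NA and 35 CL ions.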

### Energy Minimization

Before we can run the actual dynamics, we need to first minimize the energy of the system. Ideally you would minimize down until the forces were below a certain level (tolerance), but we will just give a quick burst of 200 steps here. Since we have finished setting up the system, we will now move to a new directory to perform our simulation from:

`% cd ../`

`% mkdir run`

`% cd run`

The `grompp` command will read the information of the system that we will provide (coordinates, topologies and simulation parameters) and will generate a run input file:

`% gmx grompp -c ../setup/system.gro -p ../setup/topol.top -f /home/shared/data/em.mdp -o em.tpr`

We cannot run the simulation on the login nodes of ARC; these can only be used to prepare the system for the simulation. Therefore, instead of typing the `mdrun` command (which would start the energy minimisation) directly on the command line, we will use a script that submits the job to the job scheduler. Copy this script to your working directory:

`% cp /home/shared/data/submit_em.sh .`

You can explore its contents and see that the last line in the file contains the `mdrun` command by typing: 

`% cat submit_em.sh`

The file contents will be printed on your terminal.\
Now submit it to the cluster queue.

`% sbatch submit_em.sh`

It will take a few minutes to run, depending on the waiting times of the queue. Check on the status of the run by typing:

`% squeue -u username`

Remember to replace `username` with your own username! It should print something like this:

`             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           1311337       htc       EM bioc1550 PD       0:00      1 (Priority) 
`

The `PD` in the fifth column denotes that the job is in the queue and has not started running yet. It will change to `R` once it starts running:

`             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
           1311337       htc       EM bioc1550  R       1:26      1 arcus-htc-node110 
`

If the job has finished (or if it has failed), the above command will print only the header line:

`JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)`
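If you want to read the job state in a script rather than by eye, the fifth column can be extracted with `awk`. The sketch below works on a copy of the sample output shown earlier, so it runs anywhere. On the cluster, the text would come from `squeue -u username` instead.

```shell
# Sample text copied from the squeue output shown above; on the cluster this
# would come from `squeue -u username` instead.
sample=$(cat <<'EOF'
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1311337       htc       EM bioc1550 PD       0:00      1 (Priority)
EOF
)

# Skip the header line (NR > 1) and print the job ID and state (columns 1 and 5).
states=$(printf '%s\n' "$sample" | awk 'NR > 1 { print $1, $5 }')

if [ -z "$states" ]; then
    echo "no jobs listed: finished, failed or cancelled"
else
    echo "$states"
fi
```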

If you made a mistake and need to cancel this job, type:

`% scancel JOBID`

`JOBID` should be replaced by the ID of this particular job (printed in the first column; see above).

While the job is running, you can monitor its progress by typing: 

`% tail -n 12 em.log`

This will show you what step the simulation is at or if it has finished.\
Once the energy minimization has finished, we can examine it in terms of the potential energy versus the number of minimization steps:

`% gmx energy -s em.tpr -f em.edr -o em_potential_energy.xvg`

Type 10 when prompted (this should correspond to the potential energy in the list of options presented), then a zero to finish.\
The plot should look like this one:

<img src="em_potential_energy.png" width=500 height=400/>
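As a quick numerical check alongside the plot, you can compare the first and last potential energy values in the .xvg file. In these files, lines starting with `#` or `@` are comments and plotting directives, and the remaining lines hold the data columns. The sketch below uses a small made-up stand-in file with illustrative values only. For the real check, point `xvg` at em_potential_energy.xvg.

```shell
# Made-up stand-in for em_potential_energy.xvg (values are illustrative only).
xvg=$(mktemp)
cat > "$xvg" <<'EOF'
# This file was created by gmx energy
@    title "GROMACS Energies"
@    xaxis  label "Time (ps)"
@ s0 legend "Potential"
    0.000000  -350000.000000
    1.000000  -420000.000000
    2.000000  -455000.000000
EOF

# Skip comment (#) and directive (@) lines, then remember the first and last values.
summary=$(awk '!/^[#@]/ && NF >= 2 {
                   if (!seen) { first = $2; seen = 1 }
                   last = $2
               }
               END { printf "first=%.1f last=%.1f change=%.1f", first, last, last - first }' "$xvg")
echo "$summary"
```

A negative `change` means the potential energy went down over the minimization, which is what you expect to see.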

### Production Run

At this stage, we would normally run a short simulation where the protein atoms are restrained while the water molecules
and ions are allowed to freely move around and equilibrate around the protein. For this tutorial, we will skip this bit due to
limited time. Now finally let us perform some molecular dynamics:

`% gmx grompp -c em.gro -p ../setup/topol.top -f /home/shared/data/md.mdp -maxwarn 1 -o md.tpr`

`% cp /home/shared/data/submit_md.sh .`

`% sbatch submit_md.sh`

You can check the status of the job again as described previously.

At the moment it is set up to run for 1000 ps. This will take several minutes to complete, depending on the waiting time in the queue, so now might be a good time for a break (time for lunch!) to allow at least some data to appear. You don't have to wait for it to finish completely, though. The analysis can be done on the output files that will be generated, or you can always use the "one I made earlier" in the directory /home/shared/prerun/run (this is a 1000 ps simulation of the same system).

After the end of the production run, it would be useful to obtain some properties that will give us an insight into our protein system.\
There are various so-called ensembles that are used for protein simulations. Probably the most common is one where the number of particles, the pressure and the temperature are held constant (NPT). This is usually achieved by coupling the system to a heat bath (thermostat) and a pressure bath (barostat). Nevertheless, it is usually a good idea to check these quantities as a function of time through the trajectory, just to make sure nothing unexpected happened. First let us check the temperature of our simulation.

`% gmx energy -f md.edr -s md.tpr -o 1hsg_temperature.xvg`

The program will then present you with a large table of all the values recorded in the energy (.edr) file. We want to examine temperature so type 13, press enter and then 0 and press enter again. The program will then analyse the temperature and present some statistics of the analysis at the end.
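If you later want to recompute such statistics from the .xvg file yourself (for example, over only part of the run), a short `awk` command is enough. The sketch below computes the mean and standard deviation of the second column of a small made-up file. The values are illustrative, and the real file would be 1hsg_temperature.xvg.

```shell
# Made-up stand-in for 1hsg_temperature.xvg (values are illustrative only).
xvg=$(mktemp)
cat > "$xvg" <<'EOF'
# made-up temperature data
@ s0 legend "Temperature"
    0.0  299.0
    1.0  301.0
    2.0  300.0
    3.0  300.0
EOF

# Accumulate sum and sum of squares over data lines, then report mean and
# (population) standard deviation of column 2.
stats=$(awk '!/^[#@]/ && NF >= 2 { n++; sum += $2; sumsq += $2 * $2 }
             END {
                 mean = sum / n
                 var = sumsq / n - mean * mean
                 if (var < 0) var = 0   # guard against rounding error
                 printf "n=%d mean=%.2f sd=%.2f", n, mean, sqrt(var)
             }' "$xvg")
echo "$stats"
```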

Another set of properties that is quite useful to examine is the various energetic contributions to the energy. The total energy should stay approximately constant, but the various contributions can change, and this can sometimes indicate something interesting or strange happening in your simulation. Let us look at some energetic properties of the simulation.

`% gmx energy -s md.tpr -f md.edr -o 1hsg_energies.xvg`

We shall select the short-range Lennard-Jones (7), short-range Coulombic (9) and potential energy (11) terms. End your selection with a zero.

Finally, we need to renumber the em.gro file so that the residues of the two chains do not share residue numbers. This will be necessary when we calculate the RMSF values.

`% gmx editconf -f em.gro -o em_renumbered.gro -resnr 1`

We will plot and explore the temperature and the energetic components that we obtained in the next section of the tutorial.


### File Transfer 

As soon as the simulation is finished, you should go to your local terminal and transfer the files from the remote directory to your local directory.\
First, in your local terminal, go to the `OxCompBio/tutorials/MD` directory.

Then use `scp` with the `-r` (recursive) option to transfer the remote run subdirectory to your local directory:

`% scp -r username@oscgate.arc.ox.ac.uk:/home/username/run .`

When prompted, enter your password. Go to the new subdirectory that contains all the simulation output files we will use for the analysis:

`% cd run`

Now you are ready to perform some types of analysis of the simulation trajectory!
