# CHEM3580 Workshop 1 #

## Introduction to MD in Python & Jupyter ##

As we have seen in the lectures for this section of the course, MD (molecular dynamics) is an extremely powerful tool for simulating the structure, behaviour and properties of colloid and interface systems. The practical component of this section consists of a series of computer labs where you will be using MD to simulate various condensed phase phenomena, and connecting your results to concepts in other parts of the course.

The practical method we will be using for these workshops is the (very popular) Python programming language, via (also very popular) Jupyter notebooks. A *great* introduction to Python and Jupyter is provided [here](https://github.com/praiteri/TeachingNotebook/blob/main/introductionToPython/introductionToPython.ipynb), and working through this introduction would be great preparation for these workshops.

In each workshop, you will be provided with a Jupyter notebook file, like this one, that explains the aims and expectations of the workshop. The notebook file also includes Python code that you can use as a *template* for your own simulations in the workshop. You will be expected to adjust the Python code to your own purposes. Note, though, that you will not need any prior experience with Python for these workshops.

We will be running the notebooks on cloud-based servers (instructions available on the CHEM3580 Canvas site), where all of the necessary Python software is pre-installed for you. You *should* be able to simply switch your virtual machine on and run the notebooks without any issues. However, if you are keen, you can try installing Python and Jupyter on your own device - both are totally free and open-source. You will quickly understand why these computational tools are rapidly becoming the most widely used in science for modelling and data analysis!!

As we have discussed in the lectures, there are a multitude of MD codes available on the web - some are commercial, some are open-source. Some are better than others, or designed for very specific purposes. Here we will be using the [__OpenMM__](https://openmm.org) code, which is an open-source and very powerful set of Python routines for running MD simulations on GPUs. 
This notebook illustrates the basic steps of using OpenMM in Jupyter, i.e. how to:

1. Import the necessary Python modules
1. Define the chemical structure you are interested in
1. Define the MD forcefield you want to use
1. Define how Newton's equations of motion will be iterated
1. Define which *ensemble* you are going to use (e.g. NVE, NVT, NPT etc.)
1. Run the initial energy minimisation
1. Run the actual MD trajectory
1. ...and finally... visualise your results!!



### Importing Python Modules ###


One of the most powerful aspects of Python is its 'modular structure'. To illustrate how Python modules work, let's assume we want to calculate the value of $\pi$. There are loads of ways to calculate $\pi$, but here we will use [Leibniz's formula](https://en.wikipedia.org/wiki/Leibniz_formula_for_π), $\pi/4 = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \dots$ This formula is simple to write down but converges very slowly (it is not how computers calculate $\pi$ in practice): the one million terms summed below still leave an error of about $10^{-6}$. We can do this in Python with the following code (*press \<shift\>+\<return\> to run the code in Jupyter for yourself*)

In [None]:
k = 1
pi = 0

for i in range(1000000):
    if i % 2 == 0:
        pi += 4/k
    else:
        pi -= 4/k
    k += 2

print(pi)
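
As a rough check on how quickly this series converges, the sketch below compares partial sums with the value of $\pi$ stored in Python's standard `math` module. It is not part of the workshop template, and the helper name `leibniz_pi` is made up for this illustration.

```python
import math

def leibniz_pi(n_terms):
    """Approximate pi by summing the first n_terms of the Leibniz series."""
    total = 0.0
    for i in range(n_terms):
        # terms alternate in sign: +4/1, -4/3, +4/5, ...
        total += (-1) ** i * 4 / (2 * i + 1)
    return total

# Print the number of terms, the approximation and its absolute error
for n in (10, 1000, 100000):
    approx = leibniz_pi(n)
    print(n, approx, abs(approx - math.pi))
```

The error shrinks roughly in proportion to 1/n, so each additional correct decimal place costs about ten times more terms.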

Pretty gross huh!! If only there was an easier way to do it... This is exactly what modules are for in Python. A *much* easier alternative is just to use the ```numpy``` module, which does the hard work for you:

In [None]:
import numpy as np

print(np.pi)

Here the syntax ```import numpy as np``` makes the notebook 'import' all of the variables, functions, algorithms etc. contained in the very powerful [NumPy](https://numpy.org) Python package, which is one of the core mathematical libraries used with Python. We are also telling our notebook to use the "nickname" ```np``` for the ```numpy``` module, just to make things a bit shorter. ```np.pi``` indicates that the variable ```pi``` is defined within the module ```numpy```.
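
Python offers a few different ways of writing an import, and it may help to see them side by side. The sketch below uses the standard-library `math` module (an assumption made only so the example runs anywhere; the same syntax applies to `numpy` or `openmm`).

```python
import math                 # import the whole module; names are used as math.<name>
import math as m            # the same module under a shorter nickname
from math import pi, sqrt   # bring selected names directly into the notebook

print(math.pi)   # full module name
print(m.pi)      # nickname
print(pi)        # no prefix needed
print(sqrt(16))  # functions can be imported this way too
```

A wildcard import such as `from math import *` brings in every public name at once. It is convenient in a short notebook, but it makes it harder to see which module a given name came from.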

We will use the ```OpenMM``` module here to define all of the underlying code and algorithms that we need to perform our MD simulations:

In [None]:
from openmm.app import *
from openmm import *
from openmm.unit import *
from sys import stdout

You should notice that when you ran the cell above, it didn't give you any output - in this case, __no news is good news!!__ Unless you explicitly ask for output (for example with ```print```), a cell like this one shows nothing except error or warning messages if something goes wrong.

### Defining Molecular Structures ###

There are many ways to define your starting structure in an MD simulation. The simplest is to read in a structure from a file that was prepared previously - this is the approach that we will most often take in this course. We will come back to how we prepare these files in the next workshop. 

The structure that we will use as the guinea pig for this demonstration is the [villin protein](https://en.wikipedia.org/wiki/Villin-1), which is defined in a ```pdb``` file. The ```pdb``` file format takes its name from the [Protein Data Bank](https://www.rcsb.org), the widely used repository of protein and enzyme structures in biology and chemistry.

In [None]:
pdb = PDBFile('villin.pdb')

Let's take a quick look at the contents of the file: 

In [None]:
with open('villin.pdb', 'r') as f:
    text = f.read()
    print(text)
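
Each `ATOM` line in a pdb file stores its information in fixed-width columns (atom number, atom name, residue name, chain, residue number, then x, y, z coordinates in ångströms). The sketch below builds one such line from made-up values and then slices it back apart. The helper name `parse_atom_record` and all of the values are hypothetical and are not taken from `villin.pdb`.

```python
# Build one ATOM record from made-up values so every field sits in its fixed-width column
record = (
    "ATOM  {serial:5d} {name:<4s}{alt:1s}{res:>3s} {chain:1s}{resseq:4d}{icode:1s}   "
    "{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{temp:6.2f}          {elem:>2s}"
).format(serial=1, name=" N", alt=" ", res="MET", chain="A", resseq=1, icode=" ",
         x=1.234, y=-5.678, z=9.012, occ=1.0, temp=0.0, elem="N")

def parse_atom_record(line):
    """Slice the fixed-width fields out of a single ATOM line."""
    return {
        "serial": int(line[6:11]),            # columns 7-11: atom number
        "atom_name": line[12:16].strip(),     # columns 13-16: atom name
        "residue_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain_id": line[21],                 # column 22: chain identifier
        "residue_number": int(line[22:26]),   # columns 23-26: residue number
        "x": float(line[30:38]),              # columns 31-38: x coordinate
        "y": float(line[38:46]),              # columns 39-46: y coordinate
        "z": float(line[46:54]),              # columns 47-54: z coordinate
        "element": line[76:78].strip(),       # columns 77-78: element symbol
    }

atom = parse_atom_record(record)
print(record)
print(atom)
```

In practice `PDBFile` does this parsing for us; the sketch is only meant to make the printed file contents easier to read.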

### Defining MD Simulation Parameters ###

The ```ForceField``` class in OpenMM loads the forcefield parameters that we will use for our simulation from the parameter files we name (here, the Amber14 forcefield and the TIP3P-FB water model). Let's attach them to the variable ```forcefield``` before we create the system for our simulation:

In [None]:
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

Now, we will tell OpenMM to create our MD parameters. These parameters basically consist of:  
1. How the atoms interact with each other (in terms of bonding interactions, such as the $U_{\text{stretch}}$ potential, and non-bonding interactions, such as van der Waals and electrostatic forces)
1. The atoms present in the system and
1. How these atoms are connected together. 

In practice, (1) is determined by the __forcefield__, while (2) and (3) are determined by the __topology__. The forcefield parameters themselves (i.e. the parameters that define the bond stretch, angle bend, torsional potentials etc.) are the AMBER forcefield parameters, which we will talk more about in the lectures.
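
To make the idea of forcefield terms concrete, here is a minimal sketch of three common functional forms: a harmonic bond stretch, a Lennard-Jones (van der Waals) term and a Coulomb term. The function names and all parameter values are placeholders chosen for illustration, not actual AMBER parameters, and the factor of ½ in the bond term is one common convention.

```python
def harmonic_bond_energy(r, r0, k):
    """Bond stretch: U = 0.5 * k * (r - r0)**2, zero at the equilibrium length r0."""
    return 0.5 * k * (r - r0) ** 2

def lennard_jones_energy(r, sigma, epsilon):
    """van der Waals: U = 4 * epsilon * ((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def coulomb_energy(r, q1, q2):
    """Electrostatics between point charges (in units of e) separated by r (in nm)."""
    coulomb_constant = 138.935  # approx. 1/(4*pi*eps0) in kJ mol^-1 nm e^-2
    return coulomb_constant * q1 * q2 / r

# Placeholder parameters (lengths in nm, energies in kJ/mol)
r0, k = 0.10, 300000.0
sigma, epsilon = 0.30, 0.50

print(harmonic_bond_energy(0.11, r0, k))                       # stretched bond costs energy
print(lennard_jones_energy(2 ** (1 / 6) * sigma, sigma, epsilon))  # bottom of the well, about -epsilon
print(coulomb_energy(0.30, +1.0, -1.0))                        # opposite charges attract (negative energy)
```

The forcefield files supply values such as `r0`, `k`, `sigma`, `epsilon` and the partial charges for every atom type, while the topology says which pairs of atoms each term applies to.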

Here, we will define all of this information using the ```createSystem``` function in OpenMM, and store the result in the ```system``` variable:

In [None]:
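system = forcefield.createSystem(
    pdb.topology,                  # the atoms and bonds read from the pdb file
    nonbondedMethod=PME,           # particle mesh Ewald for long-range electrostatics
    nonbondedCutoff=1*nanometer,   # cutoff distance for the direct non-bonded interactions
    constraints=HBonds             # hold the length of every bond involving hydrogen fixed
)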

Here, ```pdb.topology``` is the description of the atoms, residues and bonds that OpenMM built when it read the pdb file above. The ```nonbondedMethod``` and ```nonbondedCutoff``` arguments describe how electrostatic and van der Waals interactions will be calculated, while ```constraints=HBonds``` holds the length of every bond involving a hydrogen atom fixed, which permits a longer time step. We will not worry too much about these details for now.

### Defining the Integrator ###

The ```system``` variable contains all of the structural and energetic information for our protein. It does not describe *how* we want Newton's equations of motion to be simulated. To do this, we will use OpenMM's ```LangevinMiddleIntegrator```, one of the many algorithms OpenMM has for iterating the equations of motion. It also acts as a thermostat, keeping the simulation close to a target temperature. It wants to know:

1. the temperature of the simulation
1. how strongly the temperature of the simulation is controlled (the friction coefficient)
1. the time step of the simulation

We will attach all of this information to the ```integrator``` variable:

In [None]:
integrator = LangevinMiddleIntegrator(
    300*kelvin,        # the simulation temperature
    1/picosecond,      # the friction coefficient: how strongly the temperature is controlled (larger values = stronger control)
    0.004*picosecond   # the simulation time step (4 fs)
)
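
To see why the time step matters, here is a minimal sketch of a simpler integrator, plain velocity Verlet, applied to a single particle on a spring. This is not the Langevin scheme used by ```LangevinMiddleIntegrator``` (there is no thermostat here), and it uses reduced units with mass and spring constant both equal to 1. All function names are made up for this illustration.

```python
def velocity_verlet(x, v, dt, n_steps, k=1.0, mass=1.0):
    """Integrate a 1D harmonic oscillator and return the final position and velocity."""
    force = -k * x
    for _ in range(n_steps):
        v += 0.5 * dt * force / mass   # half-step velocity update
        x += dt * v                    # full-step position update
        force = -k * x                 # force at the new position
        v += 0.5 * dt * force / mass   # second half-step velocity update
    return x, v

def total_energy(x, v, k=1.0, mass=1.0):
    """Kinetic plus potential energy of the oscillator."""
    return 0.5 * mass * v ** 2 + 0.5 * k * x ** 2

def relative_energy_error(dt, total_time=50.0):
    """Relative change in total energy after integrating for total_time."""
    e0 = total_energy(1.0, 0.0)
    x, v = velocity_verlet(1.0, 0.0, dt, int(round(total_time / dt)))
    return abs(total_energy(x, v) - e0) / e0

for dt in (0.01, 0.1, 1.0, 2.5):
    print(dt, relative_energy_error(dt))
```

Small time steps conserve energy well, larger ones less so, and beyond a threshold (here a time step of 2 in these reduced units) the trajectory becomes unstable and the energy grows without bound. The same trade-off between speed and stability applies when choosing the time step for a real MD simulation.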

As we will see with simulations later on, the kind of information that you need to provide depends on the kind of simulation that you want to run. For example, if you are using an NPT ensemble, you need to specify both the temperature and pressure of the simulation (in OpenMM the pressure is set by adding a barostat to the ```system``` rather than through the ```integrator```). But for now we can get going with our calculation...

### Initial Energy Minimisation ###

We now have everything we need to define our MD simulation, i.e.

1. the geometry and forcefield information (which is defined in the ```system``` variable), and 
1. how we want to simulate the dynamics (which is defined in the ```integrator``` variable)

We can now define our MD simulation, using OpenMM's ```Simulation``` function. Let's define this in a new variable, ```villin_md```:

In [None]:
villin_md = Simulation(pdb.topology, system, integrator)
villin_md.context.setPositions(pdb.positions)

As discussed in lectures, we must always begin an MD trajectory from a __local minimum__ on the potential energy surface. We will use OpenMM's ```minimizeEnergy``` function to do this for us. Here, the syntax ```villin_md.minimizeEnergy``` tells Python to apply the ```minimizeEnergy``` function to the system defined in the ```villin_md``` variable. It will use a maximum of 100 steps (```maxIterations=100```), and only take a few seconds...

In [None]:
print("Minimising energy...")
villin_md.minimizeEnergy(maxIterations=100)
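
To illustrate what "finding a local minimum" means, the sketch below runs a plain steepest-descent minimisation on a one-dimensional double-well potential. This is a deliberately simple stand-in: OpenMM's own minimiser uses a more sophisticated algorithm (L-BFGS) on the full protein, and the function names and step size here are assumptions made for illustration.

```python
def potential(x):
    """Double-well potential with two local minima, at x = -1 and x = +1."""
    return (x ** 2 - 1) ** 2

def gradient(x):
    """Derivative of the potential with respect to x."""
    return 4 * x * (x ** 2 - 1)

def steepest_descent(x, step_size=0.05, max_iterations=100, tolerance=1e-8):
    """Walk downhill along the negative gradient until it (almost) vanishes."""
    for _ in range(max_iterations):
        g = gradient(x)
        if abs(g) < tolerance:
            break
        x -= step_size * g
    return x

print(steepest_descent(0.3))                     # ends in the well at x = +1
print(steepest_descent(-0.3))                    # ends in the well at x = -1
print(steepest_descent(0.3, max_iterations=3))   # stopped early, not yet at the minimum
```

Two points carry over to the real simulation: the minimum that is found depends on the starting structure, and a cap on the number of iterations (like ```maxIterations=100```) can stop the minimiser before it has fully converged.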

### Running the MD Simulation ###

Now that we have our system at a local minimum on the potential energy surface, we will run the MD simulation itself using OpenMM's ```step``` function. First, however, we will ask OpenMM to give us the information that we want from this simulation. This is basically the structure of the protein, its energy and the temperature of the simulation. We do this via the ```PDBReporter``` and ```StateDataReporter``` functions in OpenMM:

In [None]:
villin_md.reporters.append(PDBReporter('output.pdb', 1000)) # we are asking OpenMM to write the structure of the simulation to the file output.pdb every 1000th step
villin_md.reporters.append(StateDataReporter(
    'output.csv', # we are asking OpenMM to write the temperature & energy to the file output.csv (passing stdout here instead would print them directly in the notebook while the simulation runs)
    100, # write the temperature & energy every 100th step
    step=True, # write the step number
    time=True, # write the simulation time
    potentialEnergy=True, # write the potential energy
    temperature=True # write the temperature
))

We are finally good to go!!

Let's use OpenMM's ```step``` function to actually run the MD simulation ```villin_md```. To start with, we will just do 1000 steps:

In [None]:
villin_md.step(1000)
with open('output.csv', 'r') as f:
    text = f.read()
    print(text)
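
It is worth checking how much simulated time this run covers and how much output to expect. The sketch below does the arithmetic using the settings from the cells above, assuming each reporter writes once every `interval` steps counted from the start of the run. The variable names are made up for this illustration.

```python
timestep_ps = 0.004    # integrator time step, in picoseconds
n_steps = 1000         # number of steps passed to step()
csv_interval = 100     # StateDataReporter interval
pdb_interval = 1000    # PDBReporter interval

simulated_time_ps = n_steps * timestep_ps   # total simulated time
n_csv_rows = n_steps // csv_interval        # rows of data expected in output.csv
n_pdb_frames = n_steps // pdb_interval      # structures expected in output.pdb

print("Simulated time (ps):", simulated_time_ps)
print("Rows in output.csv:", n_csv_rows)
print("Frames in output.pdb:", n_pdb_frames)
```

With these settings the trajectory file holds only a single structure, so a longer run or a smaller ```PDBReporter``` interval is needed to see any motion.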

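Notice that the information we requested is printed above once the simulation has finished: the reporter writes it to ```output.csv``` as the simulation proceeds, and the cell then reads that file back and prints it.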

### Visualising Your Trajectory ###

Often the most useful information provided by an MD simulation is not temperature or energy but *structural*, which means we need a way of looking at the structure or MD trajectory *directly*.

We will use the Python ```nglview``` and ```MDAnalysis``` modules, which enable us to look at structures directly inside the notebook. Let's import the necessary modules:

In [None]:
import nglview as nv
import MDAnalysis as mda

Here the syntax ```import nglview as nv``` tells Python to use the nickname ```nv``` in place of ```nglview``` for the remainder of the notebook. This is a very common technique that just makes writing code a bit easier. 

To visualise our simulated protein, we need to define a new kind of variable, a 'universe', inside the ```MDAnalysis``` module. The benefits of doing this won't really become obvious until later in the workshop course, but we will find this is very useful for analysing MD trajectories in many ways later. For now, we will use ```nglview```'s ```show_mdanalysis``` function to produce an interactive widget showing our MD simulation below in the notebook:

In [None]:
u = mda.Universe('output.pdb')        # define the MDAnalysis universe using the output.pdb file produced above in our MD simulation
view = nv.show_mdanalysis(u)          # apply the show_mdanalysis function to this universe, and place the output in the 'view' variable
view.add_representation('licorice', selection="water")        # adjust the way we show the water molecules present in the simulation
view                                  # show the view variable

### Plotting Results & Properties ###

In addition to looking at the structure itself, we may also want to visualise the properties of the protein, for example the energy and the temperature. You may be familiar with analysing this kind of data in MS Excel, but here we will see that it is far more convenient and simple to do this directly inside the Jupyter notebook.

To do this we can load the simulation information stored in ```output.csv``` into a variable called ```data```. This is an array of rows and columns; the 2nd, 3rd and 4th columns (indices 1, 2 and 3, because Python counts from 0) store the simulation time, potential energy and temperature, respectively. The code below loads ```data``` and then defines three new variables storing each of these properties:


In [None]:
with open('output.csv', 'r') as f:
    text = f.read()
    print(text)
data = np.genfromtxt('output.csv', skip_header=1, delimiter=',')
times = data[:, 1]
potential_energies = data[:, 2]
temperatures = data[:, 3]
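
The sketch below shows the same column indexing using only Python's standard `csv` module on a small piece of in-memory text. The header layout is assumed to match ```output.csv```, and the three rows of numbers are placeholders invented for this illustration, not simulation results.

```python
import csv
import io

# Placeholder text laid out like output.csv: one header line, then one row per report
sample = (
    '#"Step","Time (ps)","Potential Energy (kJ/mole)","Temperature (K)"\n'
    '100,0.4,-1000.0,250.0\n'
    '200,0.8,-990.0,275.0\n'
    '300,1.2,-995.0,290.0\n'
)

rows = list(csv.reader(io.StringIO(sample)))
header, body = rows[0], rows[1:]     # skip the header, like skip_header=1

# Column indices start at 0, so index 1 is the 2nd column, and so on
times = [float(row[1]) for row in body]
potential_energies = [float(row[2]) for row in body]
temperatures = [float(row[3]) for row in body]

mean_temperature = sum(temperatures) / len(temperatures)
print(header[1], times)
print("Mean temperature (K):", mean_temperature)
```

`np.genfromtxt` does the same job in one line and returns an array, which is why the workshop code uses it.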

Now we can plot the data directly using the very popular ```matplotlib``` module in Python. First, let's plot the potential energy of the protein as the MD simulation proceeds: 

In [None]:
import matplotlib.pyplot as plt

plt.plot(times, potential_energies)
plt.xlabel('Time (ps)')
plt.ylabel('Potential energy (kJ/mol)')
plt.show()

And now the temperature of the simulation:

In [None]:
plt.plot(times, temperatures)
plt.xlabel('Time (ps)')
plt.ylabel('Temperature (K)')
plt.show()