# Assignment 1 - Molecular Structure Visualization, Normal Modes, and RDKit Basics

### Due on Monday, 3 February 2025 at 9:29 AM

### Name: [Your Name Here]

The objective of this assignment is threefold: first, to become familiar with protein structures and the Protein Data Bank; second, to try some of the programs used to display protein structures; and third, to learn how these programs can be used for research and understanding.

## The Protein Data Bank

The [Protein Data Bank](https://www.rcsb.org/), or PDB, is the best place to obtain three-dimensional structures of proteins, nucleic acids, peptides and other macromolecules. From their site, you can locate proteins by keyword searching or by entering the accession number (PDB ID) of the structure file, such as 1mba. Most of the structures in the PDB were determined by [X-ray crystallography](https://en.wikipedia.org/wiki/X-ray_crystallography) or [NMR spectroscopy](https://en.wikipedia.org/wiki/Nuclear_magnetic_resonance_spectroscopy), with a growing share determined by [cryo-EM](https://en.wikipedia.org/wiki/Cryogenic_electron_microscopy). Details on the molecules in the structure file, how the structures were determined, pertinent research articles, etc. can be found on the website and also in the PDB file itself.

PDB files are just formatted text files, so you can open them in a text editor or even Word and read them. There is a wealth of information there! The molecular viewing programs use the ATOM records in the file, which contain residue name, residue number, atom name and atom number, all of which are important when selecting molecules or parts of them to visualize. Often, the structure files include other molecules besides the protein, such as water molecules, nucleic acids and bound ligands.
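Because the ATOM records are fixed-column text, the fields named above can be pulled out with plain string slicing. The sketch below assumes the standard PDB column layout; the three sample records and the helper name `parse_atom_line` are made up for illustration and do not come from a real entry.

```python
# Hypothetical sample records laid out in the standard fixed-column PDB format.
SAMPLE_PDB = """\
ATOM      1  N   VAL A   1      -2.900  17.600  15.500  1.00  0.00           N
ATOM      2  CA  VAL A   1      -3.600  16.400  15.300  1.00  0.00           C
HETATM    3  O   HOH A 101      10.000   5.000   2.500  1.00  0.00           O
"""


def parse_atom_line(line):
    """Slice one ATOM/HETATM record into its named fields."""
    return {
        "record": line[0:6].strip(),      # ATOM or HETATM
        "serial": int(line[6:11]),        # atom number
        "name": line[12:16].strip(),      # atom name
        "res_name": line[17:20].strip(),  # residue name
        "chain": line[21],                # chain identifier
        "res_seq": int(line[22:26]),      # residue number
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


atoms = [
    parse_atom_line(line)
    for line in SAMPLE_PDB.splitlines()
    if line.startswith(("ATOM", "HETATM"))
]

for atom in atoms:
    print(atom["record"], atom["serial"], atom["name"], atom["res_name"],
          atom["chain"], atom["res_seq"], atom["xyz"])
```

Waters and ligands normally appear as HETATM records, which is why the sketch keeps both record types; the same fields are what you type into a selection in a molecular viewer.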

## Molecular Graphics Programs 

There are quite a few programs used to view molecules. Each has its advantages and disadvantages.

### [PyMol](https://pymol.org/2/) 

PyMol is a relatively new and very promising molecule viewing program developed by [Schrödinger](https://www.schrodinger.com/). It is available for all Unix/Linux platforms, Windows and Mac OSX. It uses a combination of menus, a powerful selection sidebar (to be demonstrated in the lab), and a rich command line interface. It has a logging feature that records commands as they are executed, allows for scripting and has a nice movie feature.

PyMol is free to download and build as it is open source software. However, Purdue University provides all students with a site license which allows for simple installation and execution. You can download it here: [Pymol version 2](https://pymol.org/2/). You must also install the [site license](https://pymol.org/dsc/index.php?ip=license/) while on campus as the link will not work otherwise.

### [Visual Molecular Dynamics (VMD)](http://www.ks.uiuc.edu/Research/vmd/) 

VMD is a new and very promising molecule viewing program developed at the University of Illinois at Urbana-Champaign. It is available for all Unix/Linux platforms, Windows and Mac OSX. It uses a combination of three windows: a terminal window, a viewing window and a main window. The terminal window is useful because it displays error messages and command output. The viewing window is where your molecules will appear, and the main window is where most of the controls for VMD are found. If you need more information to complete this assignment, we strongly encourage you to go through the VMD tutorials, which can be found on the [VMD website](http://www.ks.uiuc.edu/Research/vmd/current/docs.html#tutorials). VMD is an important tool that will be helpful for further assignments as well.

VMD is under rapid development, so get the latest program from the [VMD Web Site](https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=VMD).

## Part 1

### Task 1

Learn to download structure files. Some structure files are provided with the viewer programs for demonstration purposes, but you should download the structures of different proteins from the [PDB](https://www.rcsb.org/).
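Structure files can also be fetched from a script instead of the browser. The sketch below assumes the RCSB file server pattern `https://files.rcsb.org/download/<ID>.pdb`; the helper names `pdb_download_url` and `download_pdb` are hypothetical, and the actual download needs network access, so that call is left commented out.

```python
import urllib.request

# Assumed URL pattern of the RCSB file download service.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id):
    """Build the download URL for a four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id)


def download_pdb(pdb_id, destination=None):
    """Save the structure file locally (needs network access)."""
    destination = destination or f"{pdb_id.strip().lower()}.pdb"
    urllib.request.urlretrieve(pdb_download_url(pdb_id), destination)
    return destination


print(pdb_download_url("1mba"))
# download_pdb("1mba")  # uncomment to actually fetch the file
```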

### Task 2

Learn to use two different molecule viewing programs (PyMol and VMD).
- Learn to rotate and zoom in on different parts of the structure.
- Play with different structure representations.
- Learn how to select different molecules, protein chains, residues and atoms.
- Learn how to output the structures in different image formats.
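Once you know the interactive steps, they can be written down as a small PyMol command script so that a figure is reproducible. The sketch below only generates the script text; the commands (`fetch`, `hide`, `show`, `select`, `zoom`, `png`) are standard PyMol commands, while the helper name `pymol_script`, the selection name `lig` and the file names are assumptions made for illustration. It also assumes the structure contains a non-water ligand, as 1mba does.

```python
def pymol_script(pdb_id, image_name="view.png"):
    """Return the text of a small PyMol command script (.pml)."""
    commands = [
        f"fetch {pdb_id}",                      # download the structure
        "hide everything",
        "show cartoon",                         # secondary-structure view
        "select lig, hetatm and not resn HOH",  # non-water heteroatoms
        "show sticks, lig",
        "zoom lig, 5",                          # zoom with a 5 angstrom margin
        f"png {image_name}, dpi=300, ray=1",    # ray-traced image export
    ]
    return "\n".join(commands) + "\n"


script_text = pymol_script("1mba")
print(script_text)
```

Saving the printed text as, for example, `view_1mba.pml` and running `@view_1mba.pml` from the PyMol command line should replay the same steps.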

### Question 1 

Choose two protein structures (of two different proteins).
- These should be structures that you download from the PDB, not ones that were supplied with the graphics programs.
- If you don't have the slightest idea what structures to choose, you can try the Molecule of the Month.
- There are links to articles related to each structure in the database, and there is also important information within the header section of the pdb file itself. Learn about the proteins and the other molecules in the structure files.

List the names and the PDB IDs of the proteins in your notebook. Describe the biological function of the proteins you chose. Remember that there is a great deal of information in the PDB header, including references to papers by the groups that did the structure determination.

### Answer

### Question 2 

Use the programs to generate illustrations of interesting aspects of the structure of these proteins.

- You can export images from the programs themselves or capture images straight off the screen, as discussed in the lecture. Only submit compressed image file formats, like JPEG or GIF. TIFF, PICT and other raw formats generate massive files. You should include images generated by both viewing programs in your submission.
- Each image should illustrate some property of the protein's structure or function. We do not just want pretty pictures - the biology needs to be there too! Some ideas:
    - Show the architecture of the protein, secondary structure, packing, surface, fold, etc.
    - Show some examples of interactions that stabilize the structure.
    - If your pdb file has other molecules besides the protein in it, show them and how they interact with the protein. What is the biological relevance of their presence?
    - Is your protein oligomeric or part of a complex? How can that be clearly shown?
    - Zoom in to clearly show interesting details.
    - Show how the structure might affect the dynamics of the molecule.
- Use one or more images to show how the structure of the protein relates to its function.

Include a short, clear description of the biological idea conveyed in each image, along with answers or explanations for all questions asked in the assignment.

You will need 2 images from each program (PyMol and VMD), for a total of 4 images with the corresponding biological idea descriptions. Try to capture different biological features in the 4 images if you can.

### Answer

### Question 3 

Use the program of your choice to make a simulation movie.

- Many publications and conference speakers now include videos of their simulations as part of their submissions, so it is important to learn how to make these movies to remain competitive.
- You must submit a video made in either PyMOL or VMD. The initial state of the system and the trajectory used to create the simulation will be provided to you; all you need to do is make the movie.

### Answer

## Normal Mode Analysis

The [Karsten Suhre Lab](https://qatar-weill.cornell.edu/research/research-faculty/suhre-lab) developed the server [elNemo](https://www.sciences.univ-nantes.fr/elnemo/), which can be used to calculate normal modes.

1. Visit the elNemo Server and submit one conformation of the same protein you used above. Calculate a few of the lowest normal modes, but remember that the first 6 normal modes are rotational and translational. (Hint: normal mode analysis of "open" conformations gives better results than the analysis of a "closed" conformation).

2. The output of the server will be a number of PDBs along with animation files, each representing the motion due to one normal mode. Upload the files to your favorite molecular viewer and observe the mode-induced movements.
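The reason the first six modes carry no information can be checked numerically: three translations and three rotations move the whole molecule without changing any interatomic distance, so they cost no energy and have zero frequency. The sketch below assumes a simple elastic network potential (harmonic springs between pseudo-atoms within a cutoff); the toy coordinates, the 8 angstrom cutoff and the spring constant are illustrative choices, not the server's actual settings.

```python
import math


def enm_energy(reference, coords, cutoff=8.0, k=1.0):
    """Harmonic elastic-network energy of coords relative to the reference."""
    energy = 0.0
    for i in range(len(reference)):
        for j in range(i + 1, len(reference)):
            d0 = math.dist(reference[i], reference[j])
            if d0 <= cutoff:  # only nearby pairs are connected by a spring
                d = math.dist(coords[i], coords[j])
                energy += 0.5 * k * (d - d0) ** 2
    return energy


def translate(coords, shift):
    return [tuple(c + s for c, s in zip(point, shift)) for point in coords]


def rotate_about_z(coords, angle):
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(cos_a * x - sin_a * y, sin_a * x + cos_a * y, z)
            for x, y, z in coords]


def stretch_along_x(coords, factor):
    return [(factor * x, y, z) for x, y, z in coords]


# Hypothetical toy "protein": five pseudo-atoms, coordinates in angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0),
             (2.0, 5.0, 2.5), (-1.0, 3.0, 4.0)]

e_translated = enm_energy(reference, translate(reference, (10.0, -4.0, 2.0)))
e_rotated = enm_energy(reference, rotate_about_z(reference, math.radians(40.0)))
e_stretched = enm_energy(reference, stretch_along_x(reference, 1.1))

print(f"translation: {e_translated:.2e}")
print(f"rotation:    {e_rotated:.2e}")
print(f"stretch:     {e_stretched:.2e}")
```

The rigid-body moves give an energy that is zero up to rounding error, while the stretch deforms the springs and costs energy; only deformations of this second kind correspond to the internal modes you should analyze.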

## Part 2 

### Question 4 

Analyze the normal modes of one of the proteins that you used in Part 1 of the assignment.

- Describe the motion induced by the three lowest normal modes ("breathing", twisting, hinging...) and include some stills. 
- Is there a low frequency normal mode that is similar to the motion seen in the morph server? 
- Can the biologically relevant motion be described by a small number of low frequency normal modes?

### Answer

### Question 5 

Describe why the motion of your protein is BIOLOGICALLY interesting (one or two paragraphs) and any information that you think the morph, the normal mode analysis, or data generated on the server webpages may have revealed.

### Answer

## RDKit 

RDKit is a powerful cheminformatics package that integrates well with Python. It has a wide range of functionality including substructure searching, chemical descriptor calculation, and molecular fingerprinting. In this part of the assignment, we will explore some of these features.

For previous Python code segments, we did not need any external packages because we used only built-in packages. However, RDKit is not a built-in package, so we must install it with a package manager in order to use it in a Jupyter Notebook. After logging on to Scholar, we run the following lines of code to create a conda environment with RDKit.

```
module load anaconda
conda create -c rdkit -n chm579 rdkit
```

Follow the on-screen prompts to finish installing the RDKit environment. After creating the anaconda environment, we need to activate it and install the Jupyter Notebook using `pip`.

```
conda activate chm579
pip install jupyterhub jupyterlab jupyter jsonschema jupyterlab-server
pip install matplotlib
```

Finally, we can open the Jupyter Notebook with the conda environment packages available to us by typing the line of code below.

```
jupyter notebook
```

You can also access the Jupyter Notebook by downloading the Jupyter Notebook extension for VS Code.

The final step is to switch the kernel from Python 3.8 to Python 3. While these are likely the same version of Python, the Python 3 kernel should be in the anaconda environment with RDKit, thus giving you access to the package. You can do this by going to the toolbar at the top of the page, selecting the `Kernel` option, going down to `Change kernel` and selecting the `Python 3` option. Your current kernel is shown in the top right-hand area of the screen, and it should show `Python 3` where it once showed `Python 3.8 (Anaconda ...)`. Now you are ready to use RDKit!
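If you are unsure whether the kernel switch worked, a quick check from inside the notebook is to ask which interpreter is running and whether it can find RDKit. This is a minimal sketch using only the standard library; the helper name `package_available` is hypothetical.

```python
import importlib.util
import sys


def package_available(name):
    """Return True if the current kernel can import the named package."""
    return importlib.util.find_spec(name) is not None


# The interpreter path should point inside the conda environment you created.
print("Interpreter:", sys.executable)
print("RDKit importable:", package_available("rdkit"))
```

If the second line prints `False`, the notebook is most likely still running a kernel outside the RDKit environment.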

RDKit has an object called a Molecule (denoted Mol) that is used to store information about a chemical compound. First, we must tell the program which molecule we would like to access. One way to do this is to get the molecule from a [SMILES string](https://daylight.com/dayhtml/doc/theory/theory.smiles.html). You can use [this](https://www.cheminfo.org/flavor/malaria/Utilities/SMILES_generator___checker/index.html) site to convert a molecule to a SMILES string. Below is an example of some of the features of RDKit on propanol.

We read in the SMILES string of the molecule using `Chem.MolFromSmiles()` with an argument of `OCCC` which is the SMILES string that corresponds to propanol. Run the below code segment to load in the molecule from a SMILES string.

In [None]:
# Import a list of Subpackages from RDKit
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import Crippen
from rdkit.Chem.Draw import rdMolDraw2D

# Define and show Propanol
m = Chem.MolFromSmiles('OCCC')
m


We can get some information about individual atoms and bonds of the molecule.

In [None]:
# Print out Properties of Individual Atoms
print("Atom 0 Symbol: " + m.GetAtomWithIdx(0).GetSymbol())
print("Atom 0 is in Ring: " + str(m.GetAtomWithIdx(0).IsInRing()))
print("Bond between atom 0 and atom 1: " + str(m.GetBondBetweenAtoms(0,1).GetBondType()))

We can get information about the molecule as a whole including calculation of chemical properties.

In [None]:
# Print out Properties of the Molecule
print("Number of Atoms: " + str(m.GetNumAtoms()))
print("Exact Molecular Weight: " + str(Descriptors.ExactMolWt(m)))
print("Number of Valence Electrons: " + str(Descriptors.NumValenceElectrons(m)))
print("Molecular LogP: " + str(Descriptors.MolLogP(m)))
print("Topological Polar Surface Area: " + str(Descriptors.TPSA(m)))

We can highlight atoms and bonds in the molecule to showcase certain structural attributes. We need to use `IPython.display` to show the image, as the RDKit drawing feature is not supported on the front end of Jupyter Notebook (there may be times when you need to do this again in the future as well).

In [None]:
# Import Image Display Software for Jupyter Python Code
from IPython.display import Image

# Load in a Molecule
m = Chem.MolFromSmiles("Oc1ccccc1O")
# Set up an Image Canvas
d = rdMolDraw2D.MolDraw2DCairo(500, 500)
# Draw the Molecule to the Canvas with Specific Atoms and Bonds highlighted
# (PrepareAndDrawMolecule already draws the molecule, so no separate DrawMolecule call is needed)
rdMolDraw2D.PrepareAndDrawMolecule(d, m, highlightAtoms=[0,1,6,7], highlightBonds=[0,6,7])
d.FinishDrawing()
# Write out the Image to a File and then Load in the File to the Notebook
d.WriteDrawingText('highlight_mol.png')  
Image(filename='highlight_mol.png') 

Here is an additional code example which may help with Question 8.

In [None]:
# Import Image Display Software for Jupyter Python Code
from IPython.display import Image
# Define the colors array (cell 0 is green, cell 1 is red); RDKit expects RGB values between 0 and 1
colours = [(0.0,1.0,0.0), (1.0,0.0,0.0)]
# Make a dictionary for the atom colors and define the highlighted atoms and which atoms are which colors
atom_cols = {}
h_atoms = [0,1,6,7]
green = [0,1]
red = [6,7]
# Assign the green atoms to the green color
for elem in green:
    atom_cols[elem] = colours[0]
    
# Assign the red atoms to the red color
for elem in red:
    atom_cols[elem] = colours[1]
        
# Load in a Molecule
m = Chem.MolFromSmiles("Oc1ccccc1O")
# Set up an Image Canvas
d = rdMolDraw2D.MolDraw2DCairo(500, 500)
# Draw the Molecule to the Canvas with the chosen atoms highlighted in their assigned colors
# (PrepareAndDrawMolecule already draws the molecule, so no separate DrawMolecule call is needed)
rdMolDraw2D.PrepareAndDrawMolecule(d, m, highlightAtoms=h_atoms, highlightAtomColors=atom_cols)
d.FinishDrawing()
# Write out the Image to a File and then Load in the File to the Notebook
d.WriteDrawingText('highlight_mol.png')  
Image(filename='highlight_mol.png') 
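RDKit drawing colours are given as (R, G, B) tuples of floats between 0 and 1, whereas colour pickers usually report integers from 0 to 255. A small conversion helper avoids mixing the two scales. The helper name `rgb255_to_unit` is hypothetical, and the atom indices simply reuse those from the example above.

```python
def rgb255_to_unit(red, green, blue):
    """Convert 0-255 RGB integers to the 0-1 floats used for drawing colours."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError(f"channel out of range: {channel}")
    return (red / 255, green / 255, blue / 255)


green = rgb255_to_unit(0, 255, 0)
red = rgb255_to_unit(255, 0, 0)

# Map atom indices to colours, in the shape passed as highlightAtomColors.
atom_cols = {0: green, 1: green, 6: red, 7: red}
print(atom_cols)
```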

We can compare molecules and try to identify similar regions in certain molecules. Here are two benzodiazepines with similar, yet non-identical structures. What regions do you think will be similar in the molecular structures below?

In [None]:
# Load in Both of the Molecules
m1 = Chem.MolFromSmiles('C1C(=O)NC2=C(C=C(C=C2)[N+](=O)[O-])C(=N1)C3=CC=CC=C3Cl')
m2 = Chem.MolFromSmiles('C1=CC=C(C=C1)C2=NC(C(=O)NC3=C2C=C(C=C3)Cl)O')

# Allow for more than one output for a notebook cell (Nothing to do with RDKit)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Print both molecules
m1
m2

Now run the code block below to see the regions of similarity between the two molecules. Green represents similar regions, whereas red represents regions that are different. The similarity is radius-based from the central atom, so the plot will look like a topographic map. The darker green regions with higher elevation have a higher degree of similarity than the lighter green regions with lower elevation. The same principle applies to the red regions of dissimilarity.

In [None]:
from rdkit.Chem import Draw
from rdkit.Chem.Draw import SimilarityMaps
import matplotlib

fig, maxweight = SimilarityMaps.GetSimilarityMapForFingerprint(m2, m1, SimilarityMaps.GetMorganFingerprint)
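The colouring can be read as a leave-one-atom-out calculation: as far as we understand the RDKit implementation, each atom's weight is the similarity computed with the full fingerprint minus the similarity computed after that atom's bits are removed, with Dice similarity as the default metric. The sketch below imitates this with plain Python sets standing in for fingerprints. The bit numbers and the atom-to-bit mapping are invented, and real Morgan bits depend on whole atom neighbourhoods, so this is a simplification.

```python
def dice_similarity(bits_a, bits_b):
    """Dice similarity between two fingerprints stored as sets of on-bits."""
    if not bits_a and not bits_b:
        return 0.0
    return 2.0 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


def atom_weights(reference, atom_bits):
    """Weight per atom: full similarity minus similarity without that atom."""
    full = set().union(*atom_bits.values())
    base = dice_similarity(reference, full)
    weights = {}
    for atom in atom_bits:
        others = [bits for other, bits in atom_bits.items() if other != atom]
        without_atom = set().union(*others)
        weights[atom] = base - dice_similarity(reference, without_atom)
    return weights


# Invented data: the reference fingerprint, and the bits each probe atom sets.
reference_bits = {1, 2, 3, 4, 5}
probe_atom_bits = {
    0: {1, 2},   # environment shared with the reference
    1: {3},      # shared
    2: {8, 9},   # environment absent from the reference
}

weights = atom_weights(reference_bits, probe_atom_bits)
for atom, weight in weights.items():
    label = "green (similar)" if weight > 0 else "red (different)"
    print(atom, round(weight, 3), label)
```

Removing an atom that shares features with the reference lowers the similarity, giving a positive weight (green); removing an atom whose features are absent from the reference raises it, giving a negative weight (red).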

We can also identify regions of interest for a specific descriptor. For example, we can visualize each atom's contribution to the molecular LogP using the following code.

In [None]:
from rdkit.Chem import rdMolDescriptors
contribs = rdMolDescriptors._CalcCrippenContribs(m1)
fig = SimilarityMaps.GetSimilarityMapFromWeights(m1,[x for x,y in contribs], colorMap='jet', contourLines=10)

We can also visualize the partial charges on the individual atoms of the molecule (think about how this relates to TPSA).

In [None]:
from rdkit.Chem import AllChem
AllChem.ComputeGasteigerCharges(m1)
contribs = [m1.GetAtomWithIdx(i).GetDoubleProp('_GasteigerCharge') for i in range(m1.GetNumAtoms())]
fig = SimilarityMaps.GetSimilarityMapFromWeights(m1, contribs, colorMap='jet', contourLines=10)

## Part 3

### Question 6

Use RDKit to import two of your favorite drugs of the same drug class. If you don't have a favorite drug or drug class, you can use [propionic acid derivative NSAIDs](https://en.wikipedia.org/wiki/Nonsteroidal_anti-inflammatory_drug#Propionic_acid_derivatives). You can get the SMILES string from the [PubChem Database](https://pubchem.ncbi.nlm.nih.gov/) or you can draw the structure in the SMILES translator described above.

### Answer

In [None]:
# I recommend re-adding all of the previous import statements here: if you exit the notebook, you would otherwise need
# to rerun all of the previous code blocks to have access to the imports again

# Add import statements here ...

# Allow for more than one output for a notebook cell (Nothing to do with RDKit)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Your code goes here ...

### Question 7



Print out the octanol-water partition coefficient (LogP) and topological polar surface area (TPSA) for both of your drugs. Check these values against published values if possible (at least the LogP should be on PubChem).

A higher LogP value indicates a molecule that is more attracted to the organic (non-polar) phase. A higher TPSA value indicates a molecule with more of its surface area coming from polar atoms. Write a short explanation of whether these LogP and TPSA values make sense relative to the structures of the two drugs. Which individual atoms do you think contribute most to the LogP of the molecule? Which individual atoms or groups do you think contribute most to the polarity and non-polarity of the molecule?
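To get a feel for the scale, recall that LogP is the base-10 logarithm of the ratio of the solute concentration in octanol to that in water, so each unit of LogP is a tenfold change in that ratio. The sketch below converts a few illustrative LogP values (not measured data) into the fraction of solute found in the octanol phase, assuming a neutral solute and, by default, equal phase volumes; the function names are hypothetical.

```python
def partition_coefficient(log_p):
    """Concentration ratio [octanol] / [water] for a given LogP."""
    return 10.0 ** log_p


def fraction_in_octanol(log_p, volume_ratio=1.0):
    """Fraction of solute in octanol; volume_ratio = V_octanol / V_water."""
    p = partition_coefficient(log_p)
    return p * volume_ratio / (1.0 + p * volume_ratio)


for log_p in (-1.0, 0.0, 2.0, 4.0):
    print(f"LogP {log_p:+.1f}: P = {partition_coefficient(log_p):g}, "
          f"fraction in octanol = {fraction_in_octanol(log_p):.4f}")
```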

### Answer

In [None]:
# Your code goes here ...

### Question 8

Get the similarity plot between the two molecules (it doesn't matter which molecule you use as the reference, just keep it consistent throughout the remainder of the assignment). Recreate this similarity plot by highlighting the similar atoms (no need to highlight the bonds) in green and the different atoms in red. To learn how to highlight in different colors, open the [RDKit guide](https://www.rdkit.org/docs/GettingStartedInPython.html) and use Ctrl+F (or Cmd+F on a Mac) to search for "colours" (the British spelling).

### Answer

In [None]:
# Add the code for the similarity map here ...

In [None]:
# Add the code for the highlighting here ...

### Question 9

Get the plot of LogP contributions for one of the two drugs you have chosen. Take note of the contributions and compare this list of atoms with your answer to Question 7. Do these individual LogP contributions make sense? Is there anything that you missed?

### Answer

In [None]:
# Add the code for the plot here ... 

### Question 10

Get the plot of partial charges of the molecule. How does this plot relate to the TPSA of the molecule? Do the partial charges that you see in the plot match up with your answer for Question 7? Do these partial charges make sense? Is there anything that you missed?

### Answer

In [None]:
# Add the code for your plot here ...