In [None]:
# Import required libraries
from Bio.PDB import PDBList, PDBParser, PDBIO, Select
import nglview as nv
import mdtraj as md
import pandas as pd
from openbabel import openbabel

<div class="alert alert-block alert-success">

# EPFL Course: CH-630 Drug Discovery

## Doctoral School EDCH

## Week 4: Exercises

</div>

<h1 style="color:green;"> Lesson 4.1: Introductory exercises on molecular modelling </h1>

Proteins are large biomolecules that perform a wide range of essential functions in the body, from catalyzing reactions to transmitting signals. In drug discovery, understanding the 3D structure of a protein target is crucial for designing molecules that bind to the protein and modulate its activity.

In this lesson, you will:

1. Download a real protein structure from the **RCSB Protein Data Bank (PDB)**
2. Visualize it interactively using **NGLView**
3. Perform basic structural manipulations such as:
  - Removing **crystallographic water**
  - Adding **hydrogens**
4. Visualize a protein-ligand complex and understand different molecular representations
5. Identify **molecular interactions** in the complex

You will finish with an **exercise** to practice analyzing the structure and identifying molecular interactions.

> 💡 This lesson will help you become familiar with protein structural files, molecular visualization, and structure manipulation.


<h2 style="color:green;"> Step 1: Retrieve a Protein Structure from the PDB </h2>

In this section, we will retrieve the **crystal structure** of a protein from the [RCSB Protein Data Bank (PDB)](https://www.rcsb.org), a public repository of experimentally determined biomolecular structures.

We will work with **hen egg white lysozyme**, with **PDB ID `1AKI`** (https://www.rcsb.org/structure/1AKI).

In [None]:
pdb_id = "1AKI"  # Lysozyme PDB ID
pdbl = PDBList()
pdbl.retrieve_pdb_file(pdb_id, pdir=".", file_format="pdb")

print(f"Downloaded structure for PDB ID {pdb_id}.")

<h2 style="color:green;"> Step 2: Visualize the structure with NGLView </h2>

To view this structure, we will use the `nglview` library, which enables interactive 3D visualization directly in the notebook.

### Cartoon representation

The cartoon representation is a simplified 3D representation of the protein's backbone structure.

It is typically used for visualizing the overall shape and folds of a protein, highlighting the secondary structures.

In [None]:
view = nv.show_file(f"pdb{pdb_id.lower()}.ent")
view.clear_representations()
view.add_cartoon()
view
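
The secondary-structure elements that the cartoon highlights are usually also listed near the top of a PDB file, as `HELIX` and `SHEET` records. As a rough sketch, they can be counted with plain Python. The function name `count_secondary_structure_records` and the example records are made up for illustration and are not taken from `1AKI`.

```python
from collections import Counter

def count_secondary_structure_records(pdb_text):
    """Count HELIX and SHEET records in the text of a PDB file."""
    counts = Counter()
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # record name occupies columns 1-6
        if record in ("HELIX", "SHEET"):
            counts[record] += 1
    return counts

# Hand-written, illustrative records (NOT real 1AKI data); only the record names matter here.
example = "\n".join([
    "HELIX    1   1 GLY A    4  GLY A   16  1",
    "HELIX    2   2 LEU A   25  PHE A   34  1",
    "SHEET    1   A 2 THR A  43  ARG A  45  0",
    "ATOM      1  N   LYS A   1",
])
counts = count_secondary_structure_records(example)
print(counts["HELIX"], counts["SHEET"])
```

For the file downloaded above, the same function could be called on `open("pdb1aki.ent").read()`.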

### Ball-and-stick representation

The ball-and-stick representation shows individual atoms as "balls" and the chemical bonds between them as "sticks".

This representation provides atomic information, making it useful for understanding interactions between atoms, ligands, and other small molecules.


In [None]:
view = nv.show_file(f"pdb{pdb_id.lower()}.ent")
view.clear_representations()
view.add_ball_and_stick(selection="all")
view

<h2 style="color:green;"> Step 3: Add Hydrogen Atoms and Remove Crystallographic Water </h2>

From this representation, we can notice:

- **Presence of Crystallographic Water**

- **Absence of Hydrogens**
    
Hydrogen atoms scatter X-rays only weakly, so X-ray crystallography usually cannot resolve them precisely. As a result, hydrogens are often absent in crystallographic structures.

However, hydrogens play crucial roles in protein function, including hydrogen bonding.

Therefore, adding hydrogens computationally is essential for accurate modeling and simulation of protein structures.

Here, we are going to strip out the crystal water molecules and add hydrogen atoms, as normally required when modelling a protein.

To remove crystallographic water, we will filter out all HOH residues from the PDB file using Biopython.



In [None]:
class NoWaterSelect(Select):
    def accept_residue(self, residue):
        return residue.get_resname() != "HOH"

# Load structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", f"pdb{pdb_id.lower()}.ent")

# Save without HOH
io = PDBIO()
io.set_structure(structure)
io.save(f"{pdb_id.lower()}_no_water.pdb", NoWaterSelect())

print(f"Saved '{pdb_id.lower()}_no_water.pdb' without crystallographic water molecules.")

After removing the crystallographic water molecules, we can inspect the end of the PDB file to verify that there are no longer any HOH residues.
While the original file (left) contains water molecules (HOH residues), the processed file (right) does not.

![Alt text](image-2.png)
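
The same check can be done programmatically instead of by eye. The sketch below is only an illustration: `count_residues` is a hypothetical helper, and the example records are hand-written and truncated after the residue number. It counts distinct residues with a given name using the fixed-column layout of `ATOM`/`HETATM` records.

```python
def count_residues(pdb_text, res_name="HOH"):
    """Count distinct residues named `res_name` in ATOM/HETATM records."""
    seen = set()
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        if line[17:20].strip() == res_name:      # residue name: columns 18-20
            # chain ID (column 22) plus residue number and insertion code (columns 23-27)
            seen.add((line[21], line[22:27].strip()))
    return len(seen)

# Illustrative records, truncated after the residue number (not real 1AKI data)
before = "\n".join([
    "ATOM      1  N   LYS A   1",
    "ATOM      2  CA  LYS A   1",
    "HETATM 1002  O   HOH A 130",
    "HETATM 1003  O   HOH A 131",
])
after = "\n".join([
    "ATOM      1  N   LYS A   1",
    "ATOM      2  CA  LYS A   1",
])
print(count_residues(before), count_residues(after))
```

On the real files, one would pass the contents of `pdb1aki.ent` and `1aki_no_water.pdb`; the second count should be zero if the water removal worked.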

To add hydrogen atoms, we will use `OpenBabel`.

In [None]:

# Create OpenBabel conversion object
ob_conversion = openbabel.OBConversion()
ob_conversion.SetInFormat("pdb")  # Input file format (pdb, in this case)

# Read the structure
mol = openbabel.OBMol()
ob_conversion.ReadFile(mol, f"{pdb_id.lower()}_no_water.pdb")

# Add hydrogens
mol.AddHydrogens()

# Save the structure with hydrogens added
ob_conversion.SetOutFormat("pdb")  # Output format
ob_conversion.WriteFile(mol, "1aki_H.pdb")


We can now inspect the PDB file to ensure that it contains H atoms.

![Alt text](image-3.png)
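
A quick way to confirm this without opening the file is to tally the element symbols. The following is a minimal sketch that assumes the output file fills in the element field (columns 77-78) of each coordinate record. `format_atom` and `count_elements` are hypothetical helpers, and the example atoms are invented.

```python
from collections import Counter

def format_atom(serial, name, res_name, res_seq, element, record="ATOM", chain="A"):
    """Build one fixed-column PDB coordinate record (illustrative; coordinates set to 0)."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
        + " " * 4                                            # insertion code + blank columns
        + f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
        + " " * 10                                           # columns 67-76 left blank
        + f"{element:>2}"                                    # element symbol, columns 77-78
    )

def count_elements(pdb_text):
    """Count atoms per element symbol in ATOM/HETATM records."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            counts[line[76:78].strip().upper()] += 1
    return counts

# Invented example: the same residue before and after adding hydrogens
without_h = "\n".join([
    format_atom(1, "N", "LYS", 1, "N"),
    format_atom(2, "CA", "LYS", 1, "C"),
])
with_h = "\n".join([
    without_h,
    format_atom(3, "H1", "LYS", 1, "H"),
    format_atom(4, "HA", "LYS", 1, "H"),
])
print(count_elements(without_h)["H"], count_elements(with_h)["H"])
```

Applied to `1aki_no_water.pdb` and `1aki_H.pdb`, the hydrogen count should go from zero to a large number.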

<h2 style="color:green;"> Step 4: Visualize a Protein-Ligand Complex </h2>

In computational drug design, one of the key goals is to understand how small molecules (ligands) bind to biological targets, typically proteins.

Here, we will visualize and analyze the T4 lysozyme L99A/M102Q, a model system widely used in protein-ligand binding studies.
In particular, we will focus on a crystallographic structure of this protein in complex with the small molecule 2-propylphenol.

We will start by downloading from the [RCSB Protein Data Bank (PDB)](https://www.rcsb.org) the structure with **PDB ID `3HTB`** (https://www.rcsb.org/structure/3HTB).

In [None]:
# Download 
pdb_id = "3HTB"  # 2-propylphenol in complex with T4 lysozyme L99A/M102Q 
pdbl = PDBList()
pdbl.retrieve_pdb_file(pdb_id, pdir=".", file_format="pdb")

print(f"Downloaded structure for PDB ID {pdb_id}.")

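# Rename the downloaded file (portable equivalent of the shell command `mv pdb3htb.ent 3htb.pdb`)
import os
os.replace("pdb3htb.ent", "3htb.pdb")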

To visualize the protein–ligand complex, we will represent the protein using a ribbon (cartoon) representation and the ligand using a ball-and-stick representation.

- **Ribbon representation (protein)**: highlights the secondary structure elements (α-helices, β-sheets, loops) and provides a clear, simplified view of the protein fold, making it easy to see the overall architecture.

- **Ball-and-stick representation (ligand)**: shows the ligand's atomic details.

The ligand 2-propylphenol is called JZ4 in the PDB structure.

In addition to the protein and the ligand, the structure contains crystallographic water and other species (such as the molecule named BME), which are just crystallization co-solvents. We will need to remove these from the structure.

If you open the PDB file `3htb.pdb` in a text editor, you'll notice that:

- Protein residues are labeled with the keyword ATOM

- Non-protein molecules including crystallographic water, the ligand JZ4, and other co-solvents (BME and PO4) are labeled with HETATM

HETATM stands for heteroatoms. In a PDB file, heteroatoms are any atoms that are not part of the standard protein or nucleic-acid residues.

In this step, we aim to remove all HETATM entries except for the ligand JZ4.

To do this, we will use the `Pandas` Python library to filter the relevant rows, and save the cleaned structure for further analysis.

![Alt text](image.png)
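
Before filtering, it can help to list which hetero residues a file actually contains. The sketch below is one possible way to do that with the standard library. `hetero_residue_counts` is a hypothetical helper, and the example records are hand-written, truncated after the residue number, and use made-up atom and residue numbers.

```python
from collections import Counter

def hetero_residue_counts(pdb_text):
    """Count distinct HETATM residues per residue name."""
    residues = set()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            # (residue name, chain ID, residue number + insertion code)
            residues.add((line[17:20].strip(), line[21], line[22:27].strip()))
    return Counter(name for name, _chain, _seq in residues)

# Illustrative records only (numbers are invented, not copied from 3HTB)
example = "\n".join([
    "ATOM      1  N   MET A   1",
    "HETATM 1301  P   PO4 A 165",
    "HETATM 1302  O1  PO4 A 165",
    "HETATM 1310  C1  BME A 166",
    "HETATM 1320  C4  JZ4 A 167",
    "HETATM 1321  C7  JZ4 A 167",
    "HETATM 1400  O   HOH A 168",
    "HETATM 1401  O   HOH A 169",
])
hetero = hetero_residue_counts(example)
print(sorted(hetero.items()))
```

Running it on the contents of `3htb.pdb` would show which residue names (for example `JZ4`, `BME`, `PO4`, `HOH`) appear as `HETATM`, and therefore which ones the filter needs to keep or drop.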

In [None]:
# function to read PDB files with pandas
def read_pdb(pdb_file):
    ''' this function reads a PDB file and returns a pandas df'''

    with open(pdb_file, 'r') as f:
        lines = f.readlines()
    
    data = [line for line in lines if line.startswith(('ATOM', 'HETATM'))]

    df = pd.DataFrame({
    'record': [line[0:6].strip() for line in data],
    'atom_number': [int(line[6:11]) for line in data],
    'atom_name': [line[12:16].strip() for line in data],
    'res_name': [line[17:20].strip() for line in data],
    'chain_id': [line[21] for line in data],
    'res_seq': [int(line[22:26]) for line in data],
    'x': [float(line[30:38]) for line in data],
    'y': [float(line[38:46]) for line in data],
    'z': [float(line[46:54]) for line in data],
    'element': [line[76:78].strip() for line in data],
    'line': data
    })

    return df

In [None]:
# Use function to read PDB file and return a Pandas df
df = read_pdb('3htb.pdb') 

# Keep only ATOM lines (protein) and HETATM lines where the residue name is JZ4
filtered = df[(df['record'] == 'ATOM') | ((df['record'] == 'HETATM') & (df['res_name'] == 'JZ4'))]

# Save to a new PDB file with the desired atoms
with open('3htb_clean.pdb', 'w') as out:
    out.writelines(filtered['line'].tolist())

You should now see only the protein and JZ4 atoms in the PDB file.

![Alt text](image-1.png)

In [None]:
# Load the cleaned structure (hydrogens have not been added yet)
view = nv.show_file('3htb_clean.pdb')
view.clear_representations()

view.add_ball_and_stick()
view

Now, we will add hydrogen atoms using `OpenBabel`.

In [None]:
# Create OpenBabel conversion object
ob_conversion = openbabel.OBConversion()
ob_conversion.SetInFormat("pdb")  # Input file format (pdb, in this case)

# Read the structure
mol = openbabel.OBMol()
ob_conversion.ReadFile(mol, '3htb_clean.pdb')

# Add hydrogens
mol.AddHydrogens()

# Save the structure with hydrogens added
ob_conversion.SetOutFormat("pdb")  # Output format
ob_conversion.WriteFile(mol, "3htb_H.pdb")

In [None]:
# Load the cleaned structure with hydrogens
view = nv.show_file('3htb_H.pdb')
view.clear_representations()

view.add_ball_and_stick()
view

This latter structure should contain hydrogens.

Now we visualize the protein in a cartoon representation and the ligand in a ball-and-stick representation.

In [None]:
pdb_file = "3htb_H.pdb"

# Load the PDB file into MDTraj
traj = md.load(pdb_file)

# Visualize the structure using NGLView
view = nv.show_mdtraj(traj)

# Clear any default representations
view.clear_representations()

# Add cartoon representation for the protein and ball-and-stick for the ligand
view.add_cartoon(selection='protein')
view.add_ball_and_stick(selection='ligand')

# Display the structure
view


<h2 style="color:green;"> Step 5: Study the Interaction of a Protein–Ligand Complex </h2>


In structural biology and drug design, it is essential to identify how a ligand binds within a protein's active site and which non-covalent interactions stabilize the complex.
In this exercise, we will analyze the binding of 2-propylphenol (residue name JZ4) to T4 lysozyme L99A/M102Q (PDB ID: `3HTB`).


Several tools are available for visualizing and analyzing protein–ligand interactions:

- **PyMOL** – A powerful molecular visualization tool that can highlight hydrogen bonds, salt bridges, π–π stacking, and hydrophobic surfaces in 3D.

- **LigPlot+** – A standalone program that generates clear 2D schematic diagrams of protein–ligand interactions.
It automatically detects hydrogen bonds and hydrophobic contacts and is widely used in structural biology publications.

- **PLIP** (Protein–Ligand Interaction Profiler) – An automated tool that detects and reports all types of non-covalent interactions between proteins and ligands, outputting both textual and 3D visualizations.



Here, we will use PyMOL to analyze the interactions between the protein and the ligand.
PyMOL allows us to visually inspect the ligand in the binding pocket, identify hydrogen bonds and hydrophobic interactions, and explore the structural features that stabilize the complex.

Please download PyMOL at:
🔗 https://www.pymol.org/

We will load the 3HTB structure, isolate the ligand, and use PyMOL to highlight the key interactions between 2-propylphenol and T4 lysozyme. In particular, we will highlight:
- hydrophobic residues within **4 Å** of the ligand (if present)
- hydrogen bonds between ligand and protein (if present)

1. Open PyMOL and load 3htb_H.pdb

2. Color the ligand in yellow (type your code in the command line __>PyMOL__ bar):

    `color yellow, resn JZ4 and elem C`

3. Display hydrogen bonds between ligand and protein in cyan:
    
    `dist hbonds, resn JZ4, polymer, mode=2`
    
    `color cyan, hbonds`

4. Display hydrophobic residues near the ligand in orange:
    
    `select hydrophobic_near_ligand, byres (resn JZ4 around 4 and resn ALA+VAL+LEU+ILE+PHE+TYR+TRP+MET+PRO)`
    
    `show sticks, hydrophobic_near_ligand`

    `color orange, hydrophobic_near_ligand and elem C`

5. Label the selected protein residues to identify the amino acids interacting with the ligand:

    `label hydrophobic_near_ligand and name CA, "%s%s" % (resn, resi)`
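
The distance-based selection in step 4 can also be approximated outside PyMOL. The sketch below is a simplified illustration, not a reimplementation of PyMOL's selection engine. It reports hydrophobic residues that have at least one atom within a cutoff of any ligand atom. The helpers `format_atom`, `parse_atoms`, and `hydrophobic_residues_near` are hypothetical, chain IDs are ignored, and the example coordinates and residue numbers are invented.

```python
import math

# Residue names used in the PyMOL selection above
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "PHE", "TYR", "TRP", "MET", "PRO"}

def format_atom(serial, name, res_name, res_seq, xyz, element, record="ATOM", chain="A"):
    """Build one fixed-column PDB coordinate record (illustrative helper)."""
    x, y, z = xyz
    return (
        f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}"
        + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}"
        + " " * 10
        + f"{element:>2}"
    )

def parse_atoms(pdb_text):
    """Extract residue name, residue number and coordinates from ATOM/HETATM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "res_name": line[17:20].strip(),
                "res_seq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms

def hydrophobic_residues_near(pdb_text, ligand="JZ4", cutoff=4.0):
    """Return (res_name, res_seq) of hydrophobic residues within `cutoff` Å of the ligand."""
    atoms = parse_atoms(pdb_text)
    ligand_xyz = [a["xyz"] for a in atoms if a["res_name"] == ligand]
    near = set()
    for a in atoms:
        if a["res_name"] in HYDROPHOBIC and any(
            math.dist(a["xyz"], p) <= cutoff for p in ligand_xyz
        ):
            near.add((a["res_name"], a["res_seq"]))
    return sorted(near, key=lambda r: r[1])

# Invented geometry: one ligand atom at the origin and four protein atoms around it
example = "\n".join([
    format_atom(1, "CD1", "LEU", 84, (3.0, 0.0, 0.0), "C"),
    format_atom(2, "OG", "SER", 90, (2.0, 0.0, 0.0), "O"),
    format_atom(3, "CB", "ALA", 99, (10.0, 0.0, 0.0), "C"),
    format_atom(4, "CG1", "VAL", 111, (0.0, 3.9, 0.0), "C"),
    format_atom(5, "C4", "JZ4", 167, (0.0, 0.0, 0.0), "C", record="HETATM"),
])
nearby = hydrophobic_residues_near(example)
print(nearby)
```

In this toy example the serine is close to the ligand but is not reported because it is not in the hydrophobic list, and the alanine is hydrophobic but too far away. The function could be applied to the contents of `3htb_H.pdb` to cross-check the residues highlighted in PyMOL.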


<div class="alert alert-block alert-info"> 

What do you see? 

What are the main protein-ligand interactions? 

Which protein residues are involved? </div>

<h2 style="color:orange;"> Exercise </h2>

Now, try to practice what we learned:

1. Download a Protein–Ligand Complex from RCSB Protein Data Bank;

2. Clean the Structure (remove crystallographic water molecules and other non-relevant heteroatoms);

3. Add Hydrogens;

4. Visualize the Complex: display the protein as a cartoon/ribbon and the ligand as ball-and-stick using NGLview, PyMOL, ChimeraX, VMD or another viewer;

5. Describe the Interactions: identify and describe different interactions between the ligand and the protein.