# Notebook 1: Exploring a Drug Target

Welcome! In this notebook you will learn about **proteins** and how scientists study them to design new medicines.

## What is a protein?

Proteins are tiny molecular machines inside our cells. They are made of long chains of **amino acids** that fold into specific 3D shapes. The shape of a protein determines what it does.

## What is drug design?

Many diseases are caused by proteins that don't work properly, or work *too much*. A **drug** is a small molecule that fits into a protein like a key fits into a lock. When the drug binds to the protein, it can turn the protein on or off.

The place where the drug binds is called the **binding site** (or binding pocket).

## Our target: Estrogen Receptor alpha (ERα)

Today we will study the **Estrogen Receptor alpha (ERα)**. This protein plays a key role in breast cancer: many breast cancers depend on estrogen signalling through ERα to grow. Drugs like **tamoxifen** block ERα and are used to treat these cancers.

We will look at a crystal structure called **3ERT**, which shows ERα bound to **4-hydroxytamoxifen (OHT)**, an active metabolite of tamoxifen.

---

Let's get started!

## Step 0: Install required packages

Run the cell below to install the Python libraries we need. This only needs to happen once per session.

In [None]:
%%capture
!pip install biopython py3Dmol

In [None]:
# Import all the libraries we'll use
import os
import warnings
import numpy as np

import py3Dmol
from Bio.PDB import PDBList, PDBParser, PDBIO, Select, NeighborSearch
from Bio.PDB.Polypeptide import is_aa

warnings.filterwarnings('ignore')

## Step 1: Set up our target

We define the PDB code of our crystal structure, the chain we want, and the ligand code.

> **What's happening here?** Every protein structure in the [Protein Data Bank (PDB)](https://www.rcsb.org/) has a unique 4-character code. Chain "A" is the specific copy of the protein we want. "OHT" is the 3-letter code for the drug molecule (4-hydroxytamoxifen) in the structure.

In [None]:
TARGET_PDB_ID = "3ERT"   # PDB code for ERα with 4-hydroxytamoxifen
PDB_CHAIN = "A"          # Chain A of the structure
LIGAND_CODE = "OHT"      # 3-letter code for 4-hydroxytamoxifen
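
A quick sanity check on the code you typed can save a confusing download error later. The sketch below assumes the classic PDB ID shape (one digit followed by three letters or digits); the helper name `looks_like_pdb_id` is made up for this notebook and is not part of BioPython.

```python
import re

# Assumed shape of a classic PDB ID: a digit, then three letters or digits.
_PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")

def looks_like_pdb_id(code):
    """Return True if `code` has the shape of a classic 4-character PDB ID."""
    return _PDB_ID_PATTERN.fullmatch(code) is not None

# "OHT" is a ligand code, not a PDB ID, so it should not pass the check.
for candidate in ["3ERT", "1ERE", "OHT"]:
    print(candidate, looks_like_pdb_id(candidate))
```

This only checks the shape of the code. It cannot tell you whether an entry with that ID actually exists in the PDB.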

## Step 2: Download the protein structure

We download the 3D structure directly from the Protein Data Bank using BioPython.

> **What's happening here?** Scientists determine protein structures using techniques like X-ray crystallography. The results are deposited in the PDB, a free online database. We're downloading the file that contains the exact 3D positions of every atom.

In [None]:
# Download the structure from the PDB
pdb_list = PDBList(server='https://files.wwpdb.org', verbose=False)
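# BioPython picks its own file name (e.g. pdb3ert.ent) and returns the path
downloaded_file = pdb_list.retrieve_pdb_file(TARGET_PDB_ID, file_format='pdb', pdir='.')

# Rename to a cleaner filename (os.replace also works when re-running the cell)
pdb_filename = f'{TARGET_PDB_ID}.pdb'
os.replace(downloaded_file, pdb_filename)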
print(f"Downloaded {pdb_filename} successfully!")
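
To get a feel for what is inside the downloaded file, here is a minimal sketch of how one atom line can be read by hand. In the PDB text format every field sits in fixed columns. The line below is a made-up toy record (not taken from 3ERT), and `parse_atom_line` is a helper invented for illustration; BioPython's `PDBParser` does this work for us in the next step.

```python
def parse_atom_line(line):
    """Split one ATOM/HETATM line into named fields (PDB fixed-column layout)."""
    return {
        "record": line[0:6].strip(),      # columns 1-6: ATOM or HETATM
        "atom_name": line[12:16].strip(), # columns 13-16
        "resname": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "resid": int(line[22:26]),        # columns 23-26
        "xyz": (float(line[30:38]),       # columns 31-54: x, y, z in Angstrom
                float(line[38:46]),
                float(line[46:54])),
        "element": line[76:78].strip(),   # columns 77-78
    }

# Toy record built with explicit field widths so the columns line up exactly.
toy_line = (
    f"{'ATOM':<6}{1:>5} {'CA':^4} {'ALA':>3} {'A'}{1:>4}{'':4}"
    f"{11.0:8.3f}{22.0:8.3f}{33.0:8.3f}{1.0:6.2f}{20.0:6.2f}{'':10}{'C':>2}"
)

print(toy_line)
print(parse_atom_line(toy_line))
```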

## Step 3: Extract the chain and ligand we want

Crystal structures often contain multiple copies of the protein and extra molecules (like water). We'll extract only chain A and the OHT ligand.

> **What's happening here?** Think of the downloaded file as a big box with multiple items inside. We're picking out just the pieces we need.

In [None]:
from Bio.PDB import Structure, Model, Chain

# Parse the downloaded structure
parser = PDBParser(QUIET=True)
full_structure = parser.get_structure(TARGET_PDB_ID, pdb_filename)

# Extract chain A with only the OHT ligand
new_structure = Structure.Structure("extracted")
new_model = Model.Model(0)
new_chain = Chain.Chain(PDB_CHAIN)

# Use only the first model so no residue is added twice
# (crystal structures normally contain a single model)
for residue in full_structure[0][PDB_CHAIN]:
    # Keep polymer residues from ATOM records (the protein)
    if residue.id[0] == ' ':
        new_chain.add(residue.copy())
    # Keep our ligand OHT (hetero residues are flagged 'H_<name>')
    elif residue.id[0] == f'H_{LIGAND_CODE}':
        new_chain.add(residue.copy())

new_model.add(new_chain)
new_structure.add(new_model)

# Save the cleaned structure
io = PDBIO()
io.set_structure(new_structure)
io.save(pdb_filename)

print(f"Extracted chain {PDB_CHAIN} with ligand {LIGAND_CODE}")
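
The filter above relies on the first element of `residue.id`, the hetero flag. As a small sketch of that convention: Bio.PDB uses a blank flag for polymer residues, `'W'` for water, and `'H_'` plus the residue name for other hetero groups. The function name `classify_residue_id` and the residue numbers below are made up for illustration.

```python
def classify_residue_id(residue_id):
    """Classify a Bio.PDB-style residue id: (hetero flag, number, insertion code)."""
    hetero_flag = residue_id[0]
    if hetero_flag == " ":
        return "polymer"                    # protein (or nucleic acid) residue
    if hetero_flag == "W":
        return "water"
    if hetero_flag.startswith("H_"):
        return "hetero:" + hetero_flag[2:]  # e.g. a ligand such as OHT
    return "unknown"

# Made-up residue ids, only to show the three cases.
for toy_id in [(" ", 10, " "), ("W", 900, " "), ("H_OHT", 901, " ")]:
    print(toy_id, "->", classify_residue_id(toy_id))
```

This is why water molecules disappear in Step 3: their flag is neither blank nor `H_OHT`, so neither branch keeps them.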

## Step 4: Visualize the protein in 3D!

Now the exciting part: let's look at the protein! We use **py3Dmol** to create an interactive 3D viewer right in this notebook.

> **What's happening here?** The protein backbone is shown as a ribbon ("cartoon"). The drug molecule (OHT) is shown as coloured sticks so you can see each atom. You can **click and drag** to rotate, **scroll** to zoom, and hold **Ctrl** while dragging to pan.

In [None]:
# Read the PDB file
with open(pdb_filename, 'r') as f:
    pdb_data = f.read()

# Create the 3D viewer
view = py3Dmol.view(width=800, height=500)
view.addModel(pdb_data, 'pdb')

# Show protein as cartoon
view.setStyle({'chain': PDB_CHAIN, 'hetflag': False},
              {'cartoon': {'color': 'lightblue'}})

# Show the ligand (OHT) as sticks
view.setStyle({'resn': LIGAND_CODE},
              {'stick': {'colorscheme': 'magentaCarbon', 'radius': 0.2}})

view.zoomTo()
view.show()
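
Each `setStyle` call pairs a *selection* (which atoms) with a *style* (how to draw them). Since the same pairs come back in later cells, one option is to keep them in a list and apply them in a loop. This is only a sketch: `build_style_rules` and `apply_style_rules` are hypothetical helpers, not py3Dmol functions. They reuse the exact selections and styles from the cell above.

```python
def build_style_rules(chain_id, ligand_code):
    """Return (selection, style) pairs in the order they should be applied."""
    return [
        # Protein: everything in the chain that is not a hetero atom
        ({"chain": chain_id, "hetflag": False},
         {"cartoon": {"color": "lightblue"}}),
        # Ligand: selected by residue name
        ({"resn": ligand_code},
         {"stick": {"colorscheme": "magentaCarbon", "radius": 0.2}}),
    ]

def apply_style_rules(view, rules):
    """Call view.setStyle once per rule; works with any object that has setStyle."""
    for selection, style in rules:
        view.setStyle(selection, style)
    return view

for selection, style in build_style_rules("A", "OHT"):
    print(selection, "->", style)
```

In the notebook you could then write `apply_style_rules(view, build_style_rules(PDB_CHAIN, LIGAND_CODE))` before `view.zoomTo()`.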

## Step 5: Zoom into the binding site

Let's zoom in on the **binding site** — the pocket where the drug sits. We'll also show the nearby protein residues (amino acids) that interact with the drug.

> **What's happening here?** We use BioPython's `NeighborSearch` to find all protein amino acids within 5 Ångström (Å) of the ligand. 1 Å = 0.1 nanometer — incredibly small! These nearby residues form the binding pocket.

In [None]:
def get_binding_site_residues(structure, ligand_code, chain_id, radius=5.0):
    """Find protein residues within `radius` Angstrom of the ligand."""
    model = structure[0]

    # Collect ligand atoms and protein atoms
    ligand_atoms = []
    protein_atoms = []

    for chain in model:
        if chain.id != chain_id:
            continue  # only consider the requested chain
        for residue in chain:
            if residue.get_resname() == ligand_code:
                ligand_atoms.extend(residue.get_atoms())
            elif is_aa(residue):
                protein_atoms.extend(residue.get_atoms())

    if not ligand_atoms:
        print(f"Warning: Ligand {ligand_code} not found!")
        return []

    # Search for protein atoms near the ligand
    ns = NeighborSearch(protein_atoms)
    nearby_atoms = set()
    for atom in ligand_atoms:
        neighbors = ns.search(atom.coord, radius, level='A')
        nearby_atoms.update(neighbors)

    # Get unique residues
    nearby_residues = set(atom.get_parent() for atom in nearby_atoms)
    residue_info = []
    for res in sorted(nearby_residues, key=lambda r: r.id[1]):
        residue_info.append({
            'resid': res.id[1],
            'resname': res.get_resname(),
            'chain': res.get_parent().id
        })

    return residue_info

# Find residues around the ligand
structure = parser.get_structure(TARGET_PDB_ID, pdb_filename)
binding_site = get_binding_site_residues(structure, LIGAND_CODE, PDB_CHAIN)

print(f"Found {len(binding_site)} residues within 5 \u00c5 of {LIGAND_CODE}:")
for res in binding_site:
    print(f"  {res['resname']} {res['resid']}")

In [None]:
# Visualize the binding site
with open(pdb_filename, 'r') as f:
    pdb_data = f.read()

view = py3Dmol.view(width=800, height=500)
view.addModel(pdb_data, 'pdb')

# Show protein as semi-transparent cartoon
view.setStyle({'chain': PDB_CHAIN, 'hetflag': False},
              {'cartoon': {'color': 'lightblue', 'opacity': 0.5}})

# Show the ligand as sticks
view.setStyle({'resn': LIGAND_CODE},
              {'stick': {'colorscheme': 'magentaCarbon', 'radius': 0.2}})

# Show binding site residues as thin sticks
resi_list = [res['resid'] for res in binding_site]
view.addStyle({'resi': resi_list, 'hetflag': False},
              {'stick': {'colorscheme': 'whiteCarbon', 'radius': 0.1}})

# Add residue labels
for res in binding_site:
    view.addResLabels({'resi': res['resid'], 'atom': 'CA'},
                      {'font': 'Arial', 'fontSize': 10,
                       'fontColor': 'black', 'backgroundColor': 'white',
                       'backgroundOpacity': 0.6})

# Zoom to the ligand
view.zoomTo({'resn': LIGAND_CODE})
view.show()

**Take a moment to explore the binding site!**

Can you see how the drug (magenta) is surrounded by amino acids? These amino acids form the "pocket" that holds the drug in place through various interactions:
- **Hydrogen bonds** (between polar atoms like N, O)
- **Hydrophobic interactions** (between non-polar/oily parts)
- **Van der Waals forces** (from atoms being close together)
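
As a rough illustration of how such interactions can be spotted from coordinates, the sketch below flags *possible* hydrogen bonds: two polar atoms (N or O) closer than about 3.5 Å. That cutoff is a common rule of thumb and is an assumption here; real tools also check angles and hydrogen positions, so treat the result as a list of candidates only. The function name and toy atoms are made up.

```python
import math

POLAR_ELEMENTS = {"N", "O"}
HBOND_MAX_DISTANCE = 3.5  # Angstrom; rule-of-thumb cutoff (assumption)

def is_hbond_candidate(atom_a, atom_b, cutoff=HBOND_MAX_DISTANCE):
    """Each atom is (element, (x, y, z)). True if both atoms are N or O
    and lie within `cutoff` Angstrom of each other."""
    (elem_a, xyz_a), (elem_b, xyz_b) = atom_a, atom_b
    if elem_a not in POLAR_ELEMENTS or elem_b not in POLAR_ELEMENTS:
        return False
    return math.dist(xyz_a, xyz_b) <= cutoff

# Toy atoms (not real data)
ligand_oxygen = ("O", (0.0, 0.0, 0.0))
near_nitrogen = ("N", (2.9, 0.0, 0.0))
near_carbon = ("C", (2.9, 0.0, 0.0))
far_nitrogen = ("N", (4.0, 0.0, 0.0))

print(is_hbond_candidate(ligand_oxygen, near_nitrogen))  # True
print(is_hbond_candidate(ligand_oxygen, near_carbon))    # False
print(is_hbond_candidate(ligand_oxygen, far_nitrogen))   # False
```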

---

## Step 6: Separate protein and ligand

For later steps (like docking), we need the protein and the drug molecule in separate files.

> **What's happening here?** We use BioPython's `Select` classes to filter atoms. `NonHetSelect` keeps only the protein amino acids. `ResSelect` keeps only our ligand.

In [None]:
class NonHetSelect(Select):
    """Select only protein residues (no ligands, no water)."""
    def accept_residue(self, residue):
        return 1 if residue.id[0] == ' ' else 0

class ResSelect(Select):
    """Select a specific residue by name (e.g. our ligand)."""
    def __init__(self, resname):
        self.resname = resname
    def accept_residue(self, residue):
        return 1 if residue.get_resname() == self.resname else 0

# Save protein and ligand separately
io = PDBIO()
io.set_structure(structure)

protein_file = f'protein-{TARGET_PDB_ID}.pdb'
ligand_file = f'ligand-{LIGAND_CODE}.pdb'

io.save(protein_file, NonHetSelect())
io.save(ligand_file, ResSelect(LIGAND_CODE))

print(f"Saved protein to: {protein_file}")
print(f"Saved ligand to:  {ligand_file}")
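
A simple way to check that the split worked is to count the record types in each file: the protein file should contain only `ATOM` lines and the ligand file only `HETATM` lines. The helper below is a sketch (the name `count_records` is made up) and is shown on toy lines so it runs on its own.

```python
from collections import Counter

def count_records(pdb_lines):
    """Count ATOM and HETATM records in an iterable of PDB lines."""
    counts = Counter()
    for line in pdb_lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            counts[record] += 1
    return counts

# Toy lines: only the first six characters matter for this check.
toy_lines = [
    "ATOM      1  N   ALA A   1",
    "ATOM      2  CA  ALA A   1",
    "HETATM    3  O1  LIG A   2",
    "TER",
    "END",
]

print(count_records(toy_lines))
```

On the real files you could run `count_records(open(protein_file))` and `count_records(open(ligand_file))` and compare the two results.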

## Step 7: Add hydrogen atoms to the protein

Crystal structures usually don't include hydrogen atoms: a hydrogen has only one electron, so it scatters X-rays too weakly to be located at typical resolutions. But hydrogens are important for chemistry (especially hydrogen bonds), so we need to add them.

> **What's happening here?** We use **OpenBabel**, a chemistry toolkit, to add the missing hydrogen atoms at chemically sensible positions. This step is often called "protonation".

In [None]:
%%capture
# Install OpenBabel (works in Google Colab)
!apt-get install -y openbabel > /dev/null 2>&1

In [None]:
# Add hydrogens to the protein
prepped_protein_file = f'{TARGET_PDB_ID}_prepped.pdb'
!obabel {protein_file} -O {prepped_protein_file} -h 2>/dev/null

print(f"Protein with hydrogens saved to: {prepped_protein_file}")
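
Because the command above hides its error messages, it is worth confirming that hydrogens really were added. One way, sketched below, is to count atoms whose element column reads `H` before and after. This assumes the element column (columns 77-78) is filled in; the helper name `count_hydrogens` and the toy lines are made up for illustration.

```python
def count_hydrogens(pdb_lines):
    """Count ATOM/HETATM records whose element column (77-78) is H."""
    total = 0
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[76:78].strip().upper() == "H":
            total += 1
    return total

# Toy lines padded so the element symbol lands in columns 77-78.
toy_before = [f"{'ATOM':<76}{'N':>2}", f"{'ATOM':<76}{'C':>2}"]
toy_after = toy_before + [f"{'ATOM':<76}{'H':>2}", f"{'ATOM':<76}{'H':>2}"]

print("hydrogens before:", count_hydrogens(toy_before))  # 0
print("hydrogens after: ", count_hydrogens(toy_after))   # 2
```

On the real files, compare `count_hydrogens(open(protein_file))` with `count_hydrogens(open(prepped_protein_file))`. If the second number is not clearly larger, OpenBabel probably did not run.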

## Step 8: Recombine protein and ligand

Now we combine the prepared protein (with hydrogens) and the ligand back into one file, so we can visualize the complete system.

In [None]:
complex_file = f'{TARGET_PDB_ID}-complex.pdb'

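with open(complex_file, 'w') as outfile:
    for fname in [prepped_protein_file, ligand_file]:
        with open(fname) as infile:
            for line in infile:
                # Skip only END/ENDMDL records, not every line containing "END"
                if not line.startswith('END'):
                    outfile.write(line)
    outfile.write('END\n')  # one terminator for the combined file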

print(f"Combined complex saved to: {complex_file}")

In [None]:
# Visualize the final prepared complex
# We load protein and ligand as separate models so bonds render correctly

with open(prepped_protein_file, 'r') as f:
    protein_data = f.read()
with open(ligand_file, 'r') as f:
    ligand_data = f.read()

view = py3Dmol.view(width=800, height=500)

# Model 0: protein
view.addModel(protein_data, 'pdb')
view.setStyle({'model': 0},
              {'cartoon': {'color': 'lightblue', 'opacity': 0.5}})

# Binding site residues as sticks (on protein model)
view.addStyle({'model': 0, 'resi': resi_list},
              {'stick': {'colorscheme': 'whiteCarbon', 'radius': 0.1}})

# Model 1: ligand (loaded separately so bonds are correct)
view.addModel(ligand_data, 'pdb')
view.setStyle({'model': 1},
              {'stick': {'colorscheme': 'magentaCarbon', 'radius': 0.2}})

view.zoomTo({'model': 1})
view.show()

The protein is now prepared and ready for docking (Notebook 3)!

---

## Summary

In this notebook, you learned how to:

1. Download a protein structure from the Protein Data Bank
2. Visualize proteins and drug molecules in 3D
3. Identify the binding site residues around a drug
4. Separate protein and ligand into individual files
5. Add hydrogen atoms to prepare the protein for computational studies

---

## Try it yourself!

Want to explore another structure? Try changing the PDB code below to **1ERE** — this is ERα bound to **estradiol** (the natural hormone, ligand code: EST).

Just change the variables and re-run the cells from Step 2 onwards!

In [None]:
# --- TRY IT YOURSELF ---
# Uncomment the lines below and re-run the notebook from Step 2

# TARGET_PDB_ID = "1ERE"   # ERα with estradiol
# PDB_CHAIN = "A"
# LIGAND_CODE = "EST"      # Estradiol
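
If you end up switching between structures often, one option is to keep the settings together in a small lookup table rather than editing three variables by hand. This is just a sketch: `TARGETS` and `select_target` are names invented here, and the table only contains the two structures mentioned in this notebook.

```python
# The two ERα structures used in this notebook
TARGETS = {
    "3ERT": {"chain": "A", "ligand": "OHT"},  # ERα with 4-hydroxytamoxifen
    "1ERE": {"chain": "A", "ligand": "EST"},  # ERα with estradiol
}

def select_target(pdb_id):
    """Return (PDB ID, chain, ligand code) for a known target."""
    key = pdb_id.upper()
    entry = TARGETS[key]  # raises KeyError for structures not in the table
    return key, entry["chain"], entry["ligand"]

TARGET_PDB_ID, PDB_CHAIN, LIGAND_CODE = select_target("1ere")
print(TARGET_PDB_ID, PDB_CHAIN, LIGAND_CODE)  # 1ERE A EST
```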