# Exercise 01: Exploring and Analyzing Protein Structures in the PDB Database

## Learning Objectives

In this exercise, you will learn to:
- Query the PDB database programmatically
- Extract and analyze structural quality metrics
- **Critically evaluate** data quality and make informed decisions
- **Debug** and improve code for real-world scenarios
- **Interpret** structural data in biological context

## Using AI Tools

You may use AI assistants (ChatGPT, Claude, etc.) for:
- Understanding syntax and library functions
- Debugging code errors
- Generating code snippets

However, **you must demonstrate**:
- Your own biological reasoning and interpretation
- Justification for decisions (not just "AI said so")
- Critical evaluation of results

**The exercises assess your understanding and judgment, not code generation.**

## Introduction and Basic Skills

We'll start by learning how to query PDB and extract structural information. Then you'll apply these skills to more complex, real-world problems.

In [1]:
# Check if running on Google Colab
try:
    from google.colab import drive
    is_google_colab = True
except ImportError:
    is_google_colab = False

# If on Google Colab, install the package
if is_google_colab:
    %pip install numpy==2.0.2 scipy==1.16.2 pandas==2.2.2 plotly==5.24.1 biopandas==0.4.1 pypdb==2.4 tqdm==4.67.1 py3dmol==2.4.0

# NOTE: Ignore specific warning message from ipykernel=5.5.6
import warnings
import os



Collecting biopandas==0.4.1
  Downloading biopandas-0.4.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting pypdb==2.4
  Downloading pypdb-2.4-py3-none-any.whl.metadata (3.0 kB)
Collecting py3dmol==2.4.0
  Downloading py3Dmol-2.4.0-py2.py3-none-any.whl.metadata (1.9 kB)
Downloading biopandas-0.4.1-py2.py3-none-any.whl (878 kB)
Downloading pypdb-2.4-py3-none-any.whl (40 kB)
Downloading py3Dmol-2.4.0-py2.py3-none-any.whl (7.0 kB)
Installing collected packages: py3dmol, pypdb, biopandas
Successfully installed biopandas-0.4.1 py3dmol-2.4.0 pypdb-2.4


In [2]:
# Import libraries
import math
import requests
import json
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import pypdb
from biopandas.pdb import PandasPdb
import py3Dmol
from tqdm import tqdm


# Suppress all warnings at the Python level
warnings.filterwarnings('ignore')

# Also set environment variable to suppress warnings
os.environ['PYTHONWARNINGS'] = 'ignore'

print("✓ All libraries loaded successfully")

✓ All libraries loaded successfully




### PDB Protein Data Bank

The RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank) is a comprehensive database for the 3D structural information of biological macromolecules. The aim of RCSB PDB is to provide open access to 3D structural data of biological macromolecules to advance research and understanding of molecular biology and biochemistry. The RCSB PDB also provides a variety of tools and resources. Users can perform simple and advanced searches based on annotations relating to sequence, structure and function. These molecules are visualized, downloaded, and analyzed by users who range from students to specialized scientists.

### Protein of interest

Today we will take a look at the FtsZ protein from E. coli. This protein is essential for bacterial cell division, forming the Z-ring that constricts to divide the cell. FtsZ is a tubulin homolog and is considered an attractive target for antimicrobial drug development.

The UniProt ID of this protein is P0A9A6. You can find more information about this protein at [UniProt P0A9A6](https://www.uniprot.org/uniprot/P0A9A6).

To search the PDB database, copy the UniProt ID (P0A9A6) into the search box at [RCSB PDB](https://www.rcsb.org/). Feel free to explore the website and the information available for this protein.

### Programmatic access to PDB

While searching through the website is straightforward, repeating searches to systematically analyze structures of interest is only practical with programmatic access.

Therefore, we will use the PDB Search API to perform queries to the PDB database.

How does it work? The API lets you search the PDB database with a JSON query in a URL and retrieve results in JSON format for further extraction.

The API is well documented in the [PDB Search API documentation](https://search.rcsb.org/index.html#search-api). You can find there also [examples of queries](https://search.rcsb.org/index.html#examples).
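
For a GET request, the JSON query travels in a URL parameter named `json`, so it should be URL-encoded. The following is a minimal sketch of that step using only the standard library; the helper name `build_search_url` is illustrative and not part of any library.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"  # endpoint used in this notebook


def build_search_url(query: dict) -> str:
    """Return a GET URL carrying the JSON query in the `json` parameter."""
    # urlencode percent-encodes braces, quotes and spaces so the URL stays valid
    return SEARCH_URL + "?" + urlencode({"json": json.dumps(query)})


query = {
    "query": {
        "type": "terminal",
        "service": "full_text",
        "parameters": {"value": "P0A9A6"},
    },
    "return_type": "entry",
}

url = build_search_url(query)

# Round trip: decoding the parameter recovers the original query
decoded = json.loads(parse_qs(urlsplit(url).query)["json"][0])
print(decoded == query)
```

No request is sent here; the sketch only shows how the query dictionary becomes part of the URL and can be recovered from it.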

We will use pypdb to easily access and download PDB data based on metadata like protein and ligand names.

### BioPandas

BioPandas simplifies the handling of protein structure files, such as PDB files, for computational biologists. It utilizes pandas DataFrames, widely used in data science, to work with biological macromolecule structures from PDB and MOL2 files in structural biology.

We will use it to load the structure with the best resolution (the lowest value in Å) and work with its atomic records.

### 1. Querying the PDB Database

The PDB provides a REST API that we can query with JSON. Here's how to search for all structures of a protein:

In [3]:
# Build a search query for FtsZ using UniProt ID
search_dict = {
    "query": {
        "type": "terminal",
        "label": "full_text",
        "service": "full_text",
        "parameters": {"value": "P0A9A6"},  # UniProt ID for FtsZ
    },
    "return_type": "entry",
    "request_options": {
        "paginate": {"start": 0, "rows": 100},  # Get up to 100 results
        "results_content_type": ["experimental"],  # Only experimental structures
    },
}

# Send request to PDB API
response = requests.get(
    "https://search.rcsb.org/rcsbsearch/v2/query?json=" + json.dumps(search_dict)
)
data = response.json()

print(f"Found {data['total_count']} structures for FtsZ")
print(f"Retrieved {len(data['result_set'])} in this query")

Found 11 structures for FtsZ
Retrieved 11 in this query
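
The query above asks for at most 100 rows. For proteins with more entries, `total_count` would exceed the number of identifiers returned, and further pages would be needed. The sketch below only computes the `paginate` settings for each page; sending the requests is left out so that it runs offline. The function name `page_windows` is hypothetical.

```python
from typing import Iterator


def page_windows(total_count: int, rows: int = 100) -> Iterator[dict]:
    """Yield one `paginate` dict per page needed to cover `total_count` hits."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    for start in range(0, total_count, rows):
        yield {"start": start, "rows": rows}


# 11 hits fit in one page; 250 hits need three pages of 100
print(list(page_windows(11)))
print(list(page_windows(250)))
```

Each yielded dictionary could be placed under `request_options["paginate"]` in a copy of the search query, with the resulting identifiers concatenated across pages.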


In [4]:
# Extract PDB IDs from results
pdb_ids = [entry["identifier"] for entry in data["result_set"]]
print(f"\nFirst 10 PDB IDs: {pdb_ids[:10]}")


First 10 PDB IDs: ['6UNX', '6LL6', '6UMK', '5KOA', '1F47', '8GZX', '5HSZ', '8GZY', '5HAW', '5K58']


### 2. Extracting Structural Information

For each structure, we can extract quality metrics such as resolution, R-factors, and the experimental method. Whereas the previous step used the Search API to find matching entries, here we use the pypdb library to retrieve detailed information about each PDB entry.

In [5]:
# Example: Get detailed info for one structure
example_pdb = pdb_ids[0]
info = pypdb.get_info(example_pdb)

print(f"Structure: {example_pdb}")
print(f"Title: {info['struct']['title'][:80]}...")
print(f"Method: {info['exptl'][0]['method']}")
print(f"Year: {info['rcsb_accession_info']['deposit_date'][:4]}")

# Resolution (only for X-ray/Cryo-EM)
if "refine" in info and info["refine"]:
    resolution = info["refine"][0].get("ls_dres_high")
    if resolution:
        print(f"Resolution: {resolution} Å")

Structure: 6UNX
Title: Structure of E. coli FtsZ(L178E)-GTP complex...
Method: X-RAY DIFFRACTION
Year: 2019
Resolution: 1.4 Å


### 3. Batch Processing with Error Handling

When processing many structures, we need robust code that handles missing data:

In [6]:
def extract_structure_info(pdb_id):
    """Extract key information from a PDB entry."""
    try:
        info = pypdb.get_info(pdb_id)

        # Basic info (always present)
        result = {
            "pdb_id": pdb_id,
            "method": info["exptl"][0]["method"],
            "year": info["rcsb_accession_info"]["deposit_date"][:4],
        }

        # Resolution (may be missing for NMR)
        if "refine" in info and info["refine"]:
            result["resolution"] = info["refine"][0].get("ls_dres_high")
            result["r_work"] = info["refine"][0].get("ls_rfactor_rwork")
            result["r_free"] = info["refine"][0].get("ls_rfactor_rfree")
        else:
            result["resolution"] = None
            result["r_work"] = None
            result["r_free"] = None

        return result

    except Exception as e:
        print(f"Error processing {pdb_id}: {e}")
        return None


# Process up to the first 20 structures as an example
structures_data = []
for pdb_id in tqdm(pdb_ids[:20], desc="Processing structures"):
    # Use a separate name so the search response stored in `data` is not overwritten
    record = extract_structure_info(pdb_id)
    if record:
        structures_data.append(record)

# Create DataFrame
df = pd.DataFrame(structures_data)
print(f"\n✓ Successfully processed {len(df)} structures")
df.head()

Processing structures: 100%|██████████| 11/11 [00:02<00:00,  3.99it/s]


✓ Successfully processed 11 structures





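|   | pdb_id | method | year | resolution | r_work | r_free |
|---|--------|--------|------|------------|--------|--------|
| 0 | 6UNX | X-RAY DIFFRACTION | 2019 | 1.4 | 0.183 | 0.2033 |
| 1 | 6LL6 | X-RAY DIFFRACTION | 2019 | 2.5 | 0.1836 | 0.2415 |
| 2 | 6UMK | X-RAY DIFFRACTION | 2019 | 1.35 | 0.1862 | 0.2045 |
| 3 | 5KOA | X-RAY DIFFRACTION | 2016 | 2.67 | 0.2301 | 0.2655 |
| 4 | 1F47 | X-RAY DIFFRACTION | 2000 | 1.95 | 0.205 | 0.251 |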
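
The `if "refine" in info` checks above can become repetitive when many optional fields are read. One possible pattern is a small helper that walks a path of keys and indices and returns a default when any step is missing. The helper `dig` and the two sample dictionaries below are illustrative; the dictionaries only mimic the fields read in this notebook and are not real API responses.

```python
from typing import Any, Sequence


def dig(obj: Any, path: Sequence, default: Any = None) -> Any:
    """Follow `path` (dict keys or list indices) through nested data."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            # Missing key, empty list, or a None in the middle of the path
            return default
    return obj


# Hand-made examples shaped like the fields read above (not real API output)
xray_like = {
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
    "refine": [{"ls_dres_high": 1.4}],
}
nmr_like = {"exptl": [{"method": "SOLUTION NMR"}]}

print(dig(xray_like, ["refine", 0, "ls_dres_high"]))  # 1.4
print(dig(nmr_like, ["refine", 0, "ls_dres_high"]))   # None
```

The trade-off is that a typo in a key is silently turned into the default value, so a helper like this suits optional fields better than fields that must always be present.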


### 4. Basic Analysis and Visualization

In [7]:
# Summary statistics
print("=== Dataset Summary ===")
print(f"Total structures: {len(df)}")
print(f"\nBy experimental method:")
print(df["method"].value_counts())

# Resolution statistics (X-ray only)
xray_df = df[df["method"] == "X-RAY DIFFRACTION"]
resolutions = xray_df["resolution"].dropna()
if len(resolutions) > 0:
    print(
        f"\nX-ray resolution range: {resolutions.min():.2f} - {resolutions.max():.2f} Å"
    )
    print(f"Mean resolution: {resolutions.mean():.2f} Å")

=== Dataset Summary ===
Total structures: 11

By experimental method:
method
X-RAY DIFFRACTION    11
Name: count, dtype: int64

X-ray resolution range: 1.35 - 2.77 Å
Mean resolution: 2.17 Å


In [8]:
# Simple visualization: Resolution distribution
# Plot X-ray resolutions
xray_res = df[df["method"] == "X-RAY DIFFRACTION"]["resolution"].dropna()
if len(xray_res) > 0:
    # Create histogram with plotly
    fig = go.Figure()

    # Add histogram
    fig.add_trace(
        go.Histogram(
            x=xray_res,
            nbinsx=15,
            opacity=0.7,
            name="Resolution Distribution",
            marker=dict(line=dict(color="black", width=1)),
        )
    )

    # Add mean line
    mean_res = xray_res.mean()
    fig.add_vline(
        x=mean_res,
        line_dash="dash",
        line_color="red",
        line_width=2,
        annotation_text=f"Mean: {mean_res:.2f} Å",
    )

    # Update layout
    fig.update_layout(
        title="Resolution Distribution (X-ray Structures)",
        xaxis_title="Resolution (Å)",
        yaxis_title="Number of Structures",
        showlegend=False,
        width=800,
        height=400,
    )

    fig.show()
else:
    print("No X-ray structures with resolution data to plot")

### 5. Quick Structure Visualization with py3Dmol

In [9]:
# Visualize one structure
best_structure = (
    df[df["method"] == "X-RAY DIFFRACTION"].nsmallest(1, "resolution").iloc[0]
)
print(
    f"Visualizing: {best_structure['pdb_id']} (Resolution: {best_structure['resolution']:.2f} Å)"
)

view = py3Dmol.view(query=f"pdb:{best_structure['pdb_id']}", width=800, height=500)
view.setStyle({"cartoon": {"color": "spectrum"}})
view.zoomTo()
view.show()

Visualizing: 6UMK (Resolution: 1.35 Å)


### 6. Structure Analysis with BioPandas

Now let's use BioPandas to analyze the actual structural data. We'll fetch a PDB file and inspect its atomic records, which the per-residue analysis in the next section builds on.

In [10]:
# Fetch a specific PDB structure using BioPandas
# Let's use the best resolution structure we found earlier
pdb_id = best_structure["pdb_id"]
print(f"Analyzing structure: {pdb_id}")

# Fetch PDB file using BioPandas
ppdb = PandasPdb().fetch_pdb(pdb_id)

# Get the ATOM records (protein atoms)
atoms_df = ppdb.df["ATOM"]

print(f"Total atoms in structure: {len(atoms_df)}")
print(f"Unique residues: {atoms_df['residue_number'].nunique()}")

# Display first few rows to understand the data structure
print("\nFirst few rows of atomic data:")
atoms_df.head()

Analyzing structure: 6UMK
Total atoms in structure: 2176
Unique residues: 303

First few rows of atomic data:


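|   | record_name | atom_number | blank_1 | atom_name | alt_loc | residue_name | blank_2 | chain_id | residue_number | insertion | ... | x_coord | y_coord | z_coord | occupancy | b_factor | blank_4 | segment_id | element_symbol | charge | line_idx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATOM | 1 |  | N |  | ASP |  | A | 10 |  | ... | -1.433 | -11.819 | -1.932 | 1.0 | 49.09 |  |  | N |  | 519 |
| 1 | ATOM | 2 |  | CA |  | ASP |  | A | 10 |  | ... | -0.847 | -10.61 | -1.382 | 1.0 | 43.24 |  |  | C |  | 520 |
| 2 | ATOM | 3 |  | C |  | ASP |  | A | 10 |  | ... | 0.644 | -10.802 | -1.114 | 1.0 | 27.97 |  |  | C |  | 521 |
| 3 | ATOM | 4 |  | O |  | ASP |  | A | 10 |  | ... | 1.147 | -11.92 | -0.962 | 1.0 | 22.28 |  |  | O |  | 522 |
| 4 | ATOM | 5 |  | CB |  | ASP |  | A | 10 |  | ... | -1.559 | -10.204 | -0.091 | 1.0 | 54.88 |  |  | C |  | 523 |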


### 7. B-factor Analysis

B-factors (temperature factors) indicate atomic mobility and flexibility in protein structures. Let's analyze B-factor patterns to understand protein dynamics and identify flexible regions.
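
For intuition about the scale: under the usual isotropic model, a B-factor relates to the mean-square atomic displacement by B = 8π²⟨u²⟩. The sketch below converts a B-factor into a root-mean-square displacement in Å. The function name is illustrative, and the example value of 20.7 Å² is the mean Cα B-factor reported for 6UMK further down in this notebook.

```python
import math


def rms_displacement(b_factor: float) -> float:
    """Convert an isotropic B-factor (Å^2) to an RMS displacement (Å)."""
    if b_factor < 0:
        raise ValueError("B-factors cannot be negative")
    # B = 8 * pi^2 * <u^2>  =>  sqrt(<u^2>) = sqrt(B / (8 * pi^2))
    return math.sqrt(b_factor / (8 * math.pi ** 2))


print(f"{rms_displacement(20.7):.2f} Å")  # about 0.51 Å
```

Refined B-factors also absorb static disorder and model error, so this number is best read as a rough indication of positional spread and not as a direct measurement of motion.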

In [11]:
# Get CA atoms for B-factor analysis
ca_atoms = atoms_df[atoms_df["atom_name"] == "CA"].copy()

print(f"Analyzing B-factors for {len(ca_atoms)} residues")

Analyzing B-factors for 303 residues


In [12]:
# Simple B-factor vs Residue Number Plot with gaps
# Create complete sequence with None for missing residues
min_res = ca_atoms["residue_number"].min()
max_res = ca_atoms["residue_number"].max()

# Create a complete range of residue numbers
all_residues = list(range(min_res, max_res + 1))

# Create mapping of residue number to B-factor
bfactor_dict = dict(zip(ca_atoms["residue_number"], ca_atoms["b_factor"]))

# Create complete lists with None for missing residues
complete_bfactors = [bfactor_dict.get(res, None) for res in all_residues]

missing_count = complete_bfactors.count(None)
print(f"Residue range: {min_res} to {max_res}")
print(f"Present residues: {len(ca_atoms)}")
print(f"Missing residues: {missing_count}")

# Plot with None values (plotly will create gaps automatically)
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=all_residues,
        y=complete_bfactors,
        mode="lines+markers",
        name="B-factor",
        line=dict(color="blue", width=2),
        marker=dict(size=4),
        connectgaps=False,  # This ensures gaps appear as breaks
    )
)

# Add mean line for reference
mean_bfactor = ca_atoms["b_factor"].mean()
fig.add_hline(
    y=mean_bfactor,
    line_dash="dash",
    line_color="red",
    annotation_text=f"Mean: {mean_bfactor:.1f}",
)

fig.update_layout(
    title="B-factor vs Residue Number",
    xaxis_title="Residue Number",
    yaxis_title="B-factor (Ų)",
    width=1200,
    height=500,
    showlegend=False,
)

fig.show()

print(f"\nB-factor statistics:")
print(f"Mean: {mean_bfactor:.1f} Ų")
print(f"Range: {ca_atoms['b_factor'].min():.1f} - {ca_atoms['b_factor'].max():.1f} Ų")

Residue range: 10 to 316
Present residues: 303
Missing residues: 4



B-factor statistics:
Mean: 20.7 Ų
Range: 10.0 - 82.0 Ų
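
The count of missing residues says how many positions lack coordinates, but not where they are. A short sketch that groups missing residue numbers into contiguous gaps is shown below. It works on a plain list of residue numbers; the function name `find_gaps` and the example numbers are made up for illustration and are not taken from 6UMK.

```python
from typing import List, Tuple


def find_gaps(residue_numbers: List[int]) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) for each break in the numbering."""
    ordered = sorted(set(residue_numbers))
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > 1:
            gaps.append((prev + 1, curr - 1))
    return gaps


example = [10, 11, 12, 15, 16, 20]
print(find_gaps(example))  # [(13, 14), (17, 19)]
```

With the DataFrame from this notebook, a call such as `find_gaps(ca_atoms["residue_number"].tolist())` should list the unresolved stretches, assuming a single chain without insertion codes.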


---

# Generic Exercises - Critical Analysis

Now that you understand the basics, you'll work on more challenging problems that require **critical thinking**, **debugging**, and **biological interpretation**.

These exercises cannot be solved by simply asking AI to generate code - they require your judgment and understanding.

## Exercise 1: Code Debugging and Improvement

### Background

A colleague wrote code to find the structure with best resolution. However, **it has multiple bugs and doesn't work properly for real-world data**.

### The Buggy Code

In [None]:
# BUGGY CODE - Your job is to understand, fix, and improve it
def find_best_structure(uniprot_id):
    search_dict = {
        "query": {
            "type": "terminal",
            "service": "full_text",
            "parameters": {"value": uniprot_id},
        },
        "return_type": "entry",
    }

    response = requests.get(
        "https://search.rcsb.org/rcsbsearch/v2/query?json=" + json.dumps(search_dict)
    )
    data = response.json()

    pdb_ids = []
    for entry in data["result_set"]:
        pdb_ids.append(entry["identifier"])

    best_res = 0
    best_pdb = None

    for pdb_id in pdb_ids:
        info = pypdb.get_info(pdb_id)
        resolution = info["refine"][0]["ls_dres_high"]

        if resolution > best_res:
            best_res = resolution
            best_pdb = pdb_id

    return best_pdb

## Your Tasks

Analyze the buggy code by adding detailed comments to explain each section.
Identify and explain at least 3 bugs with their impacts, provide a corrected version with proper error handling,
and justify how your improvements make the code more robust for real-world usage.

## Your Work Area

In [None]:
# Task 1.1: Annotated version
# Below is the buggy code with detailed comments by you explaining each section
## Comment 1:
## This first section builds a search object for an API query that searches for
## protein structures in the PDB that correspond to a specific UniProt ID.

def find_best_structure_with_comments(uniprot_id):
    search_dict = {
        "query": {
            "type": "terminal",
            "service": "full_text",
            "parameters": {"value": uniprot_id},
        },
        "return_type": "entry",
    }

## Comment 2:
## This section sends the search query to the RCSB PDB database,
## receives the response, and converts it into a Python object.

    response = requests.get(
        "https://search.rcsb.org/rcsbsearch/v2/query?json=" + json.dumps(search_dict)
    )
    data = response.json()

## Comment 3:
## The code section collects all PDB IDs from the API response,
## creates placeholders for the best resolution (best_res) and the best structure
## (best_pdb), and prepares the selection of the best structure based on quality criteria.

    pdb_ids = []
    for entry in data["result_set"]:
        pdb_ids.append(entry["identifier"])

    best_res = 0
    best_pdb = None

## Comment 4:
## The code looks at each structure it finds and retrieves information about it
## from the database. It notes how sharp (i.e. what resolution) it is.

    for pdb_id in pdb_ids:
        info = pypdb.get_info(pdb_id)
        resolution = info["refine"][0]["ls_dres_high"]


## Comment 5:
## The program compares the resolution of each structure with the best value
## seen so far and always remembers the sharpest structure at that moment.
## At the end, it returns the ID of the best structure.

        if resolution > best_res:
            best_res = resolution
            best_pdb = pdb_id

    return best_pdb

### Task 1.2: Bugs Found

**Bug #1:**
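- **Location**: `response = requests.get("https://search.rcsb.org/rcsbsearch/v2/query?json=" + json.dumps(search_dict))`
- **What's wrong**: The JSON query is appended to the URL by string concatenation, and the response is parsed without checking the HTTP status.
- **Why incorrect**: GET itself works (the same call returns 11 structures earlier in this notebook), but a failed request or a response without a JSON body makes `response.json()` raise instead of giving a clear message. Sending the query as a POST body, as in the fixed version, avoids building the URL by hand.
- **Impact**: The function may stop with an unhelpful exception when the request is rejected or returns no results.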

**Bug #2:**
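- **Location**: `best_res = 0`
- **What's wrong**: Incorrect start value for the ‘best resolution’
- **Why incorrect**: Resolutions are positive numbers and smaller values are better. A start value of 0 only suits a search for the largest value; once the comparison is corrected to look for smaller values, no structure is ever below 0, so best_pdb stays None.
- **Impact**: Together with the `>` comparison, the function returns the structure with the worst resolution. With the comparison fixed but the start value unchanged, it would return None even though matching structures exist.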

**Bug #3:**
- **Location**: `if resolution > best_res:`
- **What's wrong**: Incorrect comparison direction
- **Why incorrect**: A higher resolution value (in Å) means poorer structural quality; the comparison therefore selects the worst structure as the ‘best’.
- **Impact**: The program delivers the wrong result (a less accurate structure).

In [13]:
# Task 1.3: Fixed version
# Here is the corrected and improved version of the code
def fixed_find_best_structure(uniprot_id):
    search_dict = {
        "query": {
            "type": "terminal",
            "service": "full_text",
            "parameters": {"value": uniprot_id},
        },
        "return_type": "entry",
    }

    response = requests.post(
        "https://search.rcsb.org/rcsbsearch/v2/query",
        json=search_dict
    )

    if response.status_code != 200:
        print("Error retrieving data:", response.status_code)
        return None

    data = response.json()


    pdb_ids = []
    for entry in data["result_set"]:
        pdb_ids.append(entry["identifier"])

    best_res = float("inf")
    best_pdb = None

    for pdb_id in pdb_ids:
        info = pypdb.get_info(pdb_id)
        # Skip entries with no metadata or no refinement block (e.g. NMR structures)
        if not info or not info.get("refine"):
            continue
        resolution = info["refine"][0].get("ls_dres_high")
        if resolution is None:
            continue

        if resolution < best_res:
            best_res = resolution
            best_pdb = pdb_id

    return best_pdb



### Task 1.4: Testing

I will test your improved method with an arbitrary UniProt ID to ensure it is generic and robust for different proteins.
The function should handle various edge cases and provide meaningful feedback regardless of the specific protein queried.
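
One way to make such testing easier is to separate the selection logic from the network calls, so that edge cases can be checked without querying the PDB. The sketch below assumes each entry has already been reduced to a `(pdb_id, resolution)` pair; `pick_best` and the sample records are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def pick_best(records: Iterable[Tuple[str, Optional[float]]]) -> Optional[str]:
    """Return the ID with the smallest resolution value, skipping missing ones."""
    best_id, best_res = None, float("inf")
    for pdb_id, resolution in records:
        if resolution is None:
            continue  # e.g. entries without a reported resolution
        if resolution < best_res:
            best_id, best_res = pdb_id, resolution
    return best_id


# Made-up records covering typical edge cases
print(pick_best([("AAAA", 2.5), ("BBBB", None), ("CCCC", 1.4)]))  # CCCC
print(pick_best([("DDDD", None)]))                                # None
print(pick_best([]))                                              # None
```

Returning None for both "no entries" and "no entries with a resolution" keeps the interface simple, at the cost of not telling the caller which of the two happened.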

---

## Exercise 2: Structure Quality Assessment

### Scenario

You're planning a drug design project and found four structures:

| PDB | Method | Res (Å) | R-work | R-free | Year | Ligand |
|-----|--------|---------|--------|--------|------|--------|
| 4IDK | X-ray | 1.65 | 0.21 | 0.24 | 2015 | imatinib |
| 6OMG | X-ray | 1.95 | 0.18 | 0.22 | 2025 | none |
| 7ABC | Cryo-EM | 3.2 | N/A | N/A | 2022 | none |
| 2LOL | NMR | N/A | N/A | N/A | 2018 | none |

### Your Tasks

#### Task 2.1: Drug Design Choice
Which structure for drug design? Justify considering:
- Resolution
- R-factors (what do they mean?)
- Ligand presence
- Method trade-offs

#### Task 2.2: Dynamics Understanding
Which for understanding protein flexibility? Justify considering:
- Discuss limitations of each
- Would you use multiple? Why?

## Your Work Area

### Task 2.1: Drug Design Structure

**My choice:**

**Justification:**

### Task 2.2: Dynamics Understanding

**My choice:**

**Justification:**


---


# Project Analysis Exercises


**Important Note: Project Context**

These exercises are **NOT** part of the generic tutorial exercises above. These are **project-specific exercises** that will be attached to your Jupyter notebook and customized based on your chosen protein system.

**Your Protein Selection**: You will choose a protein of interest that will be used for studies of molecular dynamics modeling and docking in the upcoming sessions. This protein choice is crucial as it will form the foundation of your final project.

**Note: These project exercises are specific to your research context and will be customized based on your chosen protein system. The exercises below serve as examples of the types of critical analysis skills you'll need to apply to your own dataset.**

# Exercise 4: Structure Comparison

## Task Overview

Select TWO structures from your project protein with:
- Same or different method
- Different ligands (or apo vs holo) if applicable

## Your Tasks

### Task 4.1: Selection & Metadata
Why these structures? Compare metrics.

### Task 4.2: Protein Visualization
Generate images with the following characteristics:

1. **Secondary Structure Coloring:**
    - Display the structure in cartoon representation and color it by secondary structure.

2. **Domain or Motif Coloring:**
    - Display the structure in cartoon representation.
    - Color it by domains or motifs. (Information obtained from the UniProt database, the PDB database, or the literature.)

3. **B-factor Coloring (X-ray) or NMR bundle visualization:**
    - For the X-ray structure, display the structure in cartoon representation and color it by B-factor.
    - For the NMR structure, visualize the bundle. (An NMR bundle is a set of structures that satisfy experimental data. This set of structures is reported within one PDB file.)

4. **Ligand or Heteroatom Analysis (if present):**
    - Zoom to the ligand or heteroatoms.
    - Visualize the amino acids involved in the interaction.

**Rules for images:**
- The images must be clear and informative.
- The images should be rendered at a resolution of at least 800x600 pixels.
- The image background must be white or transparent.
- The images should not contain the software interface (e.g., the PyMOL interface).
- Preferably use ray tracing for image rendering.

### Task 4.3: B-factor Analysis
Extract, plot, and **interpret biologically**:
- What do B-factors reveal about dynamics?
- Differences between structures?
- Functional implications?

### Task 4.4: Critical Evaluation
If choosing ONE as primary reference:
- Which? Why?
- Quality vs. biological relevance?
- Acknowledge limitations
- When would you need the other?

## Your Work Area