<a href="https://colab.research.google.com/github/bforsbe/SK2534/blob/main/protein_structure_databases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Protein Structure Databases

In this exercise we will explore a series of online tools and databases for protein sequences, structures, and oligomerization:

1. **UniProt** – the universal protein resource for sequences and annotations.  
2. **PDB (Protein Data Bank)** – repository of experimentally solved 3D protein structures.  
3. **PFAM** – database of protein families and conserved domains.  
4. **PISA (Protein Interfaces, Surfaces and Assemblies)** – predicts biological assemblies from PDB structures.  
5. **AlphaFold DB** – AI-predicted structures for almost every known protein.  
6. **Foldseek** – a structure-based search engine for fast fold comparisons.  

Each section has:
- A description of the database.
- **One or more questions** to test your ability to retrieve data.
- Expandable hints.
- Answer-check cells where you can type your answers (compared against a hash).
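
The answer checks never store the correct answer in plain text. A minimal sketch of the idea, mirroring the normalization used in the check cells of this notebook (numbers are formatted with five decimals, text is lowercased, and only the first six hex characters of the MD5 digest are compared). The names `normalize` and `short_hash` are illustrative and do not appear in the notebook itself:

```python
import hashlib


def normalize(answer: str) -> str:
    """Normalize an answer the same way the check cells do."""
    try:
        # Numeric answers are formatted with 5 decimal places
        return f"{float(answer):.5f}"
    except ValueError:
        # Everything else is treated as case-insensitive text
        return answer.lower()


def short_hash(answer: str, length: int = 6) -> str:
    """Return the first `length` hex characters of the MD5 digest."""
    return hashlib.md5(normalize(answer).encode()).hexdigest()[:length]


# "42" and "42.0" normalize to the same string, so they hash identically
print(normalize("42"), normalize("42.0"), normalize("A23K"))
print(short_hash("42") == short_hash("42.0"))
```

Because of this normalization, `42`, `42.0`, and `42.00` all count as the same answer, and text answers are not case-sensitive.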

## Part 1: UniProt – The Universal Protein Resource

**What it is:** UniProt is the primary resource for protein sequences and annotations. Each protein entry contains:
- The amino acid sequence.
- Functional annotations.
- Subcellular location.
- Post-translational modifications.
- Cross-links to other databases (PDB, PFAM, AlphaFold, etc.).

👉 Website: [https://www.uniprot.org](https://www.uniprot.org)
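
Entries can also be retrieved programmatically. The sketch below assumes UniProt's REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.fasta`; the helper names `fetch_fasta` and `parse_fasta` are illustrative, and the FASTA record in the example is made up so that the block runs without network access.

```python
import urllib.request

# Assumed endpoint pattern for downloading one entry in FASTA format
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download the FASTA record for one UniProt accession (needs network)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_fasta(fasta_text: str) -> tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in fasta_text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    header = lines[0][1:]          # drop the leading ">"
    sequence = "".join(lines[1:])  # sequence lines are simply concatenated
    return header, sequence


# Made-up record for illustration only; not a real UniProt entry
toy_record = ">sp|X00000|TOY_HUMAN Made-up example protein\nMKTAYIAKQR\nQISFVKSHFS\n"
header, sequence = parse_fasta(toy_record)
print(header)
print(len(sequence))
```

With network access, `parse_fasta(fetch_fasta("P68871"))` would return the header and sequence of the entry used in Question 1.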

**Question 1**

Find the length (in amino acids) of the human hemoglobin beta chain (UniProt P68871).

<details> <summary>Hint</summary> Search for **P68871** in UniProt. Look under "Sequence". </details>

In [10]:
#@title Q1 answer check
# Type your answer here; it is normalized, hashed, and compared against ref_hash
answer = "42" #@param{type:"string"}
ref_hash = "ab9c4e"  # first 6 hex characters of the MD5 hash of the correct answer

import hashlib

def get_hash(user_input):
    try:
        # Try converting to float (numbers get formatted with 5 decimal places)
        val = float(user_input)
        normalized = f"{val:.5f}"
    except ValueError:
        # Otherwise treat as string and lowercase it
        normalized = user_input.lower()

    # Compute MD5 hash
    md5_hash = hashlib.md5(normalized.encode()).hexdigest()

    if user_input == ref_hash and md5_hash[:6] != ref_hash:
        print("That was a poor attempt to cheat. That is not how md5sum works.")
        return None, None
    else:
        return normalized, md5_hash[:6]

_, md5_hash = get_hash(answer)

if md5_hash is None:
    pass
elif md5_hash == ref_hash:
    print(f"{answer} is correct!                                ({md5_hash})")
else:
    print(f"{answer} is NOT correct!                            ({md5_hash})")

42 is NOT correct!                            (2bb038)


**Question 2**

Mutations in a protein sequence are usually annotated as XNY, where X is the original amino acid, N is the position in the sequence, and Y is the mutated amino acid. For instance, mutation of an alanine at position 23 to a lysine would be denoted "A23K".
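
The XNY notation is easy to handle in code. The following is a minimal sketch, assuming one-letter amino acid codes and 1-based positions; `parse_mutation` and `apply_mutation` are illustrative names, and the sequence in the example is made up.

```python
import re

# One letter, a position, one letter (for example "A23K")
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(code: str) -> tuple[str, int, str]:
    """Split a mutation such as "A23K" into (original, position, mutated)."""
    match = MUTATION_RE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a valid XNY mutation: {code!r}")
    original, position, mutated = match.groups()
    return original, int(position), mutated


def apply_mutation(sequence: str, code: str) -> str:
    """Return the sequence with the mutation applied (positions are 1-based)."""
    original, position, mutated = parse_mutation(code)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != original:
        raise ValueError(f"residue {position} is {sequence[position - 1]}, not {original}")
    return sequence[:position - 1] + mutated + sequence[position:]


print(parse_mutation("A23K"))
print(apply_mutation("MAGT", "A2K"))  # made-up four-residue sequence
```

Checking that the original residue matches the sequence catches off-by-one errors, which are common when an annotation counts positions with or without the initiator methionine.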

Which mutation in human cytochrome c results in increased caspase activation?

<details> <summary>Hint</summary> Check the "Disease & Variants" section of the UniProt entry. </details>

In [9]:
#@title Q2 answer check
# Type your answer here; it is normalized, hashed, and compared against ref_hash
answer = "A23K" #@param{type:"string"}
ref_hash = "e90760"  # first 6 hex characters of the MD5 hash of the correct answer

import hashlib

def get_hash(user_input):
    try:
        # Try converting to float (numbers get formatted with 5 decimal places)
        val = float(user_input)
        normalized = f"{val:.5f}"
    except ValueError:
        # Otherwise treat as string and lowercase it
        normalized = user_input.lower()

    # Compute MD5 hash
    md5_hash = hashlib.md5(normalized.encode()).hexdigest()

    if user_input == ref_hash and md5_hash[:6] != ref_hash:
        print("That was a poor attempt to cheat. That is not how md5sum works.")
        return None, None
    else:
        return normalized, md5_hash[:6]

_, md5_hash = get_hash(answer)

if md5_hash is None:
    pass
elif md5_hash == ref_hash:
    print(f"{answer} is correct!                            ({md5_hash})")
else:
    print(f"{answer} is NOT correct!                        ({md5_hash})")

A23K is NOT correct!                              (cf1147)


**Question 3**

Motifs are recognizable features shared across proteins. They are short sequences that evolution has reused and repurposed rather than reinvented.

What is the destruction motif in human NF-kappa-B inhibitor alpha?

<details> <summary>Hint</summary> P25963. Family & Domains. </details>

In [14]:
#@title Q3 answer check
# Type your answer here; it is normalized, hashed, and compared against ref_hash
answer = "SGAGAG" #@param{type:"string"}
ref_hash = "5bb648"  # first 6 hex characters of the MD5 hash of the correct answer

import hashlib

def get_hash(user_input):
    try:
        # Try converting to float (numbers get formatted with 5 decimal places)
        val = float(user_input)
        normalized = f"{val:.5f}"
    except ValueError:
        # Otherwise treat as string and lowercase it
        normalized = user_input.lower()

    # Compute MD5 hash
    md5_hash = hashlib.md5(normalized.encode()).hexdigest()

    if user_input == ref_hash and md5_hash[:6] != ref_hash:
        print("That was a poor attempt to cheat. That is not how md5sum works.")
        return None, None
    else:
        return normalized, md5_hash[:6]

_, md5_hash = get_hash(answer)

if md5_hash is None:
    pass
elif md5_hash == ref_hash:
    print(f"{answer} is correct!                            ({md5_hash})")
else:
    print(f"{answer} is NOT correct!                        ({md5_hash})")

SGAGAG is NOT correct!                        (f3b116)


## Part 2: PDB – The Protein Data Bank

The PDB is the central repository for experimentally determined structures of proteins, nucleic acids, and complexes. Structures are solved mainly by **X-ray crystallography**, **NMR spectroscopy**, or **cryo-EM**.

👉 Website: [https://www.rcsb.org](https://www.rcsb.org)
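
The same metadata shown on an entry page can be read as JSON. The sketch below assumes the RCSB Data API endpoint `https://data.rcsb.org/rest/v1/core/entry/<PDB ID>` and the field names `exptl[].method` and `rcsb_entry_info.resolution_combined`; verify both against the current API documentation. `fetch_entry` and `summarize_entry` are illustrative names, and the dictionary in the example holds made-up values so that the block runs offline.

```python
import json
import urllib.request

# Assumed endpoint pattern of the RCSB Data API
RCSB_ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"


def fetch_entry(pdb_id: str, timeout: float = 30.0) -> dict:
    """Download the entry-level JSON record for one PDB ID (needs network)."""
    url = RCSB_ENTRY_URL.format(pdb_id=pdb_id.upper())
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_entry(entry: dict) -> dict:
    """Pull out the experimental method(s) and resolution value(s), if present."""
    methods = [item.get("method") for item in entry.get("exptl", [])]
    # NMR entries have no resolution, so fall back to an empty list
    resolution = entry.get("rcsb_entry_info", {}).get("resolution_combined") or []
    return {"methods": methods, "resolution": list(resolution)}


# Made-up record with the assumed layout; the values are not from a real entry
toy_entry = {
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
    "rcsb_entry_info": {"resolution_combined": [2.0]},
}
print(summarize_entry(toy_entry))
```

With network access, `summarize_entry(fetch_entry("1CRN"))` would report the method and resolution for the entry used in the question below.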

### Question 1
Find the **resolution (in Å)** of the structure with PDB ID **1CRN** (crambin).

<details>
<summary>Hint</summary>
Use the search bar on the RCSB site, enter **1CRN**, and look at the "Experimental Data Snapshot".
</details>

In [None]:
# CHECK CELL
# answer = "your_answer_here"

## Part 3: PFAM – Protein families and domains

PFAM groups protein sequences into **families** based on conserved domains and motifs. It helps us understand the modularity of proteins.

👉 Website: [http://pfam.xfam.org](http://pfam.xfam.org)

### Question 2
What is the **PFAM domain name** that defines the globin fold (hemoglobin, myoglobin, etc.)?

<details>
<summary>Hint</summary>
Search for "hemoglobin" in PFAM and find the domain entry for the globin fold.
</details>

In [None]:
# CHECK CELL
# answer = "your_answer_here"

## Part 4: PISA – Protein Interfaces, Surfaces, Assemblies

PISA analyzes **oligomeric states** of proteins based on structural data. It predicts whether the biological unit is a monomer, dimer, tetramer, etc.

👉 Website: [https://www.ebi.ac.uk/pdbe/pisa/](https://www.ebi.ac.uk/pdbe/pisa/)

### Question 3
For **hemoglobin (PDB ID 1A3N)**, what **oligomeric state** does PISA report?

<details>
<summary>Hint</summary>
Search for 1A3N in PISA, then check the "Assembly" tab.
</details>

In [None]:
# CHECK CELL
# answer = "your_answer_here"

## Part 5: AlphaFold DB

AlphaFold DB contains AI-predicted structures for over 200 million protein sequences, covering most of UniProt.  
- Predictions are scored by **pLDDT**, a per-residue confidence score from 0 to 100 that estimates the local accuracy of the structure.  
- Remember that predictions are models, not experiments.  

👉 Website: [https://alphafold.ebi.ac.uk](https://alphafold.ebi.ac.uk)
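
AlphaFold DB model files in PDB format store the pLDDT of each residue in the B-factor column, so a confidence summary can be computed directly from the file. This is a minimal sketch that assumes the standard fixed-column PDB layout (atom name in columns 13-16, B-factor in columns 61-66) and counts one value per C-alpha atom; `plddt_per_residue` and `fraction_confident` are illustrative names.

```python
def plddt_per_residue(pdb_text: str) -> list[float]:
    """Return one pLDDT value per residue, read from the C-alpha ATOM records."""
    scores = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name; columns 61-66 hold the B-factor,
        # which AlphaFold DB files use to store pLDDT.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def fraction_confident(pdb_text: str, threshold: float = 70.0) -> float:
    """Fraction of residues whose pLDDT is strictly above the threshold."""
    scores = plddt_per_residue(pdb_text)
    if not scores:
        raise ValueError("no C-alpha atoms found")
    return sum(score > threshold for score in scores) / len(scores)
```

Passing the text of a downloaded AlphaFold DB `.pdb` file to `fraction_confident` gives the fraction of residues with pLDDT above 70.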

### Question 4
What is the **predicted structure coverage** (fraction of the sequence modeled with high confidence, pLDDT > 70) for **human p53** (UniProt P04637)?

<details>
<summary>Hint</summary>
Search "P04637" in AlphaFold DB, then look at the sequence viewer color-coding and confidence summary.
</details>

In [None]:
# CHECK CELL
# answer = "your_answer_here"

### Visualization of AlphaFold models in Colab

You can fetch an AlphaFold model (from the EBI servers) and view it using **py3Dmol**:

In [None]:
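# In Colab, install the viewer first if needed: !pip install py3Dmol
import py3Dmol
import requests

def show_alphafold(uniprot_id):
    """Download the AlphaFold model for a UniProt accession and display it."""
    # The file name follows the v4 naming scheme; adjust the suffix if the release changes
    url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on a failed download instead of rendering an error page
    view = py3Dmol.view(width=400, height=400)
    view.addModel(response.text, "pdb")
    view.setStyle({"cartoon": {"color": "spectrum"}})
    view.zoomTo()
    return view.show()

# Example: p53
show_alphafold("P04637")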

## Part 6: Foldseek

Foldseek is a **structure-based search engine**. It allows you to query a protein structure against a database of known structures and find **structural neighbors**, even if the sequences are unrelated.

👉 Website: [https://search.foldseek.com](https://search.foldseek.com)

### Question 5
Upload the AlphaFold structure of human p53 (P04637) to Foldseek.  
Which **structural motif or fold** is most similar to its DNA-binding domain?

<details>
<summary>Hint</summary>
Foldseek will give you a ranked list of similar structures. Look for the annotation of the top hit.
</details>

In [None]:
# CHECK CELL
# answer = "your_answer_here"

# Wrap-up

Through this notebook, you learned how to:
- Use UniProt to look up protein sequences and annotations.  
- Use the PDB to find structural data.  
- Use PFAM to identify conserved domains.  
- Use PISA to analyze oligomeric states.  
- Access AlphaFold DB for predicted structures.  
- Visualize AlphaFold models with code.  
- Use Foldseek to find structural neighbors.  

This toolbox will help you **explore protein structure and oligomerization/polymerization** — and maybe even “stump the teacher” with your findings!