
## **Workshop 11: PROTEIN PREDICTION**
# Protein Secondary Structure Prediction
### *Fatemeh Shabaninejad*
### *Professor: Dr. Reza Rezazadegan*
---

**Course-level / Seminar Presentation Notebook**



## 1. Introduction

Proteins are fundamental biological macromolecules whose functions are determined by their three-dimensional structures.  
Protein structure is hierarchically organized into **four levels**:

1. Primary structure (amino acid sequence)  
2. Secondary structure (local conformations)  
3. Tertiary structure (overall 3D fold)  
4. Quaternary structure (multi-subunit assemblies)

This notebook focuses on **protein secondary structure prediction**, a key intermediate step between sequence analysis and full 3D structure determination.



## 2. What is Protein Secondary Structure?

Protein secondary structures are **stable, local spatial conformations** of the polypeptide backbone.

The main secondary structure elements (SSEs) are:

- **α-helix (H)**
- **β-sheet composed of β-strands (E)**
- **Coil / loop regions (C)**

Approximately **50% of residues** in globular proteins adopt either α-helical or β-strand conformations.



## 3. α-Helix Structure

Key characteristics of α-helices:

- Right-handed helical structure
- **3.6 amino acids per turn**
- Stabilized by **hydrogen bonds between residue i and i+4**
- Side chains project outward from the helix axis

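**Proline** is generally absent from the middle of α-helices because its rigid ring restricts backbone rotation and its backbone nitrogen has no hydrogen to donate to the i → i+4 hydrogen bond, but it may occur at **helix termini**.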
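
The geometry above can be made concrete with a few lines of Python. This is a minimal sketch for an idealized helix only; the function names `helix_hbond_pairs` and `wheel_angles` are made up for this notebook and do not come from any library.

```python
# Idealized alpha-helix geometry (illustrative helper names, not a library API).
RESIDUES_PER_TURN = 3.6
DEGREES_PER_RESIDUE = 360 / RESIDUES_PER_TURN  # about 100 degrees per residue


def helix_hbond_pairs(n_residues):
    """Return 0-based (i, i+4) pairs: C=O of residue i bonds to N-H of residue i+4."""
    return [(i, i + 4) for i in range(n_residues - 4)]


def wheel_angles(n_residues):
    """Angular position of each residue around the helix axis (helical wheel)."""
    return [round((i * DEGREES_PER_RESIDUE) % 360, 6) for i in range(n_residues)]


print(helix_hbond_pairs(8))
print(wheel_angles(5))
```

The first and last four residues of a helix take part in only one of these backbone hydrogen bonds, which is one reason helix ends are less regular than the middle.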




## 4. β-Sheet Structure

β-sheets consist of two or more **β-strands** arranged in:

- Parallel
- Antiparallel
- Mixed orientations

Features:
- Extended zigzag backbone conformation
- Stabilized by **inter-strand hydrogen bonds**
- Hydrogen bonds often connect residues that are far apart in the primary sequence (**long-range interactions**)

Surface β-strands show alternating **hydrophobic/hydrophilic** residues,  
while buried β-strands are predominantly **hydrophobic**.
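
The alternation pattern can be checked with a short script. This is only a sketch: the set of residues treated as hydrophobic is an assumption chosen for illustration, and both example sequences are invented.

```python
# Assumed hydrophobic grouping; other scales would classify some residues differently.
HYDROPHOBIC = set("AVILMFWC")


def polarity_pattern(seq):
    """Map each residue to 'h' (hydrophobic) or 'p' (polar/charged)."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in seq.upper())


def alternation_fraction(seq):
    """Fraction of neighbouring residue pairs that switch between 'h' and 'p'."""
    pattern = polarity_pattern(seq)
    if len(pattern) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(pattern, pattern[1:]))
    return switches / (len(pattern) - 1)


print(polarity_pattern("VKVTVEV"))      # hypothetical surface strand
print(alternation_fraction("VKVTVEV"))
print(alternation_fraction("VIVLVAV"))  # hypothetical buried strand
```

A fraction near 1 suggests a strand with one face exposed to solvent, while a fraction near 0 with mostly `h` suggests a buried strand.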

https://youtu.be/AYRMQ4RwVSw?si=TvaLg9YPc6g4-zas


![ChatGPT Image Dec 26, 2025, 12_03_01 PM.png](<attachment:ChatGPT Image Dec 26, 2025, 12_03_01 PM.png>)


## 5. Protein secondary structure prediction
Protein secondary structure prediction is the computational process of assigning a conformational state to each amino acid residue in a protein sequence.
Each residue is classified into one of three possible structural states:

- **H** – α-helix
- **E** – β-strand
- **C** – coil / irregular

Prediction is possible because secondary structure elements show regular, recurring amino acid patterns and are stabilized by specific backbone hydrogen-bonding interactions.
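
In practice a prediction is just a string of H/E/C labels with the same length as the sequence. The sketch below shows one way to turn such a string into segments; the function name `segments` and the example label string are illustrative assumptions.

```python
from itertools import groupby


def segments(ss):
    """Split a label string such as 'CCHHHHCEEE' into (state, start, end) runs (0-based, inclusive)."""
    out = []
    start = 0
    for state, run in groupby(ss):
        length = len(list(run))
        out.append((state, start, start + length - 1))
        start += length
    return out


print(segments("CCHHHHCEEE"))
```

Working with segments rather than single residues makes it easier to see errors such as shifted helix ends or missed short strands.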




## 6. Why Predict Secondary Structure?

Protein secondary structure is predicted because it provides essential insight into protein folding, function, and evolution.
Secondary structure elements are often more conserved than amino acid sequences, making them valuable for identifying structural and functional similarities between proteins.

Predicting secondary structure helps to:

- Classify proteins into structural families
- Identify protein domains and functional motifs
- Improve sequence alignment, especially for distantly related proteins
- Guide tertiary (three-dimensional) structure prediction
- Support protein engineering and drug discovery

Because experimental determination of protein structure is time-consuming and expensive, computational secondary structure prediction offers a fast and cost-effective alternative.



## 7. Generations of Prediction Methods

Secondary structure prediction methods are traditionally classified into three generations:

### First Generation – Ab Initio (Single-Sequence Based)

**What is it?**

First-generation methods predict protein secondary structure using **only a single amino acid sequence** and **statistical propensities** of individual amino acids to form secondary structure elements such as **α-helices, β-sheets, and coils**.

**How do they work?**
- Rely on statistics derived from experimentally solved protein structures.
- Each amino acid is assigned a probability of forming a specific structure.

**Advantages and Limitations**
- ✔ Simple and fast
- ❌ No evolutionary information
- ❌ Low accuracy (~50–60%)
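
To illustrate how such statistics can be derived, the sketch below computes a propensity as the fraction of a residue type found in a state divided by the overall fraction of residues in that state. The two training pairs are invented toy data, and `propensities` is a hypothetical helper, so the resulting numbers have no biological meaning.

```python
from collections import Counter


def propensities(pairs, state):
    """pairs: (sequence, labels) strings of equal length. Returns residue -> propensity for `state`."""
    aa_total, aa_in_state = Counter(), Counter()
    n_total = n_state = 0
    for seq, labels in pairs:
        for aa, label in zip(seq, labels):
            aa_total[aa] += 1
            n_total += 1
            if label == state:
                aa_in_state[aa] += 1
                n_state += 1
    if n_state == 0:
        return {}
    background = n_state / n_total  # overall fraction of residues in this state
    return {aa: (aa_in_state[aa] / aa_total[aa]) / background for aa in aa_total}


toy_data = [("AAGG", "HHCC"), ("AEGP", "HHCC")]  # invented sequences and labels
helix_propensity = propensities(toy_data, "H")
print(helix_propensity)
```

A value above 1 means the residue occurs in that state more often than expected by chance, and a value below 1 means it is under-represented.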

### Second Generation – Improved Local Statistical Methods

**What is it?**

Second-generation methods improve prediction accuracy by incorporating the **local sequence environment**, meaning neighboring residues influence predictions.

**How do they work?**
- Use improved statistical models such as the **GOR method**.
- Analyze interactions between nearby residues.

**Advantages and Limitations**
- ✔ Better accuracy than first generation (~60–65%)
- ❌ Still no evolutionary information

### Third Generation – Evolutionary Information and Machine Learning

**What is it?**

Third-generation methods use **multiple sequence alignments (MSA)** to extract **evolutionary information** from homologous proteins.

**How do they work?**
- PSI-BLAST identifies homologous sequences.
- Evolutionary profiles (PSSMs) are generated.
- Machine learning models (Neural Networks, HMMs) perform prediction.

**Advantages and Limitations**
- ✔ High accuracy (~75–80%)
- ✔ Captures conserved structural features
- ❌ Errors still occur, especially in complex β-sheet regions

**Representative Tool:** PSIPRED
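
The idea of an evolutionary profile can be sketched with a toy alignment. The code below is a deliberately simplified position-specific scoring matrix: it uses add-one pseudocounts and a uniform background, whereas PSI-BLAST applies sequence weighting and more elaborate pseudocounts. The alignment and the function name `simple_pssm` are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simple_pssm(alignment, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2 odds against a uniform background."""
    background = 1 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        # Gaps and unknown characters are simply ignored in this sketch.
        column = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        denom = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


toy_msa = ["ALKE", "ALRE", "AIKD", "AL-E"]  # invented, equal-length aligned sequences
pssm = simple_pssm(toy_msa)
print(round(pssm[0]["A"], 2), round(pssm[0]["W"], 2))
```

Conserved columns give strongly positive scores for the conserved residue, and a predictor would typically take a window of such columns as its input features.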



## 8. Evaluating Prediction Accuracy

To objectively compare different secondary structure prediction methods, **standardized evaluation metrics** are required. The most widely used metric is **Q3 accuracy**, the percentage of residues correctly predicted as H, E, or C.

The best third-generation methods reach about **79–80%** Q3. The remaining errors are mostly:
- Helix/strand boundary shifts
- Missed short secondary structure elements

## 9. Q3 Accuracy

### What is Q3 Accuracy?

Q3 accuracy measures the **percentage of amino acid residues correctly assigned** to one of the three major secondary structure classes:

- **α-helix (H)**
- **β-strand (E)**
- **coil / loop (C)**

It compares predicted structures against experimentally determined reference structures.

### Q3 Accuracy Formula

$$
Q_3 = \frac{N_H + N_E + N_C}{N_{\text{total}}} \times 100
$$

Where:
- $N_H$: correctly predicted helix residues
- $N_E$: correctly predicted strand residues
- $N_C$: correctly predicted coil residues
- $N_{\text{total}}$: total number of residues
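
The formula translates directly into code. The sketch below is a minimal implementation under the assumption that both strings use only H, E, and C; the function name `q3_score` and the two label strings are invented for illustration.

```python
def q3_score(predicted, observed):
    """Return (Q3 in percent, per-state counts of correctly predicted residues)."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    correct = {"H": 0, "E": 0, "C": 0}
    for p, o in zip(predicted, observed):
        if p == o:
            correct[o] = correct.get(o, 0) + 1
    return 100.0 * sum(correct.values()) / len(observed), correct


observed = "CCHHHHHHCCEEEECC"   # invented reference assignment
predicted = "CCHHHHHCCCEEECCC"  # invented prediction with two boundary errors
score, counts = q3_score(predicted, observed)
print(score, counts)  # 87.5 {'H': 5, 'E': 3, 'C': 6}
```

Both errors in this example are boundary shifts: the helix and the strand are each predicted one residue too short, yet Q3 still looks high.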

### Typical Q3 Accuracy Values

- Early statistical methods: **~60–65%**
- Modern machine-learning methods: **~75–80%**
- Advanced approaches may exceed **85%** on benchmark datasets under specific conditions

Q3 is usually reported on benchmark datasets such as **CB513** or on target sets from the **CASP** experiments.

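## 10. Complementary Accuracy Metrics

Q3 scores each residue independently, so it cannot tell whether whole segments are recovered, and it merges several distinct conformations into only three classes. Additional metrics are commonly used to address these limitations: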

### Segment Overlap Score (SOV)
- Measures overlap between predicted and actual secondary structure segments
- Accounts for segment length, boundaries, and continuity

### Q8 Accuracy
- Uses 8 secondary structure states (e.g., 3₁₀-helix, π-helix, β-bridge)
- Provides finer structural detail but is more challenging due to class imbalance
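
Eight-state assignments are usually reduced to three states before computing Q3. The mapping below is one common convention for DSSP-style letters, shown here as an assumption since other reductions are also in use; `reduce_to_q3` is a hypothetical helper name.

```python
# One common 8-to-3 state reduction (an assumed convention; alternatives exist).
DSSP_TO_Q3 = {
    "H": "H",  # alpha-helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi-helix
    "E": "E",  # extended strand
    "B": "E",  # isolated beta-bridge
    "T": "C",  # turn
    "S": "C",  # bend
    "-": "C",  # no assignment
    "C": "C",
}


def reduce_to_q3(dssp_string):
    """Map an 8-state string to H/E/C; unknown symbols fall back to coil."""
    return "".join(DSSP_TO_Q3.get(symbol, "C") for symbol in dssp_string)


print(reduce_to_q3("-HHHHGGGTT-EEEEB-SS"))  # CHHHHHHHCCCEEEEECCC
```

Because the choice of reduction changes the reference labels, Q3 values are only comparable between studies that use the same mapping.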



![ChatGPT Image Dec 26, 2025, 12_47_58 PM.png](<attachment:ChatGPT Image Dec 26, 2025, 12_47_58 PM.png>)



# Important Algorithms and Programs

## Practical Examples for Each Method

---

## **1 Chou–Fasman**

### (Propensity-based, First Generation)

### Example sequence:

```
A L A E A A K L A A
```

### What Chou–Fasman does:

* Each amino acid has predefined **propensity values** for:

  * α-helix
  * β-strand
  * turn
* The method scans the sequence using a **sliding window**
* It checks whether the **average helix propensity** exceeds a threshold

### Observation:

* Alanine (A), Leucine (L), Glutamate (E), Lysine (K) all favor α-helix
* Many helix-favoring residues cluster together

### Prediction:

```
HHHHHHHHHH
```

**Interpretation:**
This region is predicted as an **α-helix** purely from intrinsic residue preferences.
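
The example can be reproduced with a reduced version of the method. This sketch keeps only the sliding-window average of helix propensities; the full Chou–Fasman procedure also has nucleation and extension rules and competing strand and turn propensities. The propensity values are commonly quoted Chou–Fasman helix parameters for a handful of residues, and the window size and threshold are assumptions chosen for illustration.

```python
# Commonly quoted Chou-Fasman helix propensities for a few residues (illustrative subset).
HELIX_PROPENSITY = {"A": 1.42, "L": 1.21, "E": 1.51, "K": 1.16, "G": 0.57, "P": 0.57}


def helix_windows(seq, window=6, threshold=1.03):
    """Label every residue inside a window whose mean helix propensity reaches the threshold."""
    labels = ["C"] * len(seq)
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        # Residues missing from the table are treated as neutral (1.0) in this sketch.
        mean = sum(HELIX_PROPENSITY.get(aa, 1.0) for aa in chunk) / window
        if mean >= threshold:
            labels[start:start + window] = "H" * window
    return "".join(labels)


print(helix_windows("ALAEAAKLAA"))  # HHHHHHHHHH
print(helix_windows("GPGPGPGP"))    # CCCCCCCC
```

Every window in the example sequence contains only strong helix formers, so the whole stretch is labelled H, while the glycine/proline control never reaches the threshold.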

---

## **2 GOR Series**

### (Information-theory based, Second Generation)

### Example sequence:

```
V T K A L E E A L K
```

### What GOR does differently:

* Still single-sequence based
* But includes **neighboring residues** (typically ±8 residues, i.e. a 17-residue window)
* Uses **information theory** to calculate how residue combinations affect structure

### Example logic:

* Central residue = A
* Neighbors include L, E, K
* Probability of helix given this context is high

### Prediction:

```
HHHHHHHHHH
```

**Difference from Chou–Fasman:**
GOR considers **local sequence context**, not just individual residues.
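
A GOR-style calculation can be sketched as follows. This is a simplified, GOR-I-like variant: for each state it sums log-odds contributions from every residue in the window around the central position. The single training pair is invented, the smoothing is a plain add-one pseudocount, and the function names are hypothetical, so the output only illustrates the mechanics.

```python
import math
from collections import Counter

STATES = "HEC"
HALF_WINDOW = 8        # 17-residue window
N_AMINO_ACIDS = 20


def train_gor(pairs):
    """Count how often each residue appears at each offset from a position in each state."""
    joint, state_n, residue_n = Counter(), Counter(), Counter()
    for seq, labels in pairs:
        for j, state in enumerate(labels):
            state_n[state] += 1
            residue_n[seq[j]] += 1
            for m in range(-HALF_WINDOW, HALF_WINDOW + 1):
                if 0 <= j + m < len(seq):
                    joint[(state, m, seq[j + m])] += 1
    return joint, state_n, residue_n


def information(model, state, m, residue, pseudo=1.0):
    """Smoothed log-odds of seeing `residue` at offset m when the centre is in `state`."""
    joint, state_n, residue_n = model
    total = sum(residue_n.values())
    p_given_state = (joint[(state, m, residue)] + pseudo) / (state_n[state] + pseudo * N_AMINO_ACIDS)
    p_background = (residue_n[residue] + pseudo) / (total + pseudo * N_AMINO_ACIDS)
    return math.log(p_given_state / p_background)


def predict_gor(model, seq):
    """Assign each position the state with the highest summed window information."""
    out = []
    for j in range(len(seq)):
        scores = {
            state: sum(
                information(model, state, m, seq[j + m])
                for m in range(-HALF_WINDOW, HALF_WINDOW + 1)
                if 0 <= j + m < len(seq)
            )
            for state in STATES
        }
        out.append(max(STATES, key=scores.get))
    return "".join(out)


toy_training = [("VTKALEEALKGSGVIVYVTW", "CCHHHHHHHHCCCEEEEEEC")]  # invented data
model = train_gor(toy_training)
prediction = predict_gor(model, "VTKALEEALK")
print(prediction)
```

With realistic training data the counts would come from thousands of solved structures; the point here is only that each neighbour contributes its own term to the score of the central residue.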

---

## **3 PREDATOR**

### (Nearest-Neighbor + limited long-range effects)

### Example sequence:

```
I V Y F T W L
```

### What PREDATOR does:

1. Breaks the sequence into short fragments
2. Searches a database of known structures
3. Finds **similar fragments**
4. Attempts to account for **β-strand pairing preferences**

### Observation:

* Similar fragments are frequently found in β-sheets
* Aromatic residues (Y, F, W) often appear in β-strands

### Prediction:

```
EEEEEEE
```

**Key point:**
PREDATOR was one of the first methods to **attempt** to model non-local β-strand interactions, although it captures them only partially.
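
The fragment-lookup idea in steps 1 to 3 can be sketched as a generic nearest-neighbor predictor. This is not PREDATOR itself, which relies on local pairwise alignments and hydrogen-bonding propensities; the fragment database, the identity-based similarity, and the thresholds below are all invented for illustration.

```python
from collections import Counter

# Hypothetical database of (fragment, known secondary structure) pairs.
FRAGMENT_DB = [
    ("IVYYT", "EEEEE"),
    ("VWFTW", "EEEEE"),
    ("TFTWL", "EEEEE"),
    ("ALEEA", "HHHHH"),
    ("GSGNG", "CCCCC"),
]


def identity(a, b):
    """Number of positions at which two equal-length fragments agree."""
    return sum(x == y for x, y in zip(a, b))


def predict_nearest_neighbor(seq, k=5, min_identity=3):
    """Let the most similar database fragment vote on each residue it covers."""
    votes = [Counter() for _ in seq]
    for start in range(len(seq) - k + 1):
        query = seq[start:start + k]
        fragment, labels = max(FRAGMENT_DB, key=lambda entry: identity(query, entry[0]))
        if identity(query, fragment) < min_identity:
            continue  # no sufficiently similar fragment, so this window casts no vote
        for offset, state in enumerate(labels):
            votes[start + offset][state] += 1
    # Residues without any vote default to coil.
    return "".join(v.most_common(1)[0][0] if v else "C" for v in votes)


print(predict_nearest_neighbor("IVYFTWL"))  # EEEEEEE
```

Each overlapping window finds a similar strand fragment, so every residue in the example collects only E votes.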

---


## **4 HMM-Based Programs**

### (Hidden Markov Models, Third Generation)

### Example sequence:

```
A A A A A G G Y V I V V
```

### States:

* H = helix
* E = strand
* C = coil

### HMM logic:

* Helices usually have a **minimum length**
* Transitions like H → H are more likely than H → E

### Most probable path:

One state is assigned per residue, so the path has 12 states for the 12-residue sequence:

```
H H H H H C C E E E E E
```

### Final prediction:

```
HHHHHCCEEEEE
```

**Advantage:**
HMMs enforce **biologically realistic segment lengths and transitions**.
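
The most probable path is usually found with the Viterbi algorithm. The sketch below uses a three-state model with invented start, transition, and emission probabilities, so it only illustrates the decoding step. A three-state model like this captures transition preferences but not strict minimum segment lengths, which real programs typically model with chains of several states per element.

```python
import math

STATES = "HEC"
# All probabilities below are hypothetical values chosen for illustration.
START = {"H": 1 / 3, "E": 1 / 3, "C": 1 / 3}
TRANS = {
    "H": {"H": 0.90, "E": 0.01, "C": 0.09},
    "E": {"H": 0.01, "E": 0.85, "C": 0.14},
    "C": {"H": 0.10, "E": 0.10, "C": 0.80},
}
EMIT = {
    "H": {"A": 0.30, "G": 0.02, "V": 0.05, "I": 0.05, "Y": 0.03},
    "E": {"A": 0.04, "G": 0.02, "V": 0.25, "I": 0.20, "Y": 0.15},
    "C": {"A": 0.05, "G": 0.30, "V": 0.03, "I": 0.03, "Y": 0.04},
}
DEFAULT_EMISSION = 0.05  # used for residues not listed above


def emit(state, residue):
    return math.log(EMIT[state].get(residue, DEFAULT_EMISSION))


def viterbi(seq):
    """Return the most probable H/E/C state path for seq under the toy model."""
    if not seq:
        return ""
    score = [{s: math.log(START[s]) + emit(s, seq[0]) for s in STATES}]
    back = []
    for residue in seq[1:]:
        row, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this position.
            prev = max(STATES, key=lambda p: score[-1][p] + math.log(TRANS[p][s]))
            row[s] = score[-1][prev] + math.log(TRANS[prev][s]) + emit(s, residue)
            pointers[s] = prev
        score.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score[-1].get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAGGYVIVV"))
```

Working in log space avoids numerical underflow on long sequences, and the high self-transition probabilities are what discourage isolated single-residue helices or strands.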

---

## Summary


| Method      | Practical idea                     |
| ----------- | ---------------------------------- |
| Chou–Fasman | Residue propensities               |
| GOR         | Local context + information theory |
| PREDATOR    | Fragment similarity                |
| HMM methods | State transition modeling          |




$$\large\text{Thank you}$$