<a href="https://colab.research.google.com/github/bforsbe/SK2534/blob/main/pKa_pI_protonation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Charged residues, pH, and protonation equilibria

Proteins rely heavily on **charged amino acids** for their structure and function:

- **Asp, Glu**: usually negatively charged (deprotonated) at neutral pH  
- **Lys, Arg, His**: often positively charged (protonated)  
- **Cys, Tyr**: can ionize depending on conditions (we will get back to this)  

These residues are at the center of many specific interactions, such as charged protein-protein interfaces and the binding of metal ions or ligands (like drug compounds). It is thus important to understand them a bit better, especially if we are going to simulate them. Their charge state is important for, e.g.:
- Folding stability  
- Enzyme catalysis  
- Protein–protein and protein–DNA interactions  

So what determines whether they are charged or not? Their charge is fundamentally a chemical equilibrium centered on protonation, which makes it dependent on pH. In fact,

> positively charged amino acids are bases

and

> negatively charged amino acids are acids

This is perhaps not so surprising given the name "amino **acid**", but it deserves to be mentioned. Given this, let's recap the basic (no pun intended) chemistry of pH.

### What is pH?

pH is a measure of the acidity or basicity of an aqueous solution. It is defined as the negative base-10 logarithm of the activity of hydrogen ions (H⁺) in the solution. In dilute solutions, the activity is approximated by the molar concentration.

$$ \text{pH} = -\log_{10}[\text{H}^+] $$

A lower pH indicates a higher concentration of H⁺ ions and thus a more acidic solution, while a higher pH indicates a lower concentration of H⁺ ions and a more basic (alkaline) solution. A pH of 7 is considered neutral.

The presence of H⁺ ions in water arises from the **protolysis** (or autoionization) of water:

$$ \text{2H}_2\text{O} \rightleftharpoons \text{H}_3\text{O}^+ + \text{OH}^- $$

In pure water at 25°C, the concentrations of hydrogen ions (H⁺) and hydroxide ions (OH⁻) are equal, each at $10^{-7}\,\text{M}$. Thus, the pH of pure water is $-\log_{10}(10^{-7}) = 7$.

>It is worthwhile to note that free hydrogen ions rarely occur; they rather form a metastable hydronium ion H₃O⁺, where the extra hydrogen tends to jump between adjacent water molecules. We might therefore refer to them as such from now on.

Adding an acid increases the concentration of H⁺ (and thus H₃O⁺), lowering the pH. For example, a 0.1 M solution of a strong acid like HCl will have $[\text{H}^+] \approx 0.1\,\text{M}$, and thus a pH of $-\log_{10}(0.1) = 1$.

Adding a base increases the concentration of OH⁻, which in turn reduces the concentration of H₃O⁺ (due to the water protolysis equilibrium), increasing the pH. For example, a 0.01 M solution of a strong base like NaOH will have $[\text{OH}^-] \approx 0.01\,\text{M}$. Since $[\text{H}_3\text{O}^+][\text{OH}^-] = 10^{-14}$ at 25°C, $[\text{H}_3\text{O}^+] = 10^{-14} / 0.01 = 10^{-12}\,\text{M}$, and the pH will be $-\log_{10}(10^{-12}) = 12$.
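
The two worked examples above can be reproduced in a few lines. This is a minimal sketch that assumes ideal dilute solutions and complete dissociation of the strong acid and base; the helper names `ph_from_h` and `ph_from_oh` are made up for illustration.

```python
import math

KW_25C = 1e-14  # ion product of water at 25 °C, as used in the text


def ph_from_h(h_conc):
    """pH from the hydronium concentration in mol/L (dilute-solution approximation)."""
    return -math.log10(h_conc)


def ph_from_oh(oh_conc, kw=KW_25C):
    """pH from the hydroxide concentration, using [H3O+][OH-] = Kw."""
    return ph_from_h(kw / oh_conc)


print(f"pure water:  pH = {ph_from_h(1e-7):.1f}")
print(f"0.1 M HCl:   pH = {ph_from_h(0.1):.1f}")
print(f"0.01 M NaOH: pH = {ph_from_oh(0.01):.1f}")
```

The printed values should match the 7, 1 and 12 worked out by hand in the text.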

* * *

### pH in human biology
- Blood plasma: **pH ~7.4**  
- Cytosol: **pH ~7.2**  
- Lysosomes: **pH ~4.5–5.0** (acidic interior)  
- Mitochondrial matrix: **pH ~7.8**  
- Stomach: **pH ~1–2**  

So proteins encounter a wide range of proton concentrations, and their side-chain protonation states adjust accordingly.
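
To get a feeling for how wide this range is, the pH values listed above can be converted back into proton concentrations. The sketch below uses the midpoint of each quoted range (4.75 for lysosomes, 1.5 for the stomach), which is an assumption made only for illustration.

```python
# pH values from the list above; midpoints are used where a range was given (assumption)
compartment_pH = {
    "Blood plasma": 7.4,
    "Cytosol": 7.2,
    "Lysosome": 4.75,
    "Mitochondrial matrix": 7.8,
    "Stomach": 1.5,
}


def h_conc(pH):
    """Proton concentration in mol/L from the definition pH = -log10[H+]."""
    return 10.0 ** (-pH)


reference = h_conc(compartment_pH["Blood plasma"])
for name, pH in compartment_pH.items():
    conc = h_conc(pH)
    print(f"{name:22s} pH {pH:4.2f}  [H+] = {conc:.1e} M  ({conc / reference:.3g}x blood)")
```

Because the scale is logarithmic, the few pH units between blood and stomach correspond to a difference in proton concentration of nearly six orders of magnitude.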

* * *

## Part 1 — pKa and protonation

The **pKa** of a titratable group is the pH at which it is 50% protonated. According to the reasoning above (the ion product $[\text{H}_3\text{O}^+][\text{OH}^-] = 10^{-14}$), the pKa of water is 14. Note that this is not 7: pH 7 is where the concentrations of hydronium (H₃O⁺) and hydroxide (OH⁻) ions are equal.

We can calculate the **fraction** protonated of any compound (including amino acids of a given type) using the Henderson–Hasselbalch relation. This will allow us to talk about net effective charge in much clearer terms, so let's derive it and use it.

---

### Challenge 1: Derive the protonated fraction

Starting from the acid dissociation equilibrium

$$ HA \rightleftharpoons A^- + H^+ $$

and the definition of the acid dissociation constant

$$ K_a = \frac{[A^-][H^+]}{[HA]}, $$

**derive an expression for the protonated fraction**

$$ f_{prot} = \frac{[HA]}{[HA] + [A^-]} $$

in terms of pH and pKa.

<details>
  <summary> Hint (click to expand)</summary>
  <p> Use $\mathrm{p}K_a = -\log_{10}K_a$ and $\mathrm{pH} = -\log_{10}[H^+]$. </p>
</details>
<details>
<summary>Solution</summary>

From the equilibrium:

$$ \frac{[A^-]}{[HA]} = \frac{K_a}{[H^+]} $$

So

$$ f_{prot} = \frac{1}{1 + [A^-]/[HA]} = \frac{1}{1 + K_a/[H^+]} $$

Now substitute $K_a = 10^{-pK_a}$ and $[H^+] = 10^{-pH}$:

$$ f_{prot} = \frac{1}{1 + 10^{pH - pK_a}}. $$

This is the standard formula: at pH = pKa the fraction is 0.5, below pKa it approaches 1, above pKa it approaches 0.

</details>
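
As a quick numerical check of the derivation, the closed form can be compared against the fraction computed directly from $K_a$ and $[H^+]$. This is only a sketch; the function names and the example pKa of 4.0 are chosen for illustration.

```python
import math


def frac_prot_closed_form(pH, pKa):
    """Protonated fraction from the derived expression 1 / (1 + 10^(pH - pKa))."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


def frac_prot_from_equilibrium(pH, pKa):
    """Protonated fraction computed step by step from Ka and [H+]."""
    Ka = 10.0 ** (-pKa)
    h = 10.0 ** (-pH)
    ratio = Ka / h  # this is [A-]/[HA]
    return 1.0 / (1.0 + ratio)


pKa = 4.0  # example value
for pH in (pKa - 2, pKa - 1, pKa, pKa + 1, pKa + 2):
    a = frac_prot_closed_form(pH, pKa)
    b = frac_prot_from_equilibrium(pH, pKa)
    assert math.isclose(a, b, rel_tol=1e-9)
    print(f"pH = {pH:4.1f}: fraction protonated = {a:.4f}")
```

One pH unit away from the pKa the group is already about 91% in one form, and two units away about 99%, which is why the titration curves look so steep.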

## Part 2 — Plotting protonation curves

Let's visualize the protonation fractions for common amino acid side chains using the relation we just derived.

In [None]:
# @title Plot protonation curves for amino acids
import numpy as np
import matplotlib.pyplot as plt

def frac_prot(pH, pKa):
    return 1 / (1 + 10**(pH - pKa))

pH_range = np.linspace(0, 14, 500)
pKas = {
    "Asp": 4.0,
    "Glu": 4.4,
    "His": 6.0,
    "Cys": 8.3,
    "Tyr": 10.1,
    "Lys": 10.5,
    "Arg": 12.5,
}

plt.figure(figsize=(8,6))
for aa, pKa in pKas.items():
    plt.plot(pH_range, frac_prot(pH_range, pKa), label=aa)
plt.xlabel("pH")
plt.ylabel("Fraction protonated")
plt.title("Protonation curves of amino acid side chains")
plt.legend()
plt.grid(True, ls="--", lw=0.5)
plt.show()

**Question**
If cysteine (Cys) and tyrosine (Tyr) have pKa values in the same range as the charged amino acids, why are they not classified as charged themselves?
<details>
  <summary> Answer (click to expand)</summary>
  <p>Because at neutral pH (~7), cysteine and tyrosine are neutral: their protonated form carries no net charge. Deprotonation would result in a negative charge, but this only happens at high pH. </p>
</details>
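
The answer can be made quantitative by computing the average charge per side chain at pH 7. The sketch below reuses the model pKa values from the plot above and assumes that the protonated form is neutral for Asp, Glu, Cys and Tyr and carries +1 for His, Lys and Arg; the names `PROTONATED_CHARGE` and `mean_charge` are illustrative.

```python
PKA = {"Asp": 4.0, "Glu": 4.4, "His": 6.0, "Cys": 8.3,
       "Tyr": 10.1, "Lys": 10.5, "Arg": 12.5}

# Charge of the protonated form: 0 for acidic groups, +1 for basic groups
PROTONATED_CHARGE = {"Asp": 0, "Glu": 0, "Cys": 0, "Tyr": 0,
                     "His": 1, "Lys": 1, "Arg": 1}


def frac_prot(pH, pKa):
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


def mean_charge(residue, pH):
    """Population-averaged side-chain charge; deprotonation lowers the charge by one."""
    f = frac_prot(pH, PKA[residue])
    q_prot = PROTONATED_CHARGE[residue]
    return f * q_prot + (1.0 - f) * (q_prot - 1)


for residue in PKA:
    print(f"{residue}: average charge at pH 7 = {mean_charge(residue, 7.0):+.4f}")
```

With these values Cys comes out near -0.05 and Tyr well below -0.001 in magnitude, so both are essentially neutral at pH 7, while Asp and Glu are almost fully negative.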


---

## Part 3 — pI of a peptide

Since a protein contains many different kinds of charged amino acids in different abundances, it is useful to define the pH at which we expect the charges to balance out, so that the positive charges balance the negative ones. This is known as the protein's isoelectric point (pI). It is typically found by calculating the net charge as a function of pH and then finding the pH where it crosses zero.

### Challenge 2: Toy peptide
Consider a peptide with one Asp (pKa = 4) and one Lys (pKa = 10.5).  
- Below pH 4: net charge = +1 (Lys protonated, Asp neutral).  
- Between 4 and 10.5: net charge = 0 (Lys protonated, Asp deprotonated).  
- Above 10.5: net charge = –1 (both deprotonated).  

So the pI is roughly the midpoint: (4 + 10.5)/2 = 7.25.
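
The midpoint estimate can be checked by locating the zero crossing of the net charge numerically. The sketch below uses plain bisection, which works because the net charge decreases monotonically with pH; `isoelectric_point` is a hypothetical helper written for this example, not part of any library.

```python
def frac_prot(pH, pKa):
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


def net_charge(pH, acidic_pKas, basic_pKas):
    """Average net charge: acids contribute -1 when deprotonated, bases +1 when protonated."""
    negative = sum(1.0 - frac_prot(pH, pKa) for pKa in acidic_pKas)
    positive = sum(frac_prot(pH, pKa) for pKa in basic_pKas)
    return positive - negative


def isoelectric_point(acidic_pKas, basic_pKas, lo=0.0, hi=14.0, tol=1e-6):
    """Find the pH where the net charge crosses zero by bisection on [lo, hi]."""
    if net_charge(lo, acidic_pKas, basic_pKas) <= 0 or net_charge(hi, acidic_pKas, basic_pKas) >= 0:
        raise ValueError("net charge does not change sign in the given pH interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_charge(mid, acidic_pKas, basic_pKas) > 0:
            lo = mid  # still positive, so the pI lies at higher pH
        else:
            hi = mid
    return 0.5 * (lo + hi)


pI = isoelectric_point([4.0], [10.5])
print(f"pI of the Asp+Lys toy peptide: {pI:.2f}")
```

For one acid and one base the result agrees with the midpoint rule (7.25). With several groups of each kind the simple midpoint no longer applies, but the same root search still works.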

In [None]:
#@title Calculate charge curve of amino acid combinations
def net_charge(pH, acidic_groups, basic_groups):
    charge = 0.0
    for pKa in acidic_groups:
        fprot = frac_prot(pH, pKa)
        charge += (1-fprot)*(-1)
    for pKa in basic_groups:
        fprot = frac_prot(pH, pKa)
        charge += fprot*(+1)
    return charge


acids = "4.0" #@param {type:"string"}
bases = "10.5" #@param {type:"string"}
acids = [float(x) for x in acids.split(",")]
bases = [float(x) for x in bases.split(",")]

pH_range = np.linspace(0,14,500)
charges = [net_charge(pH, acids, bases) for pH in pH_range]

plt.figure(figsize=(7,5))
plt.plot(pH_range, charges)
plt.axhline(0, color='k', ls='--')
plt.xlabel("pH")
plt.ylabel("Net charge")
plt.title("Net charge of Asp+Lys dipeptide")
plt.grid(True, ls="--", lw=0.5)
plt.show()

### Challenge

1. Look up or otherwise find the pKa of glutamate (Glu), and add it to the list of acids (comma separated).
2. Do the same for histidine, adding it to the list of bases (this is not as easy to get right).


---

## Part 4 — How many hydronium ions in a simulation box?
Remember:
> pH is a measure of the concentration of hydrogen ions (hydronium ions) in water

A handy reference to calculate anything related to the fraction of something in water based on molarity is that pure water is **~55 M**, meaning that 1L of water contains 55 moles of water molecules. For anything very dilute in water, this holds true.

- At pH 7, $[H_3O^+] = 10^{-7}\,\text{M}$, (that is what the pH *means*) so the fraction of protonated water molecules is:

$$
\frac{10^{-7}}{55} \approx 2 \times 10^{-9}
$$

So only **1 in every 500 million** water molecules is hydronium at neutral pH! I don't know about you, but to me this seems oddly small. Of course, this just reflects the rate at which water spontaneously breaks into hydroxide and hydrogen ions compared to the rate at which they recombine into water. But remember: this is what determines the charge of amino acids. Any small deviation from pH 7 reflects a **TINY** change in free hydronium/hydroxide, and that is what dictates the charge state of most charged amino acids.

---

Let's explore some more practicalities related to this, in relation to simulations. A typical molecular dynamics simulation box has a side of about 10 nm:

- Volume: $(10\,\text{nm})^3 = 1000\,\text{nm}^3 = 10^{-21}\,\text{L}$.
- Number of water molecules:  

$$
55\,\text{mol/L} \times 10^{-21}\,\text{L} \times N_A \approx 3 \times 10^4
$$


**Question** Based on this, what is the expected number of hydronium ions in this box if we wanted to simulate pH 7?

<details>
  <summary> Answer (click to expand)</summary>
  <p>
  $$3 \times 10^4 \times 2 \times 10^{-9} \approx 6 \times 10^{-5}$$
That is, there is far less than one hydronium per box on average, so the most reasonable number is 0.
  </p>
</details>

**Question** Simulations need (for reasons we won't go into here) to have a net charge of 0. So if your protein has a theoretical pI of 6 and a predicted net charge of -3 at pH 7, what do you do?

<details>
  <summary> Hint 1 (click to expand)</summary>
  You do NOT choose protonation states to attain net 0 charge. Why not?<p>
  </p>
</details>
<details>
  <summary> Hint 2 (click to expand)</summary>
  You do NOT add hydronium or hydroxide. Why not?<p>
  </p>
</details>
<details>
  <summary> Answer (click to expand)</summary>
  You typically add ions to the solvent at roughly the concentration expected in cells (isotonic), around 150 mM. The number of ions of each kind (typically Na⁺ and Cl⁻) is adjusted slightly to counter the net charge.<p>
  </p>
</details>
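
To see what this means in numbers, the ion counts for the 10 nm box can be estimated as follows. This is a rough sketch: it uses the full box volume (ignoring the space taken up by the protein) and neutralizes by adding extra counter-ions, which is only one possible convention. The function `ion_counts` is hypothetical; simulation packages ship their own tools for this step.

```python
AVOGADRO = 6.022e23


def ion_counts(box_len_nm, salt_molar, protein_charge):
    """Return (n_Na, n_Cl) for a cubic box so that the whole system is neutral.

    protein_charge is the integer net charge of the solute.
    """
    volume_L = (box_len_nm * 1e-9) ** 3 * 1e3  # m^3 -> L
    n_pairs = round(salt_molar * volume_L * AVOGADRO)
    n_na, n_cl = n_pairs, n_pairs
    if protein_charge < 0:
        n_na += -protein_charge  # extra cations balance a negative protein
    else:
        n_cl += protein_charge   # extra anions balance a positive protein
    return n_na, n_cl


n_na, n_cl = ion_counts(box_len_nm=10, salt_molar=0.150, protein_charge=-3)
print(f"Na+: {n_na}, Cl-: {n_cl}, total charge: {n_na - n_cl - 3}")
```

For the example in the question this gives about 90 ion pairs plus 3 extra Na⁺, so the correction for the protein charge is small compared to the salt that is added anyway.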

**Question** What is the expected number of hydronium ions in this box if we wanted to simulate pH 2?
<details>
  <summary> Answer (click to expand)</summary>
  <p>
$$
\text{fraction} = 10^{-2}/55 \approx 2 \times 10^{-4}; \quad
\text{hydronium count} \approx 3 \times 10^4 \times 2 \times 10^{-4} \approx 6
$$
Even at strongly acidic pH, only a few hydronium ions are present. It is very rare that we simulate anything other than pH 7, but there are simulation types that use hydronium to emulate true pH and protonation. We will not dive into those during this course, but they are known as *constant-pH* simulations.
  </p>
</details>


For good measure, let's plot the expected number of hydronium ions in the box as a function of pH:

In [None]:
# @title Plot hydronium concentration by pH
def hydronium_count(pH, box_len_nm=10):
    V_L = (box_len_nm * 1e-9)**3 * 1e3
    water_conc = 55.0
    n_water = water_conc * V_L * 6.022e23
    hydronium_frac = 10**(-pH) / water_conc
    return n_water, n_water * hydronium_frac

pH_vals = np.linspace(0,14,200)
hyds = [hydronium_count(pH)[1] for pH in pH_vals]

plt.semilogy(pH_vals, hyds)
plt.xlabel("pH")
plt.ylabel("Expected hydronium ions in 10 nm box")
plt.title("Hydronium count vs. pH")
plt.grid(True, which="both", ls="--", lw=0.5)
plt.show()

---

## Part 5 — Proton exchange rate at protein surface (diffusion-limited)

OK, so with more hydronium in the water, these charged sidechains react to become protonated, or sometimes deprotonated. The rate at which such protonation reactions occur should then depend on the hydronium concentration, but also on how often a hydronium ion comes into contact with a given charged sidechain. Let's focus on a single such sidechain exposed to the solvent surrounding a protein and see if we can understand how often these very rare hydronium ions come along.

Water diffuses in bulk with $D_{\text{bulk}} \approx 2 \times 10^{-9}\,\text{m}^2/\text{s}$.

In the **hydration layer at protein surfaces**, diffusion is **slowed by about 2×** compared to bulk.

Thus, we take $D_{\text{surface}} \approx D_{\text{bulk}}/2$.

Characteristic time to replace a water molecule (diameter $L \approx 0.3\,\text{nm}$):

$$
\tau_{\text{water}} \sim \frac{L^2}{6 D_{\text{surface}}}
$$

At pH 7, only 1 water in $\sim 5 \times 10^8$ is hydronium, so the effective hydronium encounter time is:

$$
\tau_{\mathrm{H_3O^+}} \approx \frac{\tau_{\text{water}}}{\text{fraction hydronium}}
$$

In [None]:
#@title Calculate hydronium exchange rates for sidechains
pH = 7 #@param{type:"number"}

if pH < 1 or pH > 14:
    print("pH must be between 1 and 14")
else:
    D_bulk = 2e-9
    D_surface = D_bulk / 2
    L = 0.3e-9

    tau_water = L**2 / (6*D_surface)
    hydronium_frac = 10**(-pH) / 55
    tau_hydronium = tau_water / hydronium_frac

    print(f"τ_water (surface): {tau_water:.2e} s (~{tau_water*1e12:.1f} ps)")
    print(f"τ_hydronium at pH {pH}: {tau_hydronium:.2e} s (~{tau_hydronium*1e3:.3f} ms)")

Here we are reminded just how fast things move on these tiny length scales. Because each sidechain is visited by something like 70 billion water molecules every second ($1/\tau_{\text{water}}$), it only takes several milliseconds (about 8 ms with the numbers above) at pH 7 to encounter a hydronium ion, even though they are only 1 in 500 million water molecules.

This is to say that we can expect charged sidechains exposed to solvent to respond very quickly to changes in pH or other macroscopic factors. It also means that what we consider fast in experiments (milliseconds) is very hard to attain in simulations.
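
To see how strongly this timescale depends on pH, the same estimate can be repeated for a few pH values. This is a sketch using the same crude model as above (hydronium treated like any other water molecule, hydroxide ignored); the constant and function names are illustrative.

```python
D_BULK = 2e-9            # m^2/s, bulk water diffusion coefficient from the text
D_SURFACE = D_BULK / 2   # assumed 2x slowdown in the hydration layer
L_WATER = 0.3e-9         # m, approximate diameter of a water molecule
WATER_MOLAR = 55.0       # mol/L


def tau_water():
    """Characteristic time to replace one water molecule at the surface."""
    return L_WATER ** 2 / (6 * D_SURFACE)


def tau_hydronium(pH):
    """Expected waiting time until the visiting water happens to be a hydronium."""
    hydronium_frac = 10.0 ** (-pH) / WATER_MOLAR
    return tau_water() / hydronium_frac


for pH in (2, 4, 7, 9):
    print(f"pH {pH}: tau_hydronium ~ {tau_hydronium(pH):.2e} s")
```

Each pH unit changes the waiting time by a factor of ten: within this model it is tens of nanoseconds at pH 2 but close to a second at pH 9.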

In experiments, fluctuations in the charge of a sidechain due to local changes in pH (e.g. inside organelles or at membrane surfaces) or interactions with other charged groups may result in small disturbances that could contribute to dynamics or interactions with other proteins. This occurs on timescales that are very hard to model by simulations, where we also tend to avoid chemistry of this kind due to computational limits.

There's also the question of protonation of residues inside the protein, where exchange can be much slower, taking days or even weeks. Given the hydrophobic core of most proteins, charged amino acids tend to be solvent exposed, but they are nonetheless not uncommon inside proteins.



* * *

## Part 6 — Protein interiors: dielectric effects and slowed exchange

The environment inside a protein is significantly different from the surrounding aqueous solution, particularly concerning its **dielectric properties**.

- **Water:** has a high dielectric constant ($\varepsilon_r \approx 80$). This high value means that water molecules are very effective at screening electrostatic interactions between charged species. They can orient their dipoles around ions, effectively reducing the force of attraction or repulsion between them.
- **Protein interior:** typically has a much lower dielectric constant ($\varepsilon_r \approx 2$–$4$). This is because the protein core is often composed of nonpolar amino acid side chains (like alanine, valine, leucine, isoleucine, phenylalanine, methionine, and tryptophan) and the protein backbone, which have limited ability to reorient their dipoles to screen charges.

Let's look at the Coulomb interaction energy, which describes the potential energy between two charges:

$$
E = \frac{1}{4\pi \varepsilon_0 \varepsilon_r} \cdot \frac{q_1 q_2}{r}
$$

where:
- $E$ is the interaction energy.
- $\varepsilon_0$ is the vacuum permittivity (a constant).
- $\varepsilon_r$ is the relative permittivity (dielectric constant) of the medium.
- $q_1$ and $q_2$ are the two charges (including sign, so opposite charges give a negative, attractive energy).
- $r$ is the distance between the charges.

This formula shows that a **lower dielectric constant ($\varepsilon_r$) leads to a much stronger electrostatic interaction energy ($E$)** for a given distance between charges. This has significant implications for charged residues located in the protein interior. Burying a charged group in a low-dielectric environment is energetically unfavorable because the charge is poorly screened, leading to strong, unfavorable interactions with other charges or even with the protein backbone's partial charges.

This is why charged amino acids (Asp, Glu, Lys, Arg) are typically found on the surface of proteins, exposed to the high dielectric of water. When they are found in the protein interior, they are almost always involved in stabilizing interactions, such as forming **salt bridges** (an electrostatic interaction between a positively charged and a negatively charged amino acid side chain) or strong **hydrogen bonds** with other polar or charged groups that help to compensate for the energy cost of the desolvation and low dielectric environment.
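
To put the Coulomb expression on a physically meaningful scale, the energy of a single ion pair can be compared to the thermal energy $k_B T$. The sketch below assumes a salt bridge at 0.5 nm separation, a temperature of 298 K, and the dielectric constants quoted above (80 for water, 4 for the protein interior); all names are illustrative.

```python
import math

EPS0 = 8.854e-12      # F/m, vacuum permittivity
E_CHARGE = 1.602e-19  # C, elementary charge
K_B = 1.381e-23       # J/K, Boltzmann constant
AVOGADRO = 6.022e23   # 1/mol


def coulomb_energy(q1, q2, r_m, eps_r):
    """Coulomb energy in joules between signed charges q1, q2 (C) at distance r_m (m)."""
    return q1 * q2 / (4 * math.pi * EPS0 * eps_r * r_m)


T = 298.0   # K, assumed temperature
kT = K_B * T
r = 0.5e-9  # m, assumed charge separation

for label, eps_r in (("water", 80.0), ("protein interior", 4.0)):
    E = coulomb_energy(+E_CHARGE, -E_CHARGE, r, eps_r)
    print(f"{label:17s}: {E:.2e} J = {E / kT:6.1f} kT = {E * AVOGADRO / 1000:6.1f} kJ/mol")
```

With these inputs the attraction is only a little more than 1 $k_B T$ in water, so thermal motion easily breaks the pair, but close to 30 $k_B T$ at $\varepsilon_r = 4$.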

**Proton exchange inside proteins:**

In addition to dielectric effects, proton exchange rates are also significantly different inside proteins compared to the surface.

- **Protein surface:** Charged side chains are exposed to the bulk solvent, allowing for rapid diffusion of hydronium and hydroxide ions and fast proton exchange. As we saw in Part 5, even at neutral pH, protonation/deprotonation can occur on the millisecond timescale for exposed residues: the high collision frequency with water molecules compensates for the low concentration of hydronium ions.
- **Protein interior:** Proton exchange is much slower due to **limited water access**. The hydrophobic core of a protein tends to exclude water, yet water is needed to mediate proton transfer reactions. If a titratable residue is buried, its access to solvent is restricted, severely slowing down the rate at which it can gain or lose a proton. This means that buried charged residues can have their protonation states effectively **kinetically trapped**. Their protonation state might reflect the pH of the environment when the protein folded, and it may take a very long time (hours, days, or even weeks) for it to adjust to a change in the surrounding pH. This can impact protein stability, dynamics, and function, especially in processes that involve large conformational changes or occur in different cellular compartments with varying pH.

Understanding these factors is crucial for accurately modeling protein behavior, especially in simulations, where implicitly treating the protein interior as a uniform low-dielectric medium and ignoring the kinetics of proton exchange can lead to inaccurate predictions of pKa values and charge states for buried residues.

In [None]:
#@title Calculate the interaction strength based on permittivity
eps0 = 8.85e-12
q = 1.6e-19
r = 0.5e-9

eps_water = 80 #@param
eps_protein = 4 #@param

E_water = (1/(4*np.pi*eps0*eps_water))*q*q/r
E_protein = (1/(4*np.pi*eps0*eps_protein))*q*q/r

print(f"Coulomb energy in water: {E_water:.2e} J (~{E_water/1.6e-19:.2f} eV)")
print(f"Coulomb energy in protein interior: {E_protein:.2e} J (~{E_protein/1.6e-19:.2f} eV)")
print(f"Ratio (protein/water): {E_protein/E_water:.1f}×")