Identification of the cysteines that bind the ligand

In [1]:
from Bio import SeqIO
record = SeqIO.read("./3iwl/rcsb_pdb_3IWL.fasta", "fasta")
cys_positions = [i+1 for i, aa in enumerate(str(record.seq)) if aa == "C"]
print(cys_positions)


[12, 15, 41]


Of these, Cys 12 and Cys 15 are the relevant (ligand-binding) cysteines.
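The positions printed above are 1-based, which matches the residue numbering used in the rest of this notebook. As a quick sanity check of that convention, here is a minimal sketch of the same lookup without Biopython. The helper name `cysteine_positions` and the toy sequence are made up for illustration; the toy sequence is not the 3IWL sequence.

```python
def cysteine_positions(seq: str) -> list[int]:
    """Return the 1-based positions of every cysteine (C) in a protein sequence."""
    return [i for i, aa in enumerate(seq, start=1) if aa == "C"]


# Toy sequence for illustration only (NOT the 3IWL sequence).
toy_seq = "MACDEFCGH"
print(cysteine_positions(toy_seq))
```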

In [2]:
%%bash
rm -rf library/*

# Generation of mutated sequences

The reactivity of a cysteine can be increased by charged groups in its vicinity. We therefore create a set of mutated sequences in which amino acids within a defined range around the cysteines at positions 12 and 15 are replaced by arginine or lysine.

The code below generates full or partial sets of mutated sequences, depending on the value of the `variants` variable.

The output is a FASTA file that contains the wild type as well as all generated variants.
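To get a feeling for how large each library type becomes, the sizes can be worked out by hand. The sketch below assumes the settings used in this notebook (positions 5 to 21, Cys 12 and Cys 15 fixed, K or R at every other position); the variable names `n_sites`, `n_single`, `n_double` and `n_full` are introduced here only for illustration.

```python
from math import comb

design_range = range(5, 22)      # positions 5-21, same as the notebook settings
catalytic = {12, 15}             # cysteines that must stay unchanged
n_sites = len([p for p in design_range if p not in catalytic])
n_choices = 2                    # K or R at each mutable position

n_single = n_sites * n_choices                  # one mutated site
n_double = comb(n_sites, 2) * n_choices ** 2    # two mutated sites
n_full = (n_choices + 1) ** n_sites - 1         # K, R or WT per site, minus the all-WT case

print(f"mutable sites: {n_sites}")
print(f"single: {n_single:,}  double: {n_double:,}  full: {n_full:,}")
```

With these settings the "full" library would hold more than 14 million sequences, which is presumably why a random subset of 200,000 (roughly 1.4% of that space) is generated instead.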

In [3]:
from itertools import combinations, product
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from pathlib import Path
import random, gzip

# ───────────────────────────  USER SETTINGS  ────────────────────────────
fasta_in     = "./3iwl/rcsb_pdb_3IWL.fasta"
design_range = range(5, 22)          # inclusive (5–21)
catalytic    = {12, 15}              # MUST stay C
variants     = "random"              # "single", "double", "full", "random"
lib_size     = 200000                # only used for "random"
library_path = Path("library/library.fasta.gz")   # .gz → compressed, .fasta → plain
# ────────────────────────────────────────────────────────────────────────

record  = SeqIO.read(fasta_in, "fasta")
wt_seq  = str(record.seq)

# positions we’re allowed to mutate
positions = [p for p in design_range if p not in catalytic]
choices   = {pos: ("K", "R") for pos in positions}

# ── build mut_lists according to the library type ───────────────────────
mut_lists = []

if variants == "single":
    mut_lists = [[(p, aa)] for p in positions for aa in choices[p]]

elif variants == "double":
    mut_lists = [[(p1, a1), (p2, a2)]
                 for (p1, p2) in combinations(positions, 2)
                 for (a1, a2) in product(choices[p1], choices[p2])]

elif variants == "full":
    per_site = [(*choices[p], None) for p in positions]          # None = keep WT
    for aa_tuple in product(*per_site):
        mut = [(p, aa) for p, aa in zip(positions, aa_tuple) if aa is not None]
        if mut:                                                  # drop all-WT case
            mut_lists.append(mut)

elif variants == "random":
    seen, rng = set(), random.Random(42)
    while len(mut_lists) < lib_size:
        k   = rng.randint(1, len(positions))                     # number of sites to mutate
        pos = rng.sample(positions, k)
        mut = [(p, rng.choice(choices[p])) for p in pos]
        sig = tuple(sorted(mut))
        if sig not in seen:
            seen.add(sig)
            mut_lists.append(mut)

else:
    raise ValueError("variants must be one of: single, double, full, random")

# ── assemble SeqRecord objects ──────────────────────────────────────────
base_id = record.id.split("|", 1)[0]           # keep only the text before the first "|"
records = []

# optional: include wild-type first
records.append(
    SeqRecord(Seq(wt_seq),
              id="WT|WT|WT",         # three fields separated by “|”
              description="wild-type")
)


for i, muts in enumerate(mut_lists, 1):
    seq_list = list(wt_seq)
    tags     = []
    for pos, aa in muts:
        idx = pos - 1
        tags.append(f"{wt_seq[idx]}{pos}{aa}")
        seq_list[idx] = aa

    var_seq  = "".join(seq_list)
    tag_str  = "-".join(tags)
    var_id   = f"{base_id}|VAR{i:05d}|{tag_str}"
    records.append(SeqRecord(Seq(var_seq), id=var_id, description=""))

print(f"Built {len(records):,} total sequences (WT + variants)")

# ── write them all at once ──────────────────────────────────────────────
library_path.parent.mkdir(parents=True, exist_ok=True)

if library_path.suffix == ".gz":
    with gzip.open(library_path, "wt") as handle:
        SeqIO.write(records, handle, "fasta")
else:
    with library_path.open("wt") as handle:
        SeqIO.write(records, handle, "fasta")

size_mb = library_path.stat().st_size / 1_000_000
print(f"Wrote {len(records):,} sequences → {library_path}  ({size_mb:.2f} MB)")


Built 200,001 total sequences (WT + variants)
Wrote 200,001 sequences → library/library.fasta.gz  (4.07 MB)
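Each variant ID ends in a tag string that lists its mutations as `<WT residue><position><new residue>`, joined by `-`. The sketch below shows how such a tag string could be decoded to rebuild a variant from the wild type. The helper `apply_tags` and the toy sequence are hypothetical and not part of the notebook.

```python
def apply_tags(wt_seq: str, tag_str: str) -> str:
    """Rebuild a variant sequence from a tag string such as 'C2K-E4R'."""
    seq = list(wt_seq)
    for tag in tag_str.split("-"):
        # A tag is: WT residue, 1-based position, replacement residue.
        wt_aa, pos, new_aa = tag[0], int(tag[1:-1]), tag[-1]
        if seq[pos - 1] != wt_aa:
            raise ValueError(f"{tag}: expected {wt_aa} at position {pos}")
        seq[pos - 1] = new_aa
    return "".join(seq)


# Toy wild type for illustration only.
toy_wt = "ACDEFG"
print(apply_tags(toy_wt, "C2K-E4R"))   # AKDRFG
```

Checking the wild-type residue in each tag is a cheap guard against off-by-one errors in the position numbering.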


[sbPCR](https://github.com/wendao/sbPCR) is a machine learning model that predicts the reactivity of cysteines from the sequence alone. The accompanying paper by Wang et al. can be found [here](https://pubs.acs.org/doi/10.1021/acs.biochem.7b00897).
Please note that `libsvm` must be installed in order to use sbPCR.

In [4]:
%%bash
git clone https://github.com/wendao/sbPCR

Cloning into 'sbPCR'...


We clear the `sbPCR_run` directory of previous results and decompress our sequence library. We then pass the sequences to sbPCR as outlined in its documentation. Please note that the "Accuracy" line in the output is an artifact of `libsvm` and has no bearing on this task.

In [5]:
%%bash
rm -rf sbPCR_run && mkdir sbPCR_run
gunzip -c library/library.fasta.gz > library/library.fasta

cd sbPCR_run
python ../sbPCR/scripts/get_align.py  ../library/library.fasta
python ../sbPCR/scripts/get_feature.py
svm-predict  formated_input  ../sbPCR/models/train_v1.model  predict


Accuracy = 88.2122% (529276/600003) (classification)
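`svm-predict` computes this figure by comparing its predictions with the label column of the input file. We assume here that those labels are only placeholders, which is why the percentage itself is meaningless. The row count is still useful as a sanity check: the sketch below parses the line shown above and confirms that there is one prediction per cysteine per sequence.

```python
import re

accuracy_line = "Accuracy = 88.2122% (529276/600003) (classification)"
n_sequences = 200_001            # WT + variants written to the library

# Pull the "(matches/rows)" pair out of the svm-predict summary line.
m = re.search(r"\((\d+)/(\d+)\)", accuracy_line)
n_match, n_rows = int(m.group(1)), int(m.group(2))

rows_per_sequence = n_rows / n_sequences
print(f"rows per sequence: {rows_per_sequence}")
print(f"rows not matching the input labels: {n_rows - n_match:,}")
```

Three rows per sequence is consistent with the three cysteines (positions 12, 15 and 41) found at the start of the notebook.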


We can now extract the results of the sbPCR run. We count the variants for which sbPCR predicts both Cys 12 and Cys 15 to be reactive, report that count as a percentage of the whole library, and store the corresponding sequences (together with the wild type) in `filtered_fastas/filtered_sequences.fasta`. On average, about 10% of sequences are deemed reactive by sbPCR. Notably, the original wild type is NOT predicted to be reactive at both cysteines. sbPCR judges reactivity based on empirical data in which the top 10% most reactive samples received the label "reactive".

In [9]:
import pandas as pd, pathlib
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio import SeqIO

wd = pathlib.Path("sbPCR_run")

# 1 ─ read sbPCR outputs (label only)
align = pd.read_csv(
    wd / "align",  sep=r"\s+",  header=None,
    names=["variant", "pos", "window"]
)
pred  = pd.read_csv(
    wd / "predict", sep=r"\s+", header=None,
    names=["label"], dtype={"label": int}
)

data = pd.concat([align, pred], axis=1)

# 2 ─ is WT reactive at both catalytic sites?
wt_rows = data[data.variant.str.startswith("WT") & data.pos.isin([12, 15])]
wt_reactive = (len(wt_rows) == 2) and (wt_rows["label"] == 1).all()
print("WT reactive at both cysteines = ", wt_reactive)

# 3 ─ variants with *both* sites reactive
both_hits = (
    data.query("pos in [12, 15] and label == 1")
        .groupby("variant")
        .size()
        .loc[lambda s: s == 2]
        .index
)

n_hits = len(both_hits)
lib_size = data.variant.nunique()
pct = n_hits / lib_size * 100

print(f"{n_hits}/{lib_size} = {pct:.2f}% variants have both Cys 12 and Cys 15 predicted reactive")

# 4 ─ Load full-length sequences and map VAR IDs

# Load full sequences as dictionary: {full_id : SeqRecord}
full_sequences = SeqIO.to_dict(SeqIO.parse("library/library.fasta", "fasta"))

# Build mapping: {VARxxxxx : SeqRecord}
id_map = {}
for full_id, record in full_sequences.items():
    parts = full_id.split("|")
    if len(parts) >= 2:
        var_id = parts[1]  # e.g. VAR00020
        id_map[var_id] = record

# 5 ─ Extract filtered variants
filtered = data[data.variant.isin(both_hits)].drop_duplicates("variant")

# 6 ─ Build SeqRecords using full-length sequences

# First, include the wild-type
wt_record = None
for full_id, record in full_sequences.items():
    if full_id.startswith("WT|WT|WT"):
        wt_record = SeqRecord(record.seq, id="WT", description="wild-type")
        break

# Collect records: WT first, then variants
records = []

if wt_record:
    records.append(wt_record)

# Add variants
records.extend([
    SeqRecord(
        id_map[row.variant].seq,
        id=row.variant,
        description=""
    )
    for row in filtered.itertuples()
])

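# 7 ─ Write full-length filtered sequences (create the output folder if needed)
pathlib.Path("filtered_fastas").mkdir(parents=True, exist_ok=True)
SeqIO.write(records, "filtered_fastas/filtered_sequences.fasta", "fasta")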


WT reactive at both cysteines =  False
20446/200001 = 10.22% variants have both Cys 12 and Cys 15 predicted reactive


20447
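The "both sites reactive" filter used above is easiest to follow on a small table. The sketch below applies the same select, group and count pattern to a toy prediction table; the labels are made up for illustration and are not sbPCR output.

```python
import pandas as pd

# Toy prediction table (made-up labels): one row per cysteine per variant.
toy = pd.DataFrame({
    "variant": ["WT"] * 3 + ["VAR00001"] * 3 + ["VAR00002"] * 3,
    "pos":     [12, 15, 41] * 3,
    "label":   [0, 1, 0,   1, 1, 0,   1, 0, 1],
})

sites = [12, 15]
hits = (
    toy[toy["pos"].isin(sites) & (toy["label"] == 1)]   # reactive rows at the sites of interest
       .groupby("variant")
       .size()                                          # reactive sites per variant
       .loc[lambda s: s == len(sites)]                  # keep variants where every site is reactive
       .index
       .tolist()
)
print(hits)
```

Only `VAR00001` passes: the toy wild type is reactive at Cys 15 alone, and `VAR00002` is reactive at Cys 12 and Cys 41 but not at Cys 15.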

In [18]:
#!/usr/bin/env python3
"""
Minimal “CIF → PROPKA” workflow that prints one reliable pKa per
titratable cysteine.

Requirements
------------
  gemmi     (pip install gemmi)
  propka3   (≥ 3.5; callable as `propka3`)
"""

from pathlib import Path
import subprocess
import re
import gemmi

# ── USER SETTINGS ────────────────────────────────────────────
cif_input      = 'alphafold/VAR00020.cif'           # mmCIF file
pdb_output     = 'example_from_cif.pdb'         # temporary PDB
propka_output  = pdb_output.replace('.pdb', '.pka')
# ─────────────────────────────────────────────────────────────

# 1 ▸ CIF → PDB ------------------------------------------------------------
doc = gemmi.cif.read_file(cif_input)
gemmi.make_structure_from_block(doc.sole_block()).write_pdb(pdb_output)
print(f'Wrote {pdb_output}')

# 2 ▸ PROPKA ---------------------------------------------------------------
subprocess.run(['propka3', pdb_output], check=True)
if not Path(propka_output).is_file():
    raise FileNotFoundError(f'PROPKA output {propka_output} not found')
print('PROPKA finished')

# 3 ▸ Extract one non-zero pKa per cysteine (header-independent) -----------
cys_line = re.compile(
    r'^\s*(?:\d+\s+)?'       # optional leading index
    r'CYS\s+'                # residue name
    r'(\d+)\s+'              # residue number           (group 1)
    r'([A-Za-z0-9]?)\s+'     # optional chain ID        (group 2)
    r'([0-9]+\.[0-9]+)'      # pKa value                (group 3)
)

cys_pka = {}
with open(propka_output) as f:
    for line in f:
        m = cys_line.match(line)
        if not m:
            continue

        resnum_str, chain, pka_str = m.groups()
        resnum = int(resnum_str)   # integer key so residues sort numerically
        pka = float(pka_str)

        # ignore “0.00 / 99.99” artefacts and keep first non-zero value
        if 0.1 < pka < 99.0 and (chain, resnum) not in cys_pka:
            cys_pka[(chain, resnum)] = pka

# 4 ▸ Report ---------------------------------------------------------------
if cys_pka:
    print('\nCysteine pKa values')
    for (chain, resnum), pka in sorted(cys_pka.items()):
        chain_display = chain if chain else '<no-ID>'
        print(f'  CYS {chain_display}{resnum:>4}: {pka:.2f}')
else:
    print('\nNo titratable cysteines reported (likely all in disulfide bonds)')


Wrote example_from_cif.pdb
PROPKA finished

Cysteine pKa values
  CYS A  12: 8.34
  CYS A  15: 9.83
  CYS A  41: 9.41
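To see what the cysteine regex accepts and what the artefact filter drops, it helps to run it on a few lines in isolation. The sample lines below are hand-written in the style of a PROPKA summary table and are illustrative only; they are not copied from a real `.pka` file.

```python
import re

cys_line = re.compile(
    r'^\s*(?:\d+\s+)?'       # optional leading index
    r'CYS\s+'                # residue name
    r'(\d+)\s+'              # residue number
    r'([A-Za-z0-9]?)\s+'     # optional chain ID
    r'([0-9]+\.[0-9]+)'      # pKa value
)

# Hand-written sample lines (illustrative, not real PROPKA output).
sample = [
    "   ASP  48 A     3.02       3.80",   # not a cysteine: no match
    "   CYS  12 A     8.34       9.00",   # ordinary cysteine line
    "   CYS  15 A    99.99       9.00",   # placeholder value: matched but filtered out
]

parsed = []
for line in sample:
    m = cys_line.match(line)
    if m:
        resnum, chain, pka = int(m.group(1)), m.group(2), float(m.group(3))
        keep = 0.1 < pka < 99.0          # same artefact filter as in the cell above
        parsed.append((resnum, chain, pka, keep))
        print(resnum, chain, pka, "kept" if keep else "skipped")
```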


In [22]:
#!/usr/bin/env python3
"""
Batch “CIF → PROPKA” runner for the alphafold/ directory.
Focuses on CYS 12 and CYS 15 only.

Output
------
• Wild-type (3IWL) pKa values
• Mean and s.d. for each site across all variants
• Top-10 variants with the largest |ΔpKa| at either site
"""

from pathlib import Path
import subprocess
import statistics as stats
import tempfile
import re
import gemmi
import shutil
import sys

# ────────────────────────── USER SETTINGS ──────────────────────────
folder           = Path('alphafold')     # directory containing *.cif
wild_type_file   = '3iwl.cif'            # case-exact filename
propka_exe       = 'propka3'             # command in $PATH
sites_of_interest = {12, 15}             # residue numbers
# ───────────────────────────────────────────────────────────────────

# Regex that matches a normal PROPKA Cys line (with or without leading index)
cys_line = re.compile(
    r'^\s*(?:\d+\s+)?'        # optional index
    r'CYS\s+'
    r'(\d+)\s+'               # residue number   – group 1
    r'([A-Za-z0-9]?)\s+'      # chain ID (0–1 ch) – group 2
    r'([0-9]+\.[0-9]+)'       # pKa value         – group 3
)

def extract_cys_pka(pka_file: Path) -> dict[int, float]:
    """Return {resnum: pKa} for sites_of_interest in one PROPKA .pka file."""
    result: dict[int, float] = {}
    with pka_file.open() as fh:
        for line in fh:
            m = cys_line.match(line)
            if not m:
                continue
            resnum = int(m.group(1))
            if resnum not in sites_of_interest:
                continue
            pka = float(m.group(3))
            if 0.1 < pka < 99.0 and resnum not in result:  # first non-artefact wins
                result[resnum] = pka
    return result

def run_one_cif(cif_path: Path, workdir: Path) -> dict[int, float]:
    """Convert CIF→PDB, run PROPKA in *workdir*, return pKa dict."""
    pdb_path = workdir / (cif_path.stem + '.pdb')
    # CIF → PDB (Gemmi)
    structure = gemmi.make_structure_from_block(
        gemmi.cif.read_file(str(cif_path)).sole_block()
    )
    structure.write_pdb(str(pdb_path))

    # PROPKA (run inside workdir so output lands there)
    subprocess.run([propka_exe, pdb_path.name],
                   cwd=workdir,
                   stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL,
                   check=True)

    pka_file = pdb_path.with_suffix('.pka')
    if not pka_file.is_file():
        raise FileNotFoundError(f'Expected {pka_file} not written by PROPKA')

    return extract_cys_pka(pka_file)

# ────────────────────────── MAIN WORK ─────────────────────────────
if not folder.is_dir():
    sys.exit(f'Error: directory {folder} does not exist')

records: dict[str, dict[int, float]] = {}   # {variant_name: {resnum: pKa}}

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)

    for cif in sorted(folder.glob('*.cif')):
        try:
            pka_dict = run_one_cif(cif, workdir)
            if pka_dict:            # store only if at least one site was found
                records[cif.stem] = pka_dict
        except Exception as exc:
            print(f'[WARN] {cif.name}: {exc}', file=sys.stderr)

# ensure wild-type is present
wt_name = Path(wild_type_file).stem
if wt_name not in records:
    sys.exit(f'Wild-type file {wild_type_file} did not yield data – aborting.')

wt = records[wt_name]

# ────────────────────────── STATISTICS ────────────────────────────
# Note: records still contains the wild type, so it is pooled into these statistics.
stats_summary = {site: [] for site in sites_of_interest}
for rec in records.values():
    for site in sites_of_interest:
        if site in rec:
            stats_summary[site].append(rec[site])

# rank variants by maximum |ΔpKa| at the two sites
ranked: list[tuple[float, str, dict[int, float]]] = []
for name, rec in records.items():
    if name == wt_name:
        continue
    delta = max(
        abs(rec.get(site, wt[site]) - wt[site])
        for site in sites_of_interest
        if site in wt
    )
    ranked.append((delta, name, rec))

ranked.sort(reverse=True)
top10 = ranked[:10]

# ────────────────────────── REPORT ───────────────────────────────
print('\nWild-type pKa values (3IWL)')
for site in sorted(sites_of_interest):
    if site in wt:
        print(f'  CYS {site:>4}: {wt[site]:.2f}')
    else:
        print(f'  CYS {site:>4}: not titratable (disulfide)')

print('\nSummary statistics across all variants')
for site in sorted(sites_of_interest):
    values = stats_summary[site]
    if values:
        mean = stats.mean(values)
        sd   = stats.stdev(values) if len(values) > 1 else 0.0
        print(f'  CYS {site:>4}: mean = {mean:.2f}, sd = {sd:.2f}, n = {len(values)}')
    else:
        print(f'  CYS {site:>4}: no data')

print('\nTop 10 variants by |ΔpKa| (largest site difference)')
for delta, name, rec in top10:
    parts = []
    for site in sorted(sites_of_interest):
        if site in rec and site in wt:
            diff = rec[site] - wt[site]
            parts.append(f'CYS {site} {rec[site]:.2f} (Δ{diff:+.2f})')
    print(f'  {name:25}  {"; ".join(parts)}')



Wild-type pKa values (3IWL)
  CYS   12: 9.25
  CYS   15: 6.71

Summary statistics across all variants
  CYS   12: mean = 8.79, sd = 0.64, n = 2
  CYS   15: mean = 8.27, sd = 2.21, n = 2

Top 10 variants by |ΔpKa| (largest site difference)
  VAR00020                   CYS 12 8.34 (Δ-0.91); CYS 15 9.83 (Δ+3.12)
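The ranking and the summary statistics can be retraced by hand from the numbers printed above. The sketch below copies those pKa values into plain dictionaries; the helper `max_abs_delta` is introduced here for illustration and mirrors the ranking criterion of the batch script.

```python
import statistics as stats

# pKa values copied from the output above.
wt = {12: 9.25, 15: 6.71}
variants = {"VAR00020": {12: 8.34, 15: 9.83}}


def max_abs_delta(rec: dict[int, float], ref: dict[int, float]) -> float:
    """Largest absolute pKa shift over the sites present in the reference."""
    return max(abs(rec.get(site, ref[site]) - ref[site]) for site in ref)


# Rank variants by their largest shift at either site.
ranked = sorted(
    ((max_abs_delta(rec, wt), name) for name, rec in variants.items()),
    reverse=True,
)
for delta, name in ranked:
    print(f"{name}: max |ΔpKa| = {delta:.2f}")

# The summary statistics pool the wild type with the variants (n = 2 here).
for site in sorted(wt):
    values = [wt[site]] + [rec[site] for rec in variants.values() if site in rec]
    print(f"CYS {site}: mean = {stats.mean(values):.2f}, "
          f"sd = {stats.stdev(values):.2f}, n = {len(values)}")
```

Note that `n = 2` in the summary arises because the wild type is counted alongside the single variant, so the mean and s.d. describe all processed structures rather than the variants alone.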
