# Preprocessing

This notebook downloads UniProtKB/Swiss-Prot data and extracts subcellular localization annotations for each protein. Later steps use this data for feature extraction.

## Shell script

This script downloads the UniProtKB/Swiss-Prot flat file and decompresses it, which takes around 3 minutes. We later extract the sequences and rewrite them as FASTA ourselves; if you prefer, you can use the original FASTA file instead:

```bash
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
```
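
If you do go with the original FASTA file, its records can be read without decompressing it first. The following is a minimal standard-library sketch and not part of the pipeline below. `read_uniprot_fasta`, `_split_header` and `open_fasta` are hypothetical helper names, and the code assumes the usual UniProtKB header layout `>db|ACCESSION|ENTRY_NAME description`.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def read_uniprot_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (accession, entry_name, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record
            if header is not None:
                yield (*_split_header(header), "".join(chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield (*_split_header(header), "".join(chunks))


def _split_header(header: str) -> Tuple[str, str]:
    # assumed layout: "sp|P12345|NAME_SPECIES free text ..."
    ident = header.split(None, 1)[0]
    _, accession, entry_name = ident.split("|", 2)
    return accession, entry_name


def open_fasta(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTA file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)
```

Calling `open_fasta("uniprot_sprot.fasta.gz")` and passing the handle to `read_uniprot_fasta` then iterates over the compressed download directly.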

In [None]:
%%bash
set -euo pipefail
mkdir -p data/raw
cd data/raw

echo "Downloading UniProtKB/Swiss-Prot..."
wget -q ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz

In [None]:
%%bash
echo "Decompressing..."
gunzip -f data/raw/uniprot_sprot.dat.gz

## Python module

The following Python module contains the code that processes the UniProt data and extracts subcellular localization annotations for each protein.

In [None]:
import os
import re
from typing import Any, Dict, List

import pandas as pd
from Bio import SwissProt

In [None]:
INPUT_DAT = "data/raw/uniprot_sprot.dat"
INPUT_FASTA = "data/processed/nonredundant.fasta"

OUTPUT_ANN = "data/processed/annotations.csv"
OUTPUT_FASTA = "data/processed/filtered.fasta"

In [None]:
for path in [
    OUTPUT_ANN,
    OUTPUT_FASTA,
]:
    os.makedirs(os.path.dirname(path), exist_ok=True)

We manually exclude a few terms that indicate non-experimental evidence or are not specific enough to be useful for localization prediction. We also map some specific biological locations to more general terms to reduce the number of unique labels. Both lists are far from exhaustive and should be extended as needed.

In [None]:
_EXCLUDE_TERMS = {"probable", "potential", "by similarity", "prediction"}
ALLOWED_LOCS = {
    "Cytoplasm",
    "Nucleus",
    "Secreted",
    "Mitochondrion",
    "Periplasm",
    "Virion",
    "Plastid",
    "Membrane",
    "Peroxisome",
    "Endoplasmic Reticulum",
    "Golgi Apparatus",
    "Lysosome",
    "Vacuole",
    "Cell Projection",
    "Cell Surface",
    "Cell Junction",
    "Endosome",
}
_SYNONYM_MAP = {
    "host membrane": "Membrane",
    "host cell membrane": "Membrane",
    "cell membrane": "Membrane",
    "cell outer membrane": "Membrane",
    "plasma membrane": "Membrane",
    "mitochondrial matrix": "Mitochondrion",
    "mitochondrion matrix": "Mitochondrion",
    "endoplasmic reticulum lumen": "Endoplasmic Reticulum",
    "er lumen": "Endoplasmic Reticulum",
    "golgi": "Golgi Apparatus",
    # extend as needed
}

In [None]:
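def _clean_and_primary(text: str) -> str:
    """Return the canonical primary location in `text`, or an empty string if it is not allowed."""
    # drop the free-text note and any {evidence} or (parenthetical) qualifiers
    text = re.split(r"Note=", text, maxsplit=1)[0]
    text = re.sub(r"\{.*?\}|\(.*?\)", "", text)
    # keep only the first location, i.e. everything before the first "." or ";"
    part = re.split(r"[.;]", text, maxsplit=1)[0].strip()
    if not part:
        return ""
    # prefer an explicit synonym; otherwise title-case to match ALLOWED_LOCS
    canon = _SYNONYM_MAP.get(part.lower(), part.title())
    return canon if canon in ALLOWED_LOCS else ""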
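
To see what the cleaner does with typical comment fragments, here is a self-contained sketch. It repeats the same steps as `_clean_and_primary` but uses trimmed-down stand-ins for the lookup tables; `ALLOWED`, `SYNONYMS` and the example strings (including the evidence tag) are made up for illustration.

```python
import re

# trimmed-down stand-ins for ALLOWED_LOCS and _SYNONYM_MAP, for illustration only
ALLOWED = {"Cytoplasm", "Membrane", "Endoplasmic Reticulum"}
SYNONYMS = {
    "cell membrane": "Membrane",
    "endoplasmic reticulum lumen": "Endoplasmic Reticulum",
}


def clean_and_primary(text: str) -> str:
    """Same steps as _clean_and_primary, using the small tables above."""
    text = re.split(r"Note=", text, maxsplit=1)[0]
    text = re.sub(r"\{.*?\}|\(.*?\)", "", text)
    part = re.split(r"[.;]", text, maxsplit=1)[0].strip()
    if not part:
        return ""
    canon = SYNONYMS.get(part.lower(), part.title())
    return canon if canon in ALLOWED else ""


examples = [
    "Cytoplasm {ECO:0000269|PubMed:0000000}.",  # made-up evidence tag
    "Cell membrane; Single-pass membrane protein.",
    "Endoplasmic reticulum lumen.",
    "Cytoplasm, cytosol.",
    "Note=Detected in several tissues.",
]
for text in examples:
    print(repr(text), "->", repr(clean_and_primary(text)))
```

The fourth example comes back empty: `Cytoplasm, cytosol` title-cases to `Cytoplasm, Cytosol`, which is not in the allowed set. Sub-location strings like this are only kept if they get their own entry in the synonym map.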

We exclude multi-compartment entries for this iteration. Single-label classifiers can't handle proteins annotated to two or more compartments - splitting multi-labels naively can inflate class counts and introduce bias.

Some more considerations:

+ Performance benchmarks (speed, memory) become harder to interpret when outputs are vectors rather than one label.
+ Decision-support tools often struggle to map multi-compartment calls to single ACMG evidence codes.
+ Rare two-compartment combinations will have very few examples. This can undermine learning.

Future iterations can revisit multi-label approaches once the single-label pipeline provides a solid baseline. Clinically, mis-localization and multi-localization can be disease-relevant, so these entries are worth retaining for downstream pathway analysis.
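
The trade-off can be made concrete with a small sketch. It shows the single-label view kept in this iteration, a count of compartment combinations (to spot rare ones), and a multi-hot encoding that a later multi-label setup could use. The protein names and location lists are invented, and `dedupe` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple

# hypothetical per-protein location lists, as they might look before filtering
raw_locs: Dict[str, List[str]] = {
    "PROT_A": ["Nucleus"],
    "PROT_B": ["Nucleus", "Cytoplasm", "Nucleus"],
    "PROT_C": ["Membrane"],
    "PROT_D": ["Cytoplasm", "Nucleus"],
}


def dedupe(locs: List[str]) -> Tuple[str, ...]:
    """Drop repeated labels while keeping first-seen order."""
    return tuple(dict.fromkeys(locs))


# single-label view: only proteins with exactly one compartment survive
single = {
    name: dedupe(locs)[0]
    for name, locs in raw_locs.items()
    if len(dedupe(locs)) == 1
}

# how often each compartment combination occurs, ignoring order
combo_counts = Counter(frozenset(dedupe(locs)) for locs in raw_locs.values())

# multi-hot view: one 0/1 column per compartment, in sorted class order
classes = sorted({loc for locs in raw_locs.values() for loc in locs})
multi_hot = {
    name: [int(c in locs) for c in classes]
    for name, locs in raw_locs.items()
}
```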

In [None]:
def extract_protein_data(dat_file: str) -> List[Dict[str, Any]]:
    """
    Parse a UniProt .dat file and return only entries with exactly one
    experimentally-verified subcellular location from ALLOWED_LOCS.
    """
    results: List[Dict[str, Any]] = []

    try:
        handle = open(dat_file)
    except OSError as e:
        raise RuntimeError(f"Cannot open file {dat_file}: {e}") from e

    with handle:
        for rec in SwissProt.parse(handle):
            locs: List[str] = []

            # 1) try structured API
            if hasattr(rec, "subcellular_locations") and rec.subcellular_locations:
                for loc_tuple in rec.subcellular_locations:
                    loc = loc_tuple.location or ""
                    cleaned = _clean_and_primary(loc)
                    if cleaned:
                        locs.append(cleaned)
            else:
                # 2) fallback to scanning comments
                for comment in rec.comments:
                    if not comment.upper().startswith("SUBCELLULAR LOCATION:"):
                        continue
                    body = comment.split(":", 1)[1]
                    for piece in re.split(r"[;]", body):
                        cleaned = _clean_and_primary(piece)
                        if cleaned:
                            locs.append(cleaned)

            # exclude non-experimental evidence
            combined = " ".join(rec.comments).lower()
            if any(term in combined for term in _EXCLUDE_TERMS):
                continue

            # dedupe and require exactly one compartment
            unique = list(dict.fromkeys(locs))
            if len(unique) != 1:
                continue

            results.append(
                {
                    "entry_name": rec.entry_name,
                    "sequence": rec.sequence,
                    "localization": unique[0],
                }
            )

    return results
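
Note that the exclusion step in `extract_protein_data` joins all comments of a record, so an entry is also dropped when a term such as "by similarity" appears in an unrelated comment (for example FUNCTION). If that turns out to be stricter than intended, one option is to test only the SUBCELLULAR LOCATION comments. A sketch of that narrower check follows; `location_is_nonexperimental` is a hypothetical helper, not part of the module above.

```python
from typing import Iterable, List

# copy of _EXCLUDE_TERMS so the sketch runs on its own
EXCLUDE_TERMS = {"probable", "potential", "by similarity", "prediction"}


def location_is_nonexperimental(
    comments: List[str], terms: Iterable[str] = EXCLUDE_TERMS
) -> bool:
    """True if any SUBCELLULAR LOCATION comment mentions one of `terms`."""
    for comment in comments:
        if not comment.upper().startswith("SUBCELLULAR LOCATION:"):
            continue
        # compare case-insensitively against the text after the topic prefix
        body = comment.split(":", 1)[1].lower()
        if any(term in body for term in terms):
            return True
    return False
```

With this helper, a record whose FUNCTION comment says "By similarity" but whose location comment does not would be kept, whereas the current check discards it.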

In [None]:
if os.path.exists(OUTPUT_ANN):
    df = pd.read_csv(OUTPUT_ANN)
    print(f"Loaded existing annotations from {OUTPUT_ANN}")
else:
    print(f"No existing annotations found, extracting from {INPUT_DAT}")
    print("Extracting protein data...")
    protein_data = extract_protein_data(INPUT_DAT)

    df = pd.DataFrame(protein_data)
    print(f"Data shape: {df.shape}")

    df.to_csv(OUTPUT_ANN, index=False)
    print(f"\nWrote annotations to {OUTPUT_ANN}!")
    print(f"DataFrame saved with {len(df)} entries")

In [None]:
print("\nDataset statistics:")
print(f"Total entries: {len(df)}")
print(f"Unique localizations: {df['localization'].nunique()}")
print(f"Average sequence length: {df['sequence'].str.len().mean():.1f}")

print("\nTop 10 most common localizations:")
print(df["localization"].value_counts().head(10))

The following cell is for debugging and can be removed. It saves the full localization distribution to a text file, which helps to understand how the labels are distributed and to check that the filtering works as expected.

In [None]:
with open("data/localization_distribution.txt", "w") as f:
    f.write("Full localization distribution:\n")
    f.write(df["localization"].value_counts().to_string())
    print(
        "\nFull localization distribution saved to data/localization_distribution.txt"
    )

In [None]:
# optimization: convert the low-cardinality label column to categorical to save memory
# (entry_name is unique per row, so a categorical would not shrink it)
df['localization'] = df['localization'].astype('category')

print(f"Optimized memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

In [None]:
print("\nQuick data exploration:")
print("Sequence length distribution:")
print(df['sequence'].str.len().describe())

print("\nData quality checks:")
short_seqs = (df['sequence'].str.len() < 50).sum()
long_seqs = (df['sequence'].str.len() > 2000).sum()
print(f"Short sequences (<50 AA): {short_seqs}")
print(f"Long sequences (>2000 AA): {long_seqs}")

print("\nSample of processed data:")
print(df.head())

In [None]:
with open(OUTPUT_FASTA, "w") as out:
    for _, row in df.iterrows():
        header = f">{row['entry_name']}|{row['localization']}"
        seq = row["sequence"]
        out.write(f"{header}\n")

        for i in range(0, len(seq), 80):
            out.write(seq[i : i + 80] + "\n")

print(f"Wrote FASTA to {OUTPUT_FASTA}")
print(f"Generated {len(df)} sequences in FASTA format")
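
Because the label travels in the header as `>entry_name|localization`, it can be recovered from the FASTA file later, for example after clustering. A standard-library sketch of the round trip is below. `format_record` and `parse_labels` are hypothetical helpers, and the code assumes entry names contain no `|`.

```python
from typing import Dict, Iterable, Iterator


def format_record(
    entry_name: str, localization: str, seq: str, width: int = 80
) -> Iterator[str]:
    """Yield the FASTA lines for one record, wrapping the sequence at `width`."""
    yield f">{entry_name}|{localization}"
    for i in range(0, len(seq), width):
        yield seq[i : i + width]


def parse_labels(lines: Iterable[str]) -> Dict[str, str]:
    """Map entry_name -> localization from '>entry_name|localization' headers."""
    labels: Dict[str, str] = {}
    for line in lines:
        if line.startswith(">"):
            # split on the first '|' only, so the label itself may contain '|'
            entry_name, localization = line[1:].rstrip("\n").split("|", 1)
            labels[entry_name] = localization
    return labels
```

Some tools cut FASTA headers at the first whitespace, so labels such as `Endoplasmic Reticulum` may need a space-free encoding if the header has to survive them.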

We now use [CD-HIT](https://github.com/weizhongli/cdhit/) to cluster the filtered FASTA file written above at 90% sequence identity, a common way to reduce redundancy in protein datasets.

In [None]:
%%bash
INPUT_FASTA="data/processed/filtered.fasta"
OUTPUT_FASTA="data/processed/nonredundant.fasta"
THREADS=4

# run CD-HIT once, and only if it is available on PATH
if command -v cd-hit &> /dev/null; then
    echo "CD-HIT is installed, proceeding with clustering..."
    cd-hit -i "$INPUT_FASTA" \
           -o "$OUTPUT_FASTA" \
           -c 0.90 -n 5 \
           -M 16000 -T "$THREADS"
    echo "Clustering completed, nonredundant FASTA at $OUTPUT_FASTA"
else
    echo "CD-HIT is not installed. Please install it to perform clustering."
fi
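
CD-HIT also writes a cluster report next to the output file, with a `.clstr` suffix. To check how much redundancy was removed, the report can be grouped by cluster. The sketch below assumes the usual layout: a `>Cluster N` header followed by one line per member, with the name between `>` and `...`. `parse_clstr` is a hypothetical helper, and names in this file can be truncated depending on CD-HIT's description-length setting.

```python
import re
from typing import Dict, Iterable, List

_MEMBER = re.compile(r">(.+?)\.\.\.")  # sequence name between '>' and '...'


def parse_clstr(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group member names by cluster header (e.g. 'Cluster 0')."""
    clusters: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            # start a new cluster, keyed by its header without the '>'
            current = line[1:]
            clusters[current] = []
        elif line and current is not None:
            match = _MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters
```

Passing an open handle on `data/processed/nonredundant.fasta.clstr` to `parse_clstr` gives a dictionary whose list lengths are the cluster sizes.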