# 🧬 UniProt Swiss-Prot Dataset Download (Enzyme vs Non-Enzyme)

This notebook only **downloads and stores** the UniProt Swiss-Prot dataset used for protein function classification.

**Goal**
- Build a binary dataset:
  - **Enzyme** (EC number exists) → `label = 1`
  - **Non-enzyme** (EC number missing) → `label = 0`

**Output**
- `uniprot_swissprot_enzyme_nonenzyme_FULL.csv`

This file will be reused in the modeling notebook (ProtBERT / ESM-2 embeddings + classifiers).

In [1]:
# ================================
# Imports & Setup
# ================================

import os
import time
import requests
import numpy as np
import pandas as pd
from io import StringIO

SEED = 42
np.random.seed(SEED)

print("✅ Environment ready.")

✅ Environment ready.


## 🌐 UniProt REST API Download (with Pagination)

UniProt returns results in **pages** (e.g., 500 rows per request).

We use a helper function `fetch_all()` that:
1. Sends the query to UniProt
2. Reads the TSV results into a DataFrame
3. Follows the `Link: rel="next"` header until no next page exists
4. Concatenates all pages into one DataFrame

This allows us to download **all matching Swiss-Prot proteins**, not just a limited sample.

In [3]:
# ==========================================
# UniProt REST API endpoint + pagination helper
# ==========================================

BASE = "https://rest.uniprot.org/uniprotkb/search"

def fetch_all(query, fields, batch_size=500, sleep_time=0.25, timeout=120):
    """
    Downloads ALL matching results using UniProt pagination.
    UniProt returns up to `batch_size` rows per page.
    """

    params = {
        "query": query,
        "format": "tsv",
        "fields": fields,
        "size": batch_size
    }

    session = requests.Session()
    url = BASE
    collected = []

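    while True:
        # Only the first request carries `params`; each "next" URL already
        # embeds the query, fields, size, and cursor.
        r = session.get(url, params=params if url == BASE else None, timeout=timeout)
        r.raise_for_status()

        # Each page is a TSV document with its own header row.
        collected.append(pd.read_csv(StringIO(r.text), sep="\t"))

        # The Link header points to the next page; if it is absent, we are done.
        link = r.headers.get("Link", "")
        if 'rel="next"' not in link:
            break

        # Header shape: <URL>; rel="next" -> keep the URL between the angle brackets.
        url = link.split(";")[0].strip("<>")
        time.sleep(sleep_time)  # brief pause between requests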

    return pd.concat(collected, ignore_index=True) if collected else pd.DataFrame()
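
`fetch_all()` takes the first entry of the `Link` header, which works when the header holds a single `rel="next"` link. As a more defensive alternative, here is a minimal sketch of a parser that also copes with several comma-separated entries. The helper name `next_page_url` and the example URLs are hypothetical and only serve as illustration. The `requests` library also exposes parsed links through `Response.links`, which may be simpler in practice.

```python
import re

# One entry of a Link header: <url> followed by zero or more "; param" parts.
_LINK_ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')

def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header, or None if absent."""
    for url, params in _LINK_ENTRY.findall(link_header or ""):
        if re.search(r'\brel\s*=\s*"?next\b', params):
            return url
    return None

# Hypothetical header values, shaped like a paginated search response.
single = '<https://rest.uniprot.org/uniprotkb/search?cursor=abc&size=500>; rel="next"'
multi = '<https://example.org/a>; rel="prev", <https://example.org/b>; rel="next"'

print(next_page_url(single))
print(next_page_url(multi))
print(next_page_url(""))
```

Returning `None` when no next link exists lets the download loop stop with a simple `if url is None: break`.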

## 🔎 Query Fields and Filters

We request these fields from UniProt:
- accession, id, protein_name, organism_name, length, ec, sequence, reviewed

We filter to improve quality and consistency:
- `reviewed:true` → Swiss-Prot only
- `fragment:false` → remove partial fragments
- `length:[50 TO 1024]` → keep sequences of 50 to 1,024 residues, a manageable length for transformer models

Then we create two queries:
- **Enzymes**: EC exists → `(ec:*)`
- **Non-enzymes**: EC missing → `(NOT ec:*)`

In [4]:
# ==========================================
# Define query fields and filters
# ==========================================

fields = ",".join([
    "accession",
    "id",
    "protein_name",
    "organism_name",
    "length",
    "ec",
    "sequence",
    "reviewed"
])

base = "(reviewed:true) AND (fragment:false) AND (length:[50 TO 1024])"

query_enzyme    = base + " AND (ec:*)"
query_nonenzyme = base + " AND (NOT ec:*)"

print("✅ Queries prepared.")

✅ Queries prepared.
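
To see what request these filters produce, the sketch below composes the first-page URL using only the standard library. The helper name `build_search_url` is hypothetical, and the field list is shortened for readability. `requests` performs an equivalent encoding when given `params`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://rest.uniprot.org/uniprotkb/search"

def build_search_url(query, fields, size=500):
    """Compose the first-page URL for a TSV search request."""
    params = {
        "query": query,
        "format": "tsv",
        "fields": ",".join(fields),
        "size": size,
    }
    return f"{BASE}?{urlencode(params)}"

filters = "(reviewed:true) AND (fragment:false) AND (length:[50 TO 1024])"
url = build_search_url(filters + " AND (ec:*)", ["accession", "ec", "sequence"])
print(url)

# Round trip: decoding the query string should give back the original filter.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Printing the URL is a cheap way to inspect a query in a browser before starting a full download.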


## ⬇️ Download and Create Labels

We download two datasets:
1. Enzymes (EC present) → `label = 1`
2. Non-enzymes (EC absent) → `label = 0`

Then we concatenate them into a single dataset `df_raw`.

Finally, we print shapes so we know how many proteins were downloaded for each class.

In [5]:
# Positive class: reviewed proteins with at least one EC number.
print("⬇️ Downloading enzymes...")
df_enzyme = fetch_all(query_enzyme, fields)
df_enzyme["label"] = 1
print("Enzymes:", df_enzyme.shape)

# Negative class: reviewed proteins without any EC number.
print("⬇️ Downloading non-enzymes...")
df_nonenzyme = fetch_all(query_nonenzyme, fields)
df_nonenzyme["label"] = 0
print("Non-enzymes:", df_nonenzyme.shape)

# Stack both classes into one frame with a fresh index.
df_raw = pd.concat([df_enzyme, df_nonenzyme], ignore_index=True)

print("✅ Combined shape:", df_raw.shape)
print("Label counts:\n", df_raw["label"].value_counts())
df_raw.head()

⬇️ Downloading enzymes...
Enzymes: (267064, 9)
⬇️ Downloading non-enzymes...
Non-enzymes: (270282, 9)
✅ Combined shape: (537346, 9)
Label counts:
 label
0    270282
1    267064
Name: count, dtype: int64


| | Entry | Entry Name | Protein names | Organism | Length | EC number | Sequence | Reviewed | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A0A1B0GTW7 | CIROP_HUMAN | Ciliated left-right organizer metallopeptidase... | Homo sapiens (Human) | 788 | 3.4.24.- | MLLLLLLLLLLPPLVLRVAASRCLHDETQKSVSLLRPPFSQLPSKS... | reviewed | 1 |
| 1 | A1L3X0 | ELOV7_HUMAN | Very long chain fatty acid elongase 7 (EC 2.3.... | Homo sapiens (Human) | 281 | 2.3.1.199 | MAFSDLTSRTVHLYDNWIKDADPRVEDWLLMSSPLPQTILLGFYVY... | reviewed | 1 |
| 2 | A2RUC4 | TYW5_HUMAN | tRNA wybutosine-synthesizing protein 5 (hTYW5)... | Homo sapiens (Human) | 315 | 1.14.11.42 | MAGQHLPVPRLEGVSREQFMQHLYPQRKPLVLEGIDLGPCTSKWTV... | reviewed | 1 |
| 3 | A5PLL7 | PEDS1_HUMAN | Plasmanylethanolamine desaturase 1 (EC 1.14.19... | Homo sapiens (Human) | 270 | 1.14.19.77 | MAGAENWPGQQLELDEDEASCCRWGAQHAGARELAALYSPGKRLQE... | reviewed | 1 |
| 4 | C9JRZ8 | AK1BF_HUMAN | Aldo-keto reductase family 1 member B15 (EC 1.... | Homo sapiens (Human) | 316 | 1.1.1.-; 1.1.1.216; 1.1.1.300; 1.1.1.54; 1.1.1.64 | MATFVELSTKAKMPIVGLGTWRSLLGKVKEAVKVAIDAEYRHIDCA... | reviewed | 1 |
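
The preview shows two things about the `EC number` column: a cell can hold several EC numbers separated by `; `, and some numbers are partial (for example `3.4.24.-`). The sketch below parses such cells and derives the label from them. The helper names are hypothetical, and treating only the `-` placeholder as "unspecified" is an assumption made for this illustration.

```python
def parse_ec_field(value):
    """Split an 'EC number' cell into a list of EC strings (empty if missing)."""
    if value is None:
        return []
    text = str(value).strip()
    # pandas represents a missing cell as NaN, which stringifies to "nan".
    if not text or text.lower() == "nan":
        return []
    return [part.strip() for part in text.split(";") if part.strip()]

def is_complete_ec(ec):
    """True if all four levels are given, i.e. no '-' placeholder appears."""
    levels = ec.split(".")
    return len(levels) == 4 and all(level != "-" for level in levels)

def label_from_ec(value):
    """Binary label used in this notebook: 1 if any EC number is present."""
    return 1 if parse_ec_field(value) else 0

cell = "1.1.1.-; 1.1.1.216; 1.1.1.300; 1.1.1.54; 1.1.1.64"
ecs = parse_ec_field(cell)
print(ecs)
print([ec for ec in ecs if is_complete_ec(ec)])
print(label_from_ec(cell), label_from_ec(float("nan")))
```

Note that under the `(ec:*)` query a protein with only a partial EC number still counts as an enzyme, as row 0 of the preview shows.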


## 💾 Save Dataset for Reuse

We save the combined dataset to a CSV file.

This file is the *input* to the next notebook where we:
- clean further (optional)
- sample/balance to a fixed size (optional)
- split train/test
- generate ProtBERT/ESM embeddings
- train classifiers

In [6]:
OUT_PATH = "uniprot_swissprot_enzyme_nonenzyme_FULL.csv"
df_raw.to_csv(OUT_PATH, index=False)

print("✅ Saved:", OUT_PATH)
print("Final shape:", df_raw.shape)

✅ Saved: uniprot_swissprot_enzyme_nonenzyme_FULL.csv
Final shape: (537346, 9)
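
Before reusing the file, it can help to confirm that the labels agree with the `EC number` column. The following is a minimal sketch using only the standard library. It assumes the column names shown in the preview (`Entry`, `EC number`, `label`) and that missing EC values were written as empty cells, which is the pandas `to_csv` default. The function names and the `X0000*` accessions in the demo are hypothetical.

```python
import csv
import io
from collections import Counter

def summarize_labels(rows):
    """Count labels and collect entries whose EC field disagrees with the label."""
    counts = Counter()
    mismatches = []
    for row in rows:
        label = int(row["label"])
        has_ec = bool((row.get("EC number") or "").strip())
        counts[label] += 1
        if has_ec != (label == 1):
            mismatches.append(row["Entry"])
    return counts, mismatches

def summarize_csv(path):
    """Run the same check on a CSV file on disk."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarize_labels(csv.DictReader(handle))

# Small in-memory demo; the last row is deliberately inconsistent.
sample = io.StringIO(
    "Entry,EC number,label\n"
    "A1L3X0,2.3.1.199,1\n"
    "X00001,,0\n"
    "X00002,,1\n"
)
counts, mismatches = summarize_labels(csv.DictReader(sample))
print(dict(counts), mismatches)
```

On the real file, `summarize_csv("uniprot_swissprot_enzyme_nonenzyme_FULL.csv")` should report the same class counts as the download cell and an empty mismatch list.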


# ✅ Next: Modeling and Transformer Benchmarking

The dataset generated in this notebook can now be used for **systematic evaluation of different protein transformer models**.

The dataset is well suited to this because it:
- Uses reviewed Swiss-Prot proteins only
- Separates enzymes from non-enzymes using EC annotation
- Applies the same length and fragment filters to both classes

This gives a consistent basis for comparing:

- ProtBERT
- ESM-2 (different checkpoints: t12, t30, etc.)
- Other protein language models
- Classical ML baselines

In the next notebook, we will:
1. Load this dataset
2. Perform cleaning and optional balancing
3. Generate transformer embeddings
4. Train and evaluate multiple classifiers
5. Compare performance across different transformer models

This setup enables fair and reproducible benchmarking of protein language models for function classification.
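
As a preview of the optional balancing and the train/test split, here is a minimal standard-library sketch that downsamples each class to the same size and splits each class separately. The function name `balanced_split`, the `per_class` size, and the 20% test fraction are assumptions for illustration, not choices fixed by this notebook. The seed reuses `SEED = 42` from the setup cell.

```python
import random
from collections import defaultdict

def balanced_split(labels, per_class, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with `per_class` rows sampled from each class."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # local generator keeps the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        members = by_label[label]
        if len(members) < per_class:
            raise ValueError(f"class {label} has only {len(members)} rows")
        chosen = rng.sample(members, per_class)  # downsample without replacement
        n_test = round(per_class * test_fraction)
        test_idx.extend(chosen[:n_test])   # sample order is random, so a slice
        train_idx.extend(chosen[n_test:])  # is a random per-class split
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 30 non-enzymes followed by 20 enzymes.
labels = [0] * 30 + [1] * 20
train, test = balanced_split(labels, per_class=10)
print(len(train), len(test))
```

Splitting per class keeps the class ratio identical in the train and test sets, so accuracy figures stay comparable across models.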