# Environment Setup & Checks

In [11]:
# %%capture
!nvidia-smi

In [2]:
import time
time.sleep(0.1)

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

### Git Configuration (run only if you plan to use Git)

In [12]:
# %%capture
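# Skip this cell if you are not cloning from Git.
# Create the target folder. (A `cd` inside a `!` command runs in a subshell and
# does not persist, so the directory change is done with %cd below.)
!mkdir -p ./Ubi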

# clone
!git clone https://github.com/ashkanfard132/Ubi.git

# enter the repo
%cd Ubi

In [3]:
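# `unzip` is a system utility, not a pip package. It is usually preinstalled;
# if it is missing, install it with the system package manager, for example:
# !apt-get -qq install -y unzip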


In [4]:
!unzip -q ./Ubi-data.zip -d ./data
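
If the `unzip` command is not available on your machine, the same extraction can be done from Python with the standard library. This is a minimal sketch; the helper name `extract_archive` is hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Calling `extract_archive("./Ubi-data.zip", "./data")` should leave the same files under `./data` as the `unzip` cell above.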


### Set the project location and install the library.

In [1]:
# Skip this cell if you cloned the repo above.
# %%capture
%cd ./Downloads

/user/HS402/aa07623/Downloads


In [8]:
# %%capture
!pip install -r requirements.txt


# Machine Learning

## Options for the **ML pipeline**

**On Google Colab, use `!python` instead of `!python3`.**

**Model Config**
`--model {rf, xgb, ada, cat, svm, lreg}`

* **rf** – Random Forest
* **xgb** – XGBoost
* **ada** – AdaBoost
* **cat** – CatBoost
* **svm** – Support Vector Machine
* **lreg** – Logistic Regression
  *(Classical, non-neural models; CUDA/epochs/lr settings are ignored. Optional SVM grid: `--svm-grid` with `--svm-kernels`, `--svm-C`, `--svm-gamma`.)*

**Dataset**

* Plant 1: `--fasta <path>` and `--excel <path>` (required)
* Plant 2 (optional merge): `--fasta2 <path>` and `--excel2 <path>`
* Optional PSSM folders aligned to each FASTA’s order: `--pssm1 <dir>` (plant 1), `--pssm2 <dir>` (plant 2)

**Features** (choose one or many; space-separated after `--features`)

* `aac` – amino-acid composition (windowed)
* `dpc` – dipeptide composition
* `tpc` – tripeptide composition
* `pssm` – position-specific scoring matrix features (requires `--pssm1/--pssm2`)
* `physicochem` – physicochemical descriptors
* `binary` – one-hot window encoding
  Default: `aac dpc tpc`
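
To make two of these feature names concrete, here is a minimal sketch of how `aac` and `binary` could be computed for one window. It is an illustration under assumptions, not the repository's code: the residue ordering, the function names, and the treatment of padding or unknown residues are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is an assumption)


def aac(window):
    """Fraction of each standard amino acid in the window (20 values)."""
    residues = [r for r in window if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on all-padding windows
    return [residues.count(a) / total for a in AMINO_ACIDS]


def binary(window):
    """One-hot encode each position; unknown or padding residues become all zeros."""
    return [1.0 if r == a else 0.0 for r in window for a in AMINO_ACIDS]
```

With this layout, `aac` always yields 20 values while `binary` yields `20 * window_size` values, which is why `binary` grows quickly with larger windows.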

**Context window**
`--window_size` (odd integer, 5–35). Typical: **21**; smaller is faster, larger gives more context.
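
As a rough illustration of what a centered context window means, the sketch below cuts a fixed-size window around a candidate residue and pads positions that fall off either end of the protein. The function name and the padding character `X` are assumptions; the repository may handle sequence ends differently.

```python
def extract_window(sequence, center, window_size, pad="X"):
    """Return window_size residues centered on index `center` (0-based), padded at the ends."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the site sits in the middle")
    half = window_size // 2
    return "".join(
        sequence[i] if 0 <= i < len(sequence) else pad
        for i in range(center - half, center + half + 1)
    )
```

For example, `extract_window("MKTAYIAK", 1, 5)` gives `"XMKTA"`: the lysine at index 1 is in the middle and one padding character fills the missing left neighbor.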

**Sampling (class imbalance)**
`--sampling {over, under, smote, smotee, none} [--sampling_ratios ...]`
You can chain steps (e.g., `--sampling over smote --sampling_ratios 1.0 0.5`). Default: `none`.
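
The idea behind `over` can be sketched with the standard library alone. This is not the project's implementation: it assumes the positive class (label 1) is the minority and that a ratio means the target positives-to-negatives ratio after resampling, which may differ from how `--sampling_ratios` is interpreted in `main.py`.

```python
import random


def random_oversample(samples, labels, ratio=1.0, seed=42):
    """Duplicate random positives until n_pos / n_neg reaches `ratio` (assumed meaning)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos:
        return list(samples), list(labels)  # nothing to duplicate
    target = int(ratio * len(neg))
    extra = [rng.choice(pos) for _ in range(max(0, target - len(pos)))]
    keep = pos + neg + extra
    rng.shuffle(keep)  # avoid long runs of a single class
    return [samples[i] for i in keep], [labels[i] for i in keep]
```

Resampling should be applied to the training portion only, so that validation and test metrics reflect the natural class distribution.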

**Cross-validation**
`--cv 1` → single 80/20 train/test split, with a small validation set held out from the training portion.
`--cv k` → k-fold CV with averaged metrics.
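
For imbalanced data, folds are usually built so that each one keeps roughly the same class proportions. The sketch below shows one way to do that with the standard library; whether `main.py` stratifies its folds is an assumption, and `stratified_folds` is a hypothetical helper.

```python
import random


def stratified_folds(labels, k, seed=42):
    """Return k lists of test indices with roughly equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for n, i in enumerate(idx):
            folds[n % k].append(i)  # deal indices round-robin across folds
    return folds
```

Each fold serves once as the test set while the remaining indices are used for training, and the per-fold metrics are then averaged.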

**Decision threshold (ML only)**
`--best_threshold` → learns the decision threshold on the **validation** set (optimizing F2) and applies it to the test set.
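
F2 weights recall more heavily than precision, which suits rare positive sites. A minimal sketch of an F2-based threshold search follows; the candidate grid of 0.01 to 0.99 and the function names are assumptions, not the project's actual search.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta score from binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(y_true, scores, beta=2.0):
    """Pick the threshold (from an assumed 0.01..0.99 grid) that maximizes F-beta."""
    candidates = [i / 100 for i in range(1, 100)]
    return max(
        candidates,
        key=lambda t: fbeta(y_true, [int(s >= t) for s in scores], beta),
    )
```

The threshold is chosen from validation scores only and then reused unchanged on the test set, so the test metrics are not tuned on test data.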

**Plots**
`--plots distribution confusion roc pr [feature_distribution feature_correlation feature_scatter boxplot violinplot groupwise_pca overlays]`
Saved under `results/`.

**W\&B (optional)**
`--wandb` (plus `--wandb_project`, `--wandb_entity` or env `WANDB_ENTITY`). Omit `--wandb` to keep it off.

**Reproducibility**
`--seed 42` (default).

**Outputs** (in `results/`)

* `results_{model}_{timestamp}.json` (natural test)
* `results_{model}_{timestamp}_bal.json` (balanced test)
* `{model}_{features}_{window}_{timestamp}_metrics.xlsx`
* Requested plot PNGs
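
To pick up the most recent run programmatically, something like the sketch below could be used. It relies only on the file-name patterns listed above; the helper name is hypothetical, the JSON schema is not assumed, and "latest" is decided by file modification time because the timestamp format is not documented here.

```python
import json
from pathlib import Path


def load_latest_results(model, results_dir="results", balanced=False):
    """Load the newest results JSON for `model` (natural test by default)."""
    files = [
        p
        for p in Path(results_dir).glob(f"results_{model}_*.json")
        if p.name.endswith("_bal.json") == balanced  # keep only the requested variant
    ]
    if not files:
        raise FileNotFoundError(f"no results for model {model!r} in {results_dir}")
    latest = max(files, key=lambda p: p.stat().st_mtime)
    with latest.open() as fh:
        return json.load(fh)
```

For example, `load_latest_results("rf")` would read the natural-test file and `load_latest_results("rf", balanced=True)` the `_bal` one.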


> Notes: Ensure the PSSM directories match the FASTA order. For SVM with quick tuning, add:
> `--svm-grid --svm-kernels rbf,linear --svm-C 0.1,1,5,10 --svm-gamma scale,auto,0.001,0.0001`.
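
The comma-separated values above mix numbers with keywords such as `scale`. A minimal sketch of how such strings could be turned into a parameter grid is shown below; the parsing helpers are hypothetical, and only the resulting dictionary keys (`kernel`, `C`, `gamma`) are known scikit-learn `SVC` parameters.

```python
def parse_grid_values(text):
    """Split a comma-separated string, converting numeric tokens to float."""
    values = []
    for token in text.split(","):
        token = token.strip()
        try:
            values.append(float(token))
        except ValueError:
            values.append(token)  # keep keywords such as "scale" or "rbf" as strings
    return values


def build_svm_grid(kernels, c_values, gammas):
    """Assemble a dict usable as a param_grid for sklearn's GridSearchCV with SVC."""
    return {
        "kernel": parse_grid_values(kernels),
        "C": parse_grid_values(c_values),
        "gamma": parse_grid_values(gammas),
    }
```

With the values from the note, the grid has 2 kernels, 4 values of `C`, and 4 values of `gamma`, so a full search fits 32 combinations per CV fold.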


In [19]:
# %%capture
!python3 -u main.py \
  --model rf \
  --fasta "./Triticum aestivum Plant1/Protein Sequences Triticum aestive  40% CD-Hit.fasta" \
  --excel "./Triticum aestivum Plant1/Triticum aestive -UB-Sites.xlsx" \
  --fasta2 "./Zea Mays Plant 2/CD-Hit Protein Sequences Zea mays plant-  40% CD-Hit.fasta" \
  --excel2 "./Zea Mays Plant 2/Zea mays -UB-Sites.xlsx" \
  --pssm1 "./Triticum aestivum Plant1/PSSM Fetuers for Triticum aestive Plant" \
  --pssm2 "./Zea Mays Plant 2/PSSM feature for Zea mays Plant" \
  --features binary \
  --window_size 5 \
  --device cuda \
  --sampling none \
  --sampling_ratios 1 \
  --cv 1 \
  --plots distribution confusion roc pr

In [10]:
# %%capture
!python3 -u main.py \
  --model rf \
  --fasta "./data/Data/Triticum aestivum Plant1/Protein Sequences Triticum aestive  40% CD-Hit.fasta" \
  --excel "./data/Data/Triticum aestivum Plant1/Triticum aestive -UB-Sites.xlsx" \
  --pssm1 "./data/Data/Triticum aestivum Plant1/PSSM Fetuers for Triticum aestive Plant" \
  --features binary \
  --window_size 5 \
  --device cuda \
  --sampling over \
  --sampling_ratios 1 \
  --cv 1 \
  --plots distribution confusion roc pr

In [None]:
# %%capture
!python3 -u main.py \
  --model rf \
  --fasta "./Zea Mays Plant 2/CD-Hit Protein Sequences Zea mays plant-  40% CD-Hit.fasta" \
  --excel "./Zea Mays Plant 2/Zea mays -UB-Sites.xlsx" \
  --pssm1 "./Zea Mays Plant 2/PSSM feature for Zea mays Plant" \
  --features binary \
  --window_size 5 \
  --device cuda \
  --sampling none \
  --sampling_ratios 1 \
  --cv 1 \
  --plots distribution confusion roc pr

# Deep Learning

## Options for the **Deep Learning pipeline**

**On Google Colab, use `!python` instead of `!python3`.**

**Model Config**
`--model {mlp, cnn, lstm, gru, transformer, prot_bert, distil_prot_bert, esm2_t6_8m}`

* **mlp** — Feed-forward net over **feature vectors** (AAC/DPC/TPC/etc.).
* **cnn / lstm / gru / transformer** — Sequence models over **AA indices** (no handcrafted features).
* **prot\_bert / distil\_prot\_bert** — HuggingFace protein BERTs (tokenized AA text).
* **esm2\_t6\_8m** — FAIR ESM2 (tiny, tokenized AA text).

**Dataset**

* Plant 1 (required): `--fasta <path>` and `--excel <path>`
* Plant 2 (optional merge): `--fasta2 <path>` and `--excel2 <path>`
* Optional PSSM: `--pssm1 <dir>` (Plant 1), `--pssm2 <dir>` (Plant 2). Must follow FASTA order.

**Features (when they matter)**

* **mlp:** uses `--features` (space-separated) e.g. `aac dpc tpc pssm physicochem binary`. PSSM dirs are consumed here.
* **cnn/lstm/gru/transformer/prot\_bert/distil\_prot\_bert/esm2\_t6\_8m:** ignore handcrafted `--features` and PSSM; they learn **directly from sequence**.

**Context window / length**
`--window_size` (odd integer, 5–35)

* **cnn/lstm/gru/transformer:** sequences are integer-encoded from a centered window of this size.
* **prot\_bert / distil\_prot\_bert / esm2\_t6\_8m:** used as `max_length` for tokenization. Larger windows = more memory.
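
The two input styles can be illustrated as follows. This is a sketch under assumptions: the index mapping (0 reserved for padding or unknown residues) and the helper names are hypothetical. The space-separated form reflects how ProtBert-style tokenizers expect residues, but the repository's exact preprocessing is not shown here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding/unknown residues (an assumption of this sketch).
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_window(window):
    """Map each residue to an integer index for cnn/lstm/gru/transformer inputs."""
    return [AA_TO_INDEX.get(res, 0) for res in window]


def to_bert_text(window):
    """Space-separate residues, the text form ProtBert-style tokenizers expect."""
    return " ".join(window)
```

Note that tokenizers for pretrained encoders typically add special tokens, so a `max_length` equal to the window size may truncate residues unless the code accounts for them.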

**Training hyperparameters (DL only)**

* Epochs & batches: `--epochs`, `--batch_size`, `--batch_size_val`
* Device: `--device {cpu|cuda}` (CUDA recommended)
* Loss: `--loss {bce,mse,mae,focal}` + optional `--pos_weight` for class imbalance (BCE)
* Optimizer: `--optim {adam,sgd,rmsprop,adamw,amsgrad}`, `--lr`, `--weight_decay`
* Scheduler: `--sched {step,exp,plateau,cosine,none}` with `--step_size`, `--gamma`, `--t_max` (cosine)
* Regularization: `--dropout` (typical: 0.3–0.5 MLP/CNN, 0.2–0.3 LSTM, 0.1–0.3 Transformer)
* Pretrained encoders: `--freeze_pretrained` (feature-extraction mode for ProtBert/ESM2)
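
To show what `--pos_weight` does to a BCE loss, here is a small framework-free sketch. The weighting formula mirrors the usual definition (the positive term is multiplied by the weight); the heuristic of setting the weight to the negatives-to-positives ratio is a common choice, not something the source prescribes, and both function names are hypothetical.

```python
import math


def suggest_pos_weight(labels):
    """Common heuristic: number of negatives divided by number of positives."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos if n_pos else 1.0


def weighted_bce(prob, target, pos_weight=1.0, eps=1e-7):
    """Binary cross-entropy for one prediction, with the positive term up-weighted."""
    p = min(max(prob, eps), 1 - eps)  # clamp to avoid log(0)
    return -(pos_weight * target * math.log(p) + (1 - target) * math.log(1 - p))
```

With `pos_weight 1` (as in the cells below) the loss is plain BCE; larger values penalize missed positives more strongly and can be combined with, or used instead of, resampling.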

**Class imbalance options**

* `--sampling {over, under, smote, smotee, none}` with `--sampling_ratios ...`

  * **For sequence models (cnn/lstm/gru/transformer/prot\_bert/esm2):** prefer **over/under** only. SMOTE/SMOTEE synthesize feature vectors (not sequences) and aren’t meaningful here.
  * **For mlp:** all sampling modes are applicable.

**Cross-validation**

* `--cv 1` → single 80/20 train/test split, with a small validation set held out from the training portion.
* `--cv k` → k-fold with averaged metrics & optional per-fold logs.

**Decision threshold**

* `--best_threshold` will learn a threshold on the **validation set** (F2-optimal) and apply it to test.
  *Note: although the help string says “ML only”, the code supports this for DL models too.*

**Plots**
`--plots distribution curves confusion roc pr [feature_distribution feature_correlation feature_scatter boxplot violinplot groupwise_pca overlays]`

* `curves` shows train/val metrics per epoch (DL only).
* Images and metrics are written to `results/`.

**W\&B (optional)**

* Add `--wandb` (+ `--wandb_project`, `--wandb_entity` or env `WANDB_ENTITY`) to log runs. Omit `--wandb` to keep it off.

**Outputs (in `results/`)**

* `results_{model}_{timestamp}.json` — natural test metrics
* `results_{model}_{timestamp}_bal.json` — balanced test metrics
* `{model}_{features_or_seq}_{window}_{timestamp}_metrics.xlsx` — Excel summary
* Requested plot PNGs; training curves if `curves` is selected


> **Notes:**
> • For transformer/BERT/ESM models, monitor GPU memory; reduce `--batch_size` or `--window_size` if you hit OOM.
> • If you pass PSSM dirs with sequence models, they’ll be ignored (only MLP/feature routes use them).
> • Set `--seed` for reproducible splits and training behavior.
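
What a seed typically controls can be sketched as below. This is a generic pattern rather than the code behind `--seed`; the `set_seed` helper is hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random


def set_seed(seed=42):
    """Seed the common random number generators used in a training run."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects Python processes started later
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed
```

Even with fixed seeds, some GPU kernels are non-deterministic, so small run-to-run differences in DL metrics are possible.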


In [7]:
# %%capture
!python3 -u main.py \
  --model cnn \
  --fasta "./Triticum aestivum Plant1/Protein Sequences Triticum aestive  40% CD-Hit.fasta" \
  --excel "./Triticum aestivum Plant1/Triticum aestive -UB-Sites.xlsx" \
  --fasta2 "./Zea Mays Plant 2/CD-Hit Protein Sequences Zea mays plant-  40% CD-Hit.fasta" \
  --excel2 "./Zea Mays Plant 2/Zea mays -UB-Sites.xlsx" \
  --pssm1 "./Triticum aestivum Plant1/PSSM Fetuers for Triticum aestive Plant" \
  --pssm2 "./Zea Mays Plant 2/PSSM feature for Zea mays Plant" \
  --window_size 17 \
  --epochs 10 \
  --seed 42 \
  --batch_size 32 \
  --batch_size_val 16 \
  --lr 5e-5 \
  --dropout 0.2 \
  --freeze_pretrained \
  --loss bce \
  --pos_weight 1  \
  --optim adamw \
  --weight_decay 0.01 \
  --sched none \
  --step_size 4 \
  --t_max 10 \
  --gamma 0.1 \
  --device cuda \
  --sampling over \
  --sampling_ratios 95 \
  --cv 1 \
  --plots distribution confusion roc pr

In [6]:
# %%capture
!python3 -u main.py \
  --model cnn \
  --fasta "./Triticum aestivum Plant1/Protein Sequences Triticum aestive  40% CD-Hit.fasta" \
  --excel "./Triticum aestivum Plant1/Triticum aestive -UB-Sites.xlsx" \
  --pssm1 "./Triticum aestivum Plant1/PSSM Fetuers for Triticum aestive Plant" \
  --window_size 17 \
  --epochs 10 \
  --seed 42 \
  --batch_size 32 \
  --batch_size_val 16 \
  --lr 5e-5 \
  --dropout 0.2 \
  --freeze_pretrained \
  --loss bce \
  --pos_weight 1  \
  --optim adamw \
  --weight_decay 0.01 \
  --sched none \
  --step_size 4 \
  --t_max 10 \
  --gamma 0.1 \
  --device cuda \
  --sampling none \
  --sampling_ratios 1 \
  --cv 1 \
  --plots distribution confusion roc pr

In [13]:
# %%capture
!python3 -u main.py \
  --model prot_bert \
  --fasta "./Zea Mays Plant 2/CD-Hit Protein Sequences Zea mays plant-  40% CD-Hit.fasta" \
  --excel "./Zea Mays Plant 2/Zea mays -UB-Sites.xlsx" \
  --pssm1 "./Zea Mays Plant 2/PSSM feature for Zea mays Plant" \
  --window_size 5 \
  --epochs 10 \
  --seed 42 \
  --batch_size 32 \
  --batch_size_val 16 \
  --lr 5e-5 \
  --dropout 0.2 \
  --freeze_pretrained \
  --loss bce \
  --pos_weight 1  \
  --optim adamw \
  --weight_decay 0.01 \
  --sched none \
  --step_size 5 \
  --t_max 10 \
  --gamma 0.1 \
  --device cuda \
  --sampling none \
  --sampling_ratios 1 \
  --cv 1 \
  --plots distribution confusion roc pr