<a href="https://colab.research.google.com/github/VUzan-bio/DNA-Bacteria-JEPA/blob/main/dna_jepa_bacteria.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Cell 1: Check that a GPU is available (enable one under Runtime > Change runtime type)
import torch
print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Cell 2: Clone the repository
!git clone https://github.com/VUzan-bio/DNA-Bacteria-JEPA.git
%cd DNA-Bacteria-JEPA

# Cell 3: Install dependencies
!pip install -q pandas numpy scipy scikit-learn tqdm
# PyTorch is already installed in Colab with CUDA support

# Cell 4: Verify imports work
from src.cas12a.tokenizer import Cas12aTokenizer, TokenizerConfig
from src.cas12a.encoder import SparseTransformerEncoder
print("Imports successful")

GPU available: True
GPU: Tesla T4
Cloning into 'DNA-Bacteria-JEPA'...
remote: Enumerating objects: 56, done.
remote: Counting objects: 100% (56/56), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 56 (delta 3), reused 56 (delta 3), pack-reused 0 (from 0)
Receiving objects: 100% (56/56), 23.55 MiB | 9.18 MiB/s, done.
Resolving deltas: 100% (3/3), done.
Updating files: 100% (42/42), done.
/content/DNA-Bacteria-JEPA
Imports successful


In [None]:
# Run download script
!python scripts/download_bacterial_genomes.py \
    --output-dir data/raw/bacterial_genomes

# Extract pretraining sequences
!python scripts/extract_pretraining_sequences.py \
    --genome-dir data/raw/bacterial_genomes \
    --output-csv data/processed/pretrain_sequences.csv \
    --window-size 512 \
    --stride 256


NCBI Bacterial Genome Downloader
Output directory: /content/DNA-Bacteria-JEPA/data/raw/bacterial_genomes
Total accessions: 8

[1/8]
[downloading] Escherichia_coli_K12_MG1655 (NC_000913.3)
[ok] NC_000913.3_Escherichia_coli_K12_MG1655.fasta (4.49 MB)

[2/8]
[downloading] Bacillus_subtilis_168 (NC_000964.3)
[ok] NC_000964.3_Bacillus_subtilis_168.fasta (4.08 MB)

[3/8]
[downloading] Pseudomonas_putida_KT2440 (NC_002947.4)
[ok] NC_002947.4_Pseudomonas_putida_KT2440.fasta (5.98 MB)

[4/8]
[downloading] Streptomyces_coelicolor_A3 (NC_003888.3)
[ok] NC_003888.3_Streptomyces_coelicolor_A3.fasta (8.38 MB)

[5/8]
[downloading] Lactobacillus_acidophilus_NCFM (NC_006814.1)
[ok] NC_006814.1_Lactobacillus_acidophilus_NCFM.fasta (1.93 MB)

[6/8]
[downloading] Staphylococcus_aureus_NCTC8325 (NC_007795.1)
[ok] NC_007795.1_Staphylococcus_aureus_NCTC8325.fasta (2.73 MB)

[7/8]
[downloading] Escherichia_coli_536 (NC_008253.1)
[ok] NC_008253.1_Escherichia_coli_536.fasta (4.78 MB)

[8/8]
[downloading] Acinet
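
The extraction step above is called with `--window-size 512` and `--stride 256`, so consecutive windows overlap by half. A minimal sketch of that kind of sliding-window tiling follows. The function name `sliding_windows` is hypothetical, and the handling of trailing bases (dropped here) is an assumption; `scripts/extract_pretraining_sequences.py` may behave differently.

```python
from typing import Iterator, Tuple


def sliding_windows(
    sequence: str, window_size: int = 512, stride: int = 256
) -> Iterator[Tuple[int, str]]:
    """Yield (start, window) for every full-length window of the sequence.

    Assumption: trailing bases that do not fill a whole window are dropped.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    last_start = len(sequence) - window_size
    for start in range(0, last_start + 1, stride):
        yield start, sequence[start:start + window_size]


# Toy 1,600 bp sequence standing in for a genome record.
genome = "ACGT" * 400
windows = list(sliding_windows(genome))
starts = [start for start, _ in windows]
print(len(windows), starts)  # 5 [0, 256, 512, 768, 1024]
```

With a stride of half the window size, every base away from the sequence ends is covered by two windows, which roughly doubles the number of training sequences relative to non-overlapping tiling.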

In [27]:
!python scripts/01_pretrain_jepa.py \
  --epochs 10 \
  --max-samples 10000 \
  --lr 3e-5


Using device: cuda
GPU: Tesla T4
  _C._set_float32_matmul_precision(precision)
Vocab size: 9
Loaded 146223 sequences from /content/DNA-Bacteria-JEPA/data/processed/pretrain_sequences.csv
Using sampled subset: 10,000 sequences
Config: batch_size=128, grad_accum_steps=1, effective_batch_size=128, precision=bfloat16, num_workers=2

Starting pretraining for 10 epochs
Epoch 1/10: 100% 79/79 [00:45<00:00,  1.72it/s, loss=13.6983, sim=0.5492, var=1.6094, std=0.16, active=100.0%]
Epoch 1: loss=14.6539, sim=0.5997, var=1.7216, std=0.10, active=100.0%
Epoch 2/10: 100% 79/79 [00:44<00:00,  1.78it/s, loss=13.6160, sim=0.5415, var=1.5832, std=0.15, active=100.0%]
Epoch 2: loss=13.5368, sim=0.5411, var=1.5615, std=0.17, active=100.0%
Epoch 3/10: 100% 79/79 [00:44<00:00,  1.76it/s, loss=14.4258, sim=0.6103, var=1.4586, std=0.20, active=100.0%]
Epoch 3: loss=14.2416, sim=0.5882, var=1.4470, std=0.19, active=100.0%
Epoch 4/10: 100% 79/79 [00:44<00:00,  1.78it/s, loss=15.0191, sim=0.6653, var=1.4736, st
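
The `79/79` in the progress bars follows from the logged configuration: 10,000 sampled sequences at `batch_size=128` give 79 batches when the final partial batch is kept. The arithmetic can be checked with a short sketch. The variable names are illustrative, and keeping the last partial batch is an inference from the log, not something read from `scripts/01_pretrain_jepa.py`.

```python
import math

# Values taken from the training log above.
num_samples = 10_000
batch_size = 128
grad_accum_steps = 1

# Samples contributing to each optimizer update.
effective_batch_size = batch_size * grad_accum_steps

# Assumption: the last partial batch is kept rather than dropped,
# which is consistent with the 79 iterations shown per epoch.
batches_per_epoch = math.ceil(num_samples / batch_size)
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
last_batch_size = num_samples - (batches_per_epoch - 1) * batch_size

print(effective_batch_size)       # 128
print(batches_per_epoch)          # 79
print(optimizer_steps_per_epoch)  # 79
print(last_batch_size)            # 16
```

At 10 epochs this run performs 790 optimizer updates in total.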

In [28]:
%cd /content/DNA-Bacteria-JEPA

!python scripts/plot_pretrain_metrics.py
!python scripts/plot_embedding_tsne.py

/content/DNA-Bacteria-JEPA
Saved figure: /content/DNA-Bacteria-JEPA/figures/pretrain_metrics.png
Using device: cuda
Samples: 10,000
  output = torch._nested_tensor_from_mask(
Embeddings shape: (10000, 384)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 10000 samples in 0.001s...
[t-SNE] Computed neighbors for 10000 samples in 1.349s...
[t-SNE] Computed conditional probabilities for sample 1000 / 10000
[t-SNE] Computed conditional probabilities for sample 2000 / 10000
[t-SNE] Computed conditional probabilities for sample 3000 / 10000
[t-SNE] Computed conditional probabilities for sample 4000 / 10000
[t-SNE] Computed conditional probabilities for sample 5000 / 10000
[t-SNE] Computed conditional probabilities for sample 6000 / 10000
[t-SNE] Computed conditional probabilities for sample 7000 / 10000
[t-SNE] Computed conditional probabilities for sample 8000 / 10000
[t-SNE] Computed conditional probabilities for sample 9000 / 10000
[t-SNE] Computed conditional probabilities for s
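
The "Computing 91 nearest neighbors" line in the t-SNE log is consistent with scikit-learn's default perplexity of 30: its Barnes-Hut implementation searches roughly three times the perplexity plus one neighbors, capped by the sample count. A small sketch of that rule follows. The helper name `tsne_neighbor_count` is hypothetical, and the perplexity that `scripts/plot_embedding_tsne.py` actually passes is assumed, not confirmed.

```python
def tsne_neighbor_count(n_samples: int, perplexity: float) -> int:
    """Neighbors searched per point by Barnes-Hut t-SNE in scikit-learn."""
    return min(n_samples - 1, int(3.0 * perplexity + 1))


# 10,000 embeddings with the default perplexity of 30.
print(tsne_neighbor_count(10_000, 30.0))  # 91

# With very few samples the cap n_samples - 1 applies instead.
print(tsne_neighbor_count(50, 30.0))  # 49
```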

In [33]:
# Recompute embeddings for a 10,000-sequence sample, then correlate PC1 with GC content
from pathlib import Path
import numpy as np
import pandas as pd
import torch

from src.cas12a.encoder import SparseTransformerEncoder
from src.cas12a.tokenizer import Cas12aTokenizer, TokenizerConfig
from scripts.plot_embedding_tsne import (
    create_encoder_from_checkpoint,
    extract_embeddings,
    compute_gc_content,
)

data_path = Path("data/processed/pretrain_sequences.csv")
checkpoint_path = Path("checkpoints/pretrain/checkpoint_epoch10.pt")

df = pd.read_csv(data_path)
sampled = df.sample(n=10000, random_state=42).reset_index(drop=True)
sequences = sampled["sequence"].astype(str).tolist()
gc_values = np.array([compute_gc_content(s) for s in sequences])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = Cas12aTokenizer(TokenizerConfig())
ckpt = torch.load(checkpoint_path, map_location=device)
encoder = create_encoder_from_checkpoint(ckpt, vocab_size=tokenizer.vocab_size).to(device)

embeddings = extract_embeddings(
    encoder=encoder,
    tokenizer=tokenizer,
    sequences=sequences,
    device=device,
    batch_size=256,
    num_workers=0,
    max_tokens=128,
)

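# Project the embeddings onto their first principal component (PC1).
# The sign of a principal component is arbitrary, so only the magnitude
# of the correlation with GC content is meaningful.
from sklearn.decomposition import PCA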

pc1 = PCA(n_components=1).fit_transform(embeddings)
r = np.corrcoef(pc1.flatten(), gc_values)[0, 1]
print("Correlation(PC1, GC):", r)



Correlation(PC1, GC): -0.9242909648574494
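
A correlation of about -0.92 means the first principal component of the embeddings tracks GC content almost linearly. The negative sign carries no information, because flipping a principal axis flips the sign of the correlation. The sketch below illustrates both points with standard-library Python only. `gc_fraction` and `pearson_r` are hypothetical stand-ins, not the repository's `compute_gc_content` or NumPy's `corrcoef`, and the toy scores replace a real PC1 projection.

```python
import math
from typing import Sequence


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy sequences with GC fractions 0.0, 0.25, 0.5, 0.75 and 1.0.
toy_sequences = ["ATATATAT", "ATGCATAT", "GCGCATAT", "GCGCGCAT", "GCGCGCGC"]
gc = [gc_fraction(s) for s in toy_sequences]

# Stand-in for a PC1 projection that is a linear function of GC content.
scores = [2.0 * g - 1.0 for g in gc]

r_original = pearson_r(scores, gc)
r_flipped = pearson_r([-s for s in scores], gc)  # same axis, opposite sign
print(gc)
print(round(r_original, 6), round(r_flipped, 6))
```

Reporting `abs(r)` or `r ** 2` avoids reading meaning into the sign. Here `r ** 2` is roughly 0.85 for the value printed above.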
