# DeepTopic

This sample notebook shows how to train a DeepTopic model.

In [1]:
import enhancerai as enhai

We can use the function {func}`enhancerai.import_topics` to import data into an {class}`anndata.AnnData` object,
with the imported topics as the `AnnData.obs` and the consensus peak regions as the `AnnData.var`.

In [2]:
adata = enhai.import_topics(
    topics_folder="../../tests/data/test_topics/",
    peaks_file="../../tests/data/test.peaks.bed",
    compress=True,
    # topics_subset=["topic_1", "topic_2"], # optional subset of topics to import
)
adata

AnnData object with n_obs × n_vars = 3 × 23186
    obs: 'file_path', 'n_open_regions'
    var: 'n_topics', 'chr', 'start', 'end'

The `import_topics` function also adds a couple of columns with variables of interest to your `AnnData.obs` and `AnnData.var` (`AnnData.obs.n_open_regions` and `AnnData.var.n_topics`), which you can use to inspect your data and get a feel for it.
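To make this layout concrete, the sketch below builds a small binary topic-by-region matrix in plain Python and derives the two summary columns from it. The region names and topic contents are invented for illustration, and this is not how `import_topics` is implemented. It only shows the assumed meaning of `n_open_regions` (open regions per topic) and `n_topics` (topics in which a region is open).

```python
# Hypothetical consensus regions (columns, like AnnData.var)
regions = ["chr1:100-600", "chr1:900-1400", "chr2:50-550", "chrX:10-510"]

# Hypothetical topics (rows, like AnnData.obs), each listing its open regions
topics = {
    "topic_1": {"chr1:100-600", "chr2:50-550"},
    "topic_2": {"chr1:100-600"},
    "topic_3": {"chr1:900-1400", "chr2:50-550", "chrX:10-510"},
}

# Binary matrix: 1 if the region is open in the topic, else 0
matrix = {
    name: [1 if region in open_regions else 0 for region in regions]
    for name, open_regions in topics.items()
}

# Row sums: number of open regions per topic
n_open_regions = {name: sum(row) for name, row in matrix.items()}

# Column sums: number of topics in which each region is open
n_topics = {
    region: sum(row[i] for row in matrix.values())
    for i, region in enumerate(regions)
}

print(n_open_regions)
print(n_topics)
```

Regions with a high `n_topics` are open in many topics and are therefore less specific to any single one.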

To model the relationship between regions and topics, we need to add the DNA sequences to our `AnnData` object. We can do this with {func}`enhancerai.pp.add_dna_sequence`, pointing it to a local FASTA file through the `fasta_path` argument (for example `fasta_path="/path/to/local.fasta"`). Alternatively, we can simply provide the name of a genome, in which case genomepy is used to download the reference genome. The DNA sequences are stored in `AnnData.varm`.

In [3]:
# !pip install genomepy  # If you want to add the DNA sequences using genomepy
enhai.pp.add_dna_sequence(adata, genome_name="mm10", genome_dir="~/genomepy/")
adata.varm["dna_sequence"]

100%|██████████| 20/20 [00:01<00:00, 13.90it/s]


| region | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 2104 | 2105 | 2106 | 2107 | 2108 | 2109 | 2110 | 2111 | 2112 | 2113 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr1:194208032-194208532 | C | A | C | A | C | G | T | C | C | A | ... | A | A | T | G | C | A | G | C | T | A |
| chr1:92202766-92203266 | G | A | A | A | T | T | A | T | A | T | ... | T | G | A | A | T | A | A | A | C | A |
| chr1:92298990-92299490 | C | G | T | A | G | A | A | A | G | G | ... | C | A | G | C | A | G | C | A | C | C |
| chr1:3406052-3406552 | G | A | C | C | C | A | T | G | A | A | ... | T | A | T | T | G | C | C | C | T | G |
| chr1:183669567-183670067 | G | C | C | A | T | C | A | G | G | G | ... | T | T | T | A | A | A | G | A | C | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| chrX:10603665-10604165 | A | G | C | C | C | A | G | G | C | C | ... | C | A | G | C | T | A | T | G | T | A |
| chrX:169798868-169799368 | G | A | A | A | T | A | T | A | G | T | ... | T | T | G | G | T | A | T | T | T | T |
| chrX:93282061-93282561 | G | T | C | C | A | G | C | A | A | T | ... | C | A | C | C | T | C | C | T | C | C |
| chrX:38730592-38731092 | A | C | A | T | A | G | T | T | G | C | ... | A | T | G | A | G | A | A | C | T | A |
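Sequence models typically consume a numeric encoding of these letters rather than the characters themselves. The following is a minimal sketch of one-hot encoding, assuming an `A, C, G, T` channel order and an all-zero row for any other base. The function name `one_hot_encode` is hypothetical, and the encoding used internally by EnhancerAI may differ.

```python
# Assumed channel order; libraries differ, so check before mixing encoders
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return a list of [A, C, G, T] indicator rows, one per base.

    Bases outside ACGT (for example N) become an all-zero row.
    """
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1
        encoded.append(row)
    return encoded


# First ten bases of the first region shown above
encoded = one_hot_encode("CACACGTCCA")
print(encoded[:3])
```

A full region of 2114 bases would yield a 2114 × 4 matrix under this scheme.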


To train a model, we need to add a *split* column to our dataset, which we can do using {func}`enhancerai.pp.train_val_test_split`.
We can set a `random_state` to ensure the data is split the same way in future runs when `shuffle=True` (the default).

In [4]:
# We can split randomly on the regions
enhai.pp.train_val_test_split(
    adata, type="random", val_size=0.1, test_size=0.1, random_state=42
)

# Or choose the chromosomes for the validation and test sets
# (this second call overwrites the random split assigned above)
enhai.pp.train_val_test_split(
    adata, type="chr", chr_val=["chr4", "chrX"], chr_test=["chr2", "chr3"]
)

print(adata.var["split"].value_counts())
adata.var

split
train    18289
test      3104
val       1793
Name: count, dtype: int64


| region | n_topics | chr | start | end | split |
|---|---|---|---|---|---|
| chr1:194208032-194208532 | 1 | chr1 | 194208032 | 194208532 | train |
| chr1:92202766-92203266 | 1 | chr1 | 92202766 | 92203266 | train |
| chr1:92298990-92299490 | 1 | chr1 | 92298990 | 92299490 | train |
| chr1:3406052-3406552 | 1 | chr1 | 3406052 | 3406552 | train |
| chr1:183669567-183670067 | 1 | chr1 | 183669567 | 183670067 | train |
| ... | ... | ... | ... | ... | ... |
| chrX:10603665-10604165 | 1 | chrX | 10603665 | 10604165 | val |
| chrX:169798868-169799368 | 1 | chrX | 169798868 | 169799368 | val |
| chrX:93282061-93282561 | 1 | chrX | 93282061 | 93282561 | val |
| chrX:38730592-38731092 | 1 | chrX | 38730592 | 38731092 | val |
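The two splitting strategies can be sketched in plain Python as follows. Both helper functions are hypothetical illustrations of the idea, not the EnhancerAI implementation, and the rounding of split sizes is an assumption.

```python
import random


def chromosome_split(chromosomes, chr_val, chr_test):
    """Label each region by the chromosome it lies on."""
    labels = []
    for chrom in chromosomes:
        if chrom in chr_val:
            labels.append("val")
        elif chrom in chr_test:
            labels.append("test")
        else:
            labels.append("train")
    return labels


def random_split(n_regions, val_size, test_size, random_state):
    """Label regions at random; a fixed seed reproduces the same split."""
    n_val = round(n_regions * val_size)
    n_test = round(n_regions * test_size)
    labels = ["val"] * n_val + ["test"] * n_test
    labels += ["train"] * (n_regions - n_val - n_test)
    random.Random(random_state).shuffle(labels)
    return labels


chroms = ["chr1", "chr2", "chr3", "chr4", "chrX", "chr1"]
print(chromosome_split(chroms, chr_val={"chr4", "chrX"}, chr_test={"chr2", "chr3"}))
print(random_split(10, val_size=0.1, test_size=0.1, random_state=42))
```

Holding out whole chromosomes keeps neighbouring or overlapping regions out of different splits, which a purely random split cannot guarantee.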


## Train

In [5]:
from enhancerai.tl.zoo import DeepTopicCNN
from enhancerai.tl.dataloaders import AnnDataModule
from enhancerai.tl.tasks import DeepTopic

# Chosen model architecture
architecture = DeepTopicCNN(num_classes=3, seq_len=2114)

# Datamodule, containing the train, validation and test dataloaders
datamodule = AnnDataModule(adata, batch_size=32, num_workers=4, in_memory=False)

# Task definition (losses, metrics, and how a training step is performed)
task = DeepTopic(lr=1e-3)

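The model summary printed during training lists `BCELoss` as the loss, which suggests that each topic is treated as an independent yes/no label. As a rough illustration of what such a loss computes, here is a plain-Python sketch of the mean binary cross-entropy for one region. The function and the example values are hypothetical, and the clamping with `eps` is a simplification rather than PyTorch's exact behaviour.

```python
import math


def binary_cross_entropy(probabilities, targets, eps=1e-7):
    """Mean binary cross-entropy over all topic outputs of one region."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probabilities)


# Hypothetical prediction for a region that belongs to the third topic only
loss = binary_cross_entropy([0.1, 0.2, 0.9], [0, 0, 1])
print(round(loss, 4))
```

Because every topic contributes its own term, a region can be scored as belonging to several topics at once, or to none.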


In [6]:
from enhancerai.tl import Trainer

# Define the Trainer object with run information
trainer = Trainer(
    max_epochs=5, project_name="test", logger_type="wandb", experiment_name="test"
)

trainer.setup(architecture, task, datamodule)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: lukas-mahieu (lukas-mahieu-vib). Use `wandb login --relogin` to force relogin

wandb: logging graph, to disable use `wandb.watch(log_graph=False)`


In [7]:
trainer.fit()

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 4070 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/home/luna.kuleuven.be/u0166574/miniconda3/envs/enhancerai/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:652: Checkpoint directory /home/luna.kuleuven.be/u0166574/Desktop/projects/EnhancerAI/docs/notebooks/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type             | Params
---------------------------------------------------
0 | loss          | BCELoss          | 0     
1 | train_me

Epoch 0: 100%|██████████| 572/572 [01:03<00:00,  9.01it/s, v_num=ewdq, train/loss_step=0.476, val/loss=0.507, train/loss_epoch=0.565]

Metric val/loss improved. New best score: 0.507
Epoch 0, global step 572: 'val/loss' reached 0.50726 (best 0.50726), saving model to '/home/luna.kuleuven.be/u0166574/Desktop/projects/EnhancerAI/docs/notebooks/checkpoints/best_model-v20.ckpt' as top 1


Epoch 1: 100%|██████████| 572/572 [01:04<00:00,  8.91it/s, v_num=ewdq, train/loss_step=0.567, val/loss=0.354, train/loss_epoch=0.511]

Metric val/loss improved by 0.153 >= min_delta = 0.0. New best score: 0.354
Epoch 1, global step 1144: 'val/loss' reached 0.35399 (best 0.35399), saving model to '/home/luna.kuleuven.be/u0166574/Desktop/projects/EnhancerAI/docs/notebooks/checkpoints/best_model-v20.ckpt' as top 1


Epoch 2: 100%|██████████| 572/572 [01:04<00:00,  8.89it/s, v_num=ewdq, train/loss_step=0.426, val/loss=0.427, train/loss_epoch=0.491]

Epoch 2, global step 1716: 'val/loss' was not in top 1


Epoch 3: 100%|██████████| 572/572 [01:04<00:00,  8.88it/s, v_num=ewdq, train/loss_step=0.411, val/loss=0.463, train/loss_epoch=0.462]

Epoch 3, global step 2288: 'val/loss' was not in top 1


Epoch 4: 100%|██████████| 572/572 [01:04<00:00,  8.88it/s, v_num=ewdq, train/loss_step=0.520, val/loss=0.461, train/loss_epoch=0.428]

Epoch 4, global step 2860: 'val/loss' was not in top 1
`Trainer.fit` stopped: `max_epochs=5` reached.
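The checkpoint messages in this log follow a simple rule: a model is saved only when `val/loss` drops below the best value seen so far. The sketch below replays that rule on the validation losses reported above. It is an illustration of the logic, not Lightning's `ModelCheckpoint` code.

```python
# val/loss after each epoch, copied from the training log above
val_losses = [0.507, 0.354, 0.427, 0.463, 0.461]

best_loss = float("inf")
best_epoch = None
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        # Corresponds to the "saving model ... as top 1" messages
        best_loss, best_epoch = loss, epoch

print(best_epoch, best_loss)  # 1 0.354
```

Here the best checkpoint comes from epoch 1. The training loss keeps falling afterwards while the validation loss does not, which may indicate overfitting on this small test dataset.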




In [8]:
trainer.test()

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


/home/luna.kuleuven.be/u0166574/miniconda3/envs/enhancerai/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing DataLoader 0: 100%|██████████| 97/97 [00:03<00:00, 26.05it/s]
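The batch counts in the progress bars can be checked against the split sizes printed earlier. This short calculation assumes that the final, smaller batch is kept rather than dropped, and that prediction runs over all 23186 regions.

```python
import math

batch_size = 32
split_sizes = {"train": 18289, "test": 3104, "all regions": 23186}

# The last batch may be smaller than batch_size, hence the ceiling
n_batches = {
    name: math.ceil(n_regions / batch_size)
    for name, n_regions in split_sizes.items()
}

for name, count in n_batches.items():
    print(name, count)
```

The results, 572 training batches and 97 test batches, match the progress bars, and 725 matches the prediction run in the next cell.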


In [9]:
import numpy as np

results = trainer.predict()

# Reshape list of tensors to a numpy array
results = np.vstack([x.cpu().numpy() for x in results])
results

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 725/725 [00:25<00:00, 28.79it/s]


array([[0.1067867 , 0.0046143 , 0.9341833 ],
       [0.02028829, 0.00234517, 0.98707086],
       [0.1434106 , 0.01306714, 0.910435  ],
       ...,
       [0.41960293, 0.1983275 , 0.5014036 ],
       [0.5516143 , 0.31703016, 0.14370385],
       [0.43084353, 0.36422807, 0.2906354 ]], dtype=float32)
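Each row holds one probability per topic, and the rows do not sum to 1, so the outputs are best read as independent per-topic scores. A minimal sketch of turning them into topic calls follows. The 0.5 threshold is an assumption, and the column order is assumed to follow the topic order in `adata.obs`.

```python
# Three rows copied from the prediction array above
predictions = [
    [0.1067867, 0.0046143, 0.9341833],
    [0.5516143, 0.31703016, 0.14370385],
    [0.43084353, 0.36422807, 0.2906354],
]

threshold = 0.5  # assumed cut-off; tune it on the validation set

calls = []
for row in predictions:
    # Multi-label call: every topic at or above the threshold counts as positive
    positive_topics = [i for i, p in enumerate(row) if p >= threshold]
    # Single best topic, regardless of the threshold
    top_topic = max(range(len(row)), key=row.__getitem__)
    calls.append((positive_topics, top_topic))
    print(positive_topics, top_topic)
```

The last row shows why the two views differ: no topic reaches the threshold, yet topic index 0 still has the highest score.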

## Sweep hyperparameters

In [None]:
sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "max_trials": 10,
    "parameters": {
        "architecture": {
            "architecture": "DeepTopicCNN",
            "num_classes": 3,
            "seq_len": 2114,
            "num_filters": {"values": [256, 512, 1024]},
            "kernel_size": {"values": [3, 5, 7]},
        },
        "datamodule": {
            "batch_size": {"values": [16, 32, 64]},
        },
        "task": {
            "lr": {"min": 1e-5, "max": 1e-3},
        },
    },
}

trainer = Trainer(
    max_epochs=5, project_name="test", logger_type="wandb", experiment_name="test"
)

trainer.sweep(architecture, datamodule, task, sweep_config)
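To illustrate what a `"random"` search over such a configuration amounts to, the sketch below draws configurations from a nested search space of the same shape. The helper `sample_parameters` is hypothetical. It does not reproduce how `trainer.sweep` or Weights & Biases sample internally, and uniform sampling of the learning rate is an assumption.

```python
import random


def sample_parameters(space, rng):
    """Draw one configuration from a nested search space."""
    sampled = {}
    for key, spec in space.items():
        if isinstance(spec, dict) and "values" in spec:
            # Categorical choice among the listed values
            sampled[key] = rng.choice(spec["values"])
        elif isinstance(spec, dict) and "min" in spec and "max" in spec:
            # Continuous value drawn uniformly from [min, max]
            sampled[key] = rng.uniform(spec["min"], spec["max"])
        elif isinstance(spec, dict):
            # Nested group of parameters
            sampled[key] = sample_parameters(spec, rng)
        else:
            # Fixed value, passed through unchanged
            sampled[key] = spec
    return sampled


space = {
    "architecture": {
        "num_classes": 3,
        "num_filters": {"values": [256, 512, 1024]},
        "kernel_size": {"values": [3, 5, 7]},
    },
    "datamodule": {"batch_size": {"values": [16, 32, 64]}},
    "task": {"lr": {"min": 1e-5, "max": 1e-3}},
}

rng = random.Random(0)  # fixed seed so the draws are reproducible
trials = [sample_parameters(space, rng) for _ in range(3)]
for trial in trials:
    print(trial)
```

Each trial would then be trained and compared on the chosen metric. For learning rates spanning several orders of magnitude, sampling on a log scale is a common alternative to the uniform draw used here.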